Hello Jiayuan.

On Mon, Dec 29, 2025 at 11:39:55AM +0800, Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
<snip>
> Users are forced to combine memory.high with io.max as a workaround,
> but this is:
> - The wrong abstraction level (memory policy shouldn't require IO tuning)
> - Hard to configure correctly across different storage devices
> - Unintuitive for users who only want memory control

I'd say the need for IO control is by design, not a workaround. When
you apply control to one type of resource, it may manifest as increased
consumption of another, like in communicating vessels. (Johannes may
explain it better.)

IIUC, the injection of the extra refault_penalty slows down the thrashing
task and in effect reduces the excessive IO.
Naïvely thinking, wouldn't it have the same effect if memory.high were
lowered (to start high throttling earlier)?

<snip>
> This happens because memory.high penalty is currently based solely on
> the overage amount, not the actual impact of that overage:
> 
> 1. A memcg over memory.high reclaiming cold/unused pages
>    → minimal system impact, light penalty is appropriate
> 
> 2. A memcg over memory.high with hot pages being continuously
>    reclaimed and refaulted → severe IO pressure, needs heavy penalty
> 
> Both cases receive identical penalties today.

(If you want to avoid IO control,) the latter case indicates the memcg's
memory.high is underprovisioned given its needs, so the solution would
be to increase memory.high (this sounds more natural than the
opposite conjecture above). In theory (don't quote me on that), it
should be visible in PSI since the latter case would accumulate more
stalls than the former, so the cases could be treated accordingly.
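
To illustrate the PSI idea, here is a minimal sketch (not a tested
tool) that compares the stall totals from two samples of a cgroup's
memory.pressure file. It assumes cgroup v2 with PSI enabled. The helper
names and the sample numbers are made up for illustration.

```python
def parse_pressure(text):
    """Parse PSI file contents (e.g. memory.pressure) into nested dicts."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()          # kind is "some" or "full"
        values = {}
        for field in fields:
            key, _, value = field.partition("=")
            # "total" is cumulative stall time in microseconds,
            # avg10/avg60/avg300 are percentages.
            values[key] = int(value) if key == "total" else float(value)
        result[kind] = values
    return result


def stall_delta_us(before, after, kind="some"):
    """Stall time (us) accumulated between two parsed samples."""
    return after[kind]["total"] - before[kind]["total"]


# Made-up samples; on a real system, read
# /sys/fs/cgroup/<group>/memory.pressure twice, some interval apart.
SAMPLE_BEFORE = (
    "some avg10=0.00 avg60=0.00 avg300=0.00 total=1000\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=400\n"
)
SAMPLE_AFTER = (
    "some avg10=12.50 avg60=3.10 avg300=0.70 total=6000\n"
    "full avg10=8.00 avg60=2.00 avg300=0.40 total=2500\n"
)

before = parse_pressure(SAMPLE_BEFORE)
after = parse_pressure(SAMPLE_AFTER)
print("some stall delta (us):", stall_delta_us(before, after, "some"))
print("full stall delta (us):", stall_delta_us(before, after, "full"))
```

A thrashing memcg would be expected to show a much larger delta over
the same interval than one that only sheds cold pages.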


> Solution
> --------
> Incorporate refault recency into the penalty calculation. If a refault
> occurred recently when memory.high is triggered, it indicates active
> thrashing and warrants additional throttling.

I find it a little inconsistent that IO induced by memory.high would have
this refault scaling, while IO induced by the in principle equivalent
memory.max could still grow without limit :-/

> 
> Why not use refault counters directly?
> - Refault statistics (WORKINGSET_REFAULT_*) are aggregated periodically,
>   not available in real-time for accurate delta calculation
> - Calling mem_cgroup_flush_stats() on every charge would be prohibitively
>   expensive in the hot path
> - Due to readahead, the same refault count can represent vastly different
>   IO loads, making counter-based estimation unreliable
> 
> The timestamp-based approach is:
> - O(1) cost: single timestamp read and comparison
> - Self-calibrating: penalty scales naturally with refault frequency

Can you explain whether this would work universally?
IIUC, you measure the frequency per memcg but the scaling is applied
per task, so I imagine there is a discrepancy for multi-task
(multi-process) workloads.

Regards,
Michal