Hello Jiayuan.

On Mon, Dec 29, 2025 at 11:39:55AM +0800, Jiayuan Chen wrote:
> Users are forced to combine memory.high with io.max as a workaround,
> but this is:
> - The wrong abstraction level (memory policy shouldn't require IO tuning)
> - Hard to configure correctly across different storage devices
> - Unintuitive for users who only want memory control

I'd say the need for IO control is as designed, not a workaround. When
you apply control on one type of resource, it may manifest as increased
consumption of another, like in communicating vessels. (Johannes may
explain it better.)

IIUC, the injection of extra refault_penalty slows down the thrashing
task and in effect reduces the excessive IO. Naïvely thinking, wouldn't
it have the same effect if memory.high was lowered (to start high
throttling earlier)?

> This happens because memory.high penalty is currently based solely on
> the overage amount, not the actual impact of that overage:
>
> 1. A memcg over memory.high reclaiming cold/unused pages
>    → minimal system impact, light penalty is appropriate
>
> 2. A memcg over memory.high with hot pages being continuously
>    reclaimed and refaulted → severe IO pressure, needs heavy penalty
>
> Both cases receive identical penalties today.

(If you want to avoid IO control,) the latter case indicates the memcg's
memory.high is underprovisioned given its needs, so the solution would
be to increase memory.high (this sounds more natural than the opposite
conjecture above).

In theory (don't quote me on that), it should be visible in PSI, since
the latter case would accumulate more stalls than the former, so the
cases could be treated accordingly.

> Solution
> --------
> Incorporate refault recency into the penalty calculation. If a refault
> occurred recently when memory.high is triggered, it indicates active
> thrashing and warrants additional throttling.

I find it a little inconsistent that IO induced by memory.high would
have this refault scaling, but IO induced by memory.max, which is equal
in principle, could still grow unlimited :-/

>
> Why not use refault counters directly?
> - Refault statistics (WORKINGSET_REFAULT_*) are aggregated periodically,
>   not available in real-time for accurate delta calculation
> - Calling mem_cgroup_flush_stats() on every charge would be prohibitively
>   expensive in the hot path
> - Due to readahead, the same refault count can represent vastly different
>   IO loads, making counter-based estimation unreliable
>
> The timestamp-based approach is:
> - O(1) cost: single timestamp read and comparison
> - Self-calibrating: penalty scales naturally with refault frequency

Can you explain whether this would work universally? IIUC, you measure
frequency per memcg but the scaling is applied per task, so I imagine
there is a discrepancy for multi-task (process) workloads.

Regards,
Michal