* Re: [PATCH v2] mm/memcg: scale memory.high penalty based on refault recency
[not found] <20251229033957.296257-1-jiayuan.chen@linux.dev>
@ 2025-12-29 10:42 ` Markus Elfring
2025-12-30 11:37 ` Michal Koutný
2026-01-05 17:08 ` Shakeel Butt
2 siblings, 0 replies; 4+ messages in thread
From: Markus Elfring @ 2025-12-29 10:42 UTC (permalink / raw)
To: Jiayuan Chen, linux-mm, cgroups
Cc: Jiayuan Chen, LKML, Andrew Morton, Axel Rasmussen,
David Hildenbrand, Johannes Weiner, Lorenzo Stoakes,
Michal Hocko, Muchun Song, Qi Zheng, Roman Gushchin,
Shakeel Butt, Wei Xu, Yuanchu Xie
…
> We observed an issue in production where a workload continuously
…
Please fix the typo in the summary phrase.
Regards,
Markus
* Re: [PATCH v2] mm/memcg: scale memory.high penalty based on refault recency
[not found] <20251229033957.296257-1-jiayuan.chen@linux.dev>
2025-12-29 10:42 ` [PATCH v2] mm/memcg: scale memory.high penalty based on refault recency Markus Elfring
@ 2025-12-30 11:37 ` Michal Koutný
2026-01-05 17:08 ` Shakeel Butt
2 siblings, 0 replies; 4+ messages in thread
From: Michal Koutný @ 2025-12-30 11:37 UTC (permalink / raw)
To: Jiayuan Chen
Cc: linux-mm, Jiayuan Chen, Johannes Weiner, Michal Hocko,
Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
David Hildenbrand, Qi Zheng, Lorenzo Stoakes, Axel Rasmussen,
Yuanchu Xie, Wei Xu, cgroups, linux-kernel
Hello Jiayuan.
On Mon, Dec 29, 2025 at 11:39:55AM +0800, Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
<snip>
> Users are forced to combine memory.high with io.max as a workaround,
> but this is:
> - The wrong abstraction level (memory policy shouldn't require IO tuning)
> - Hard to configure correctly across different storage devices
> - Unintuitive for users who only want memory control
I'd say the need for IO control is by design, not a workaround. When
you apply control on one type of resource, it may manifest as increased
consumption of another, like in communicating vessels. (Johannes may
explain it better.)
IIUC, the injection of the extra refault_penalty slows down the
thrashing task and in effect reduces the excessive IO.
Naïvely thinking, wouldn't it have the same effect if memory.high were
lowered (to start high throttling earlier)?
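For reference, the current delay calculation is roughly the following
(simplified from calculate_overage() and mem_cgroup_handle_over_high()
in mm/memcontrol.c; clamping and the walk over ancestors omitted):

	u64 overage, penalty_jiffies;

	/* Relative overage in fixed point, then a quadratic penalty:
	 * lenient for small overages, increasingly harsh for large ones.
	 * Lowering memory.high enlarges the overage for the same usage,
	 * so the throttle both starts earlier and bites harder.
	 */
	overage = div64_u64((u64)(usage - high) <<
			    MEM_CGROUP_DELAY_PRECISION_SHIFT, high);
	penalty_jiffies = overage * overage * HZ;
	penalty_jiffies >>= MEM_CGROUP_DELAY_PRECISION_SHIFT;
	penalty_jiffies >>= MEM_CGROUP_DELAY_SCALING_SHIFT;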
<snip>
> This happens because memory.high penalty is currently based solely on
> the overage amount, not the actual impact of that overage:
>
> 1. A memcg over memory.high reclaiming cold/unused pages
> → minimal system impact, light penalty is appropriate
>
> 2. A memcg over memory.high with hot pages being continuously
> reclaimed and refaulted → severe IO pressure, needs heavy penalty
>
> Both cases receive identical penalties today.
(If you want to avoid IO control,) the latter case indicates that the
memcg's memory.high is underprovisioned given its needs, so the solution
would be to increase memory.high (this sounds more natural than the
opposite conjecture above). In theory (don't quote me on that), it
should be visible in PSI, since the latter case would accumulate more
stalls than the former, so the two cases could be treated accordingly.
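For illustration (made-up numbers), the two cases might look like this
in the memcgs' memory.pressure files:

	# cold-reclaim memcg: reclaim succeeds, little stalling
	some avg10=0.15 avg60=0.08 avg300=0.02 total=1203411
	full avg10=0.00 avg60=0.00 avg300=0.00 total=88210

	# thrashing memcg: hot pages keep refaulting
	some avg10=42.31 avg60=38.02 avg300=21.77 total=915628844
	full avg10=27.54 avg60=23.11 avg300=12.40 total=407112933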
> Solution
> --------
> Incorporate refault recency into the penalty calculation. If a refault
> occurred recently when memory.high is triggered, it indicates active
> thrashing and warrants additional throttling.
I find it a little inconsistent that IO induced by memory.high would get
this refault scaling, while IO induced by the in-principle equivalent
memory.max could still grow unbounded :-/
>
> Why not use refault counters directly?
> - Refault statistics (WORKINGSET_REFAULT_*) are aggregated periodically,
> not available in real-time for accurate delta calculation
> - Calling mem_cgroup_flush_stats() on every charge would be prohibitively
> expensive in the hot path
> - Due to readahead, the same refault count can represent vastly different
> IO loads, making counter-based estimation unreliable
>
> The timestamp-based approach is:
> - O(1) cost: single timestamp read and comparison
> - Self-calibrating: penalty scales naturally with refault frequency
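For concreteness, I read the proposal as roughly the following shape
(my guessed names, not your actual patch):

	/* The memcg keeps a jiffies stamp of its most recent refault;
	 * the charge path turns the stamp's age into a multiplier for
	 * the existing penalty -- one read and one comparison, no
	 * stats flush.
	 */
	static unsigned long refault_penalty_scale(struct mem_cgroup *memcg)
	{
		unsigned long age = jiffies -
			READ_ONCE(memcg->last_refault_jiffies);

		if (age >= HZ)	/* no refault within the last second */
			return 1;
		/* ~4x for a refault just now, tapering to 1x at age HZ. */
		return 1 + (3 * (HZ - age)) / HZ;
	}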
Can you explain whether this would work universally?
IIUC, you measure the frequency per memcg but the scaling is applied
per task, so I imagine there is a discrepancy for multi-task
(multi-process) workloads.
Regards,
Michal
* Re: [PATCH v2] mm/memcg: scale memory.high penalty based on refault recency
[not found] <20251229033957.296257-1-jiayuan.chen@linux.dev>
2025-12-29 10:42 ` [PATCH v2] mm/memcg: scale memory.high penalty based on refault recency Markus Elfring
2025-12-30 11:37 ` Michal Koutný
@ 2026-01-05 17:08 ` Shakeel Butt
2026-01-06 3:14 ` Jiayuan Chen
2 siblings, 1 reply; 4+ messages in thread
From: Shakeel Butt @ 2026-01-05 17:08 UTC (permalink / raw)
To: Jiayuan Chen
Cc: linux-mm, Jiayuan Chen, Johannes Weiner, Michal Hocko,
Roman Gushchin, Muchun Song, Andrew Morton, David Hildenbrand,
Qi Zheng, Lorenzo Stoakes, Axel Rasmussen, Yuanchu Xie, Wei Xu,
cgroups, linux-kernel, Hui Zhu
+Hui Zhu
Hi Jiayuan,
On Mon, Dec 29, 2025 at 11:39:55AM +0800, Jiayuan Chen wrote:
> From: Jiayuan Chen <jiayuan.chen@shopee.com>
>
> Problem
> -------
> We observed an issue in production where a workload continuously
> triggering memory.high also generates massive disk IO READ, causing
> system-wide performance degradation.
>
> This happens because memory.high penalty is currently based solely on
> the overage amount, not the actual impact of that overage:
>
> 1. A memcg over memory.high reclaiming cold/unused pages
> → minimal system impact, light penalty is appropriate
>
> 2. A memcg over memory.high with hot pages being continuously
> reclaimed and refaulted → severe IO pressure, needs heavy penalty
>
> Both cases receive identical penalties today. Users are forced to
> combine memory.high with io.max as a workaround, but this is:
> - The wrong abstraction level (memory policy shouldn't require IO tuning)
> - Hard to configure correctly across different storage devices
> - Unintuitive for users who only want memory control
>
Thanks for raising and reporting this use-case. Overall I am supportive
of making memory.high more useful, but instead of adding more heuristics
in the kernel, I would prefer to make the enforcement of memory.high
more flexible with BPF.
At the moment, Hui Zhu is working on adding BPF support for memcg, but
it is very generic and I would prefer to start with a specific, real
use-case. I think your use-case is real and will be beneficial to many
other users. Can you please follow up on Hui's RFC to present your
use-case? I will also try to push the effort from the review side.
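Just to sketch the direction (the hook name, its arguments, and the
section below are made up for illustration; they are not Hui's actual
RFC interface), the policy you want could then live in a BPF program
along these lines:

	#include <vmlinux.h>
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_tracing.h>

	/* Hypothetical memcg hook: scale the kernel-computed memory.high
	 * penalty when the memcg refaulted within the last second, i.e.
	 * it is actively thrashing.
	 */
	SEC("struct_ops/memcg_high_penalty")
	u64 BPF_PROG(refault_scaled_penalty, u64 base_penalty_jiffies,
		     u64 ns_since_last_refault)
	{
		if (ns_since_last_refault < 1000000000ULL)
			return base_penalty_jiffies * 2;
		return base_penalty_jiffies;
	}

	char LICENSE[] SEC("license") = "GPL";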
thanks,
Shakeel
* Re: [PATCH v2] mm/memcg: scale memory.high penalty based on refault recency
2026-01-05 17:08 ` Shakeel Butt
@ 2026-01-06 3:14 ` Jiayuan Chen
0 siblings, 0 replies; 4+ messages in thread
From: Jiayuan Chen @ 2026-01-06 3:14 UTC (permalink / raw)
To: Shakeel Butt
Cc: linux-mm, Jiayuan Chen, Johannes Weiner, Michal Hocko,
Roman Gushchin, Muchun Song, Andrew Morton, David Hildenbrand,
Qi Zheng, Lorenzo Stoakes, Axel Rasmussen, Yuanchu Xie, Wei Xu,
cgroups, linux-kernel, Hui Zhu
January 6, 2026 at 01:08, "Shakeel Butt" <shakeel.butt@linux.dev> wrote:
>
> +Hui Zhu
>
> Hi Jiayuan,
>
> On Mon, Dec 29, 2025 at 11:39:55AM +0800, Jiayuan Chen wrote:
>
> >
> > From: Jiayuan Chen <jiayuan.chen@shopee.com>
> >
> > Problem
> > -------
> > We observed an issue in production where a workload continuously
> > triggering memory.high also generates massive disk IO READ, causing
> > system-wide performance degradation.
> >
> > This happens because memory.high penalty is currently based solely on
> > the overage amount, not the actual impact of that overage:
> >
> > 1. A memcg over memory.high reclaiming cold/unused pages
> > → minimal system impact, light penalty is appropriate
> >
> > 2. A memcg over memory.high with hot pages being continuously
> > reclaimed and refaulted → severe IO pressure, needs heavy penalty
> >
> > Both cases receive identical penalties today. Users are forced to
> > combine memory.high with io.max as a workaround, but this is:
> > - The wrong abstraction level (memory policy shouldn't require IO tuning)
> > - Hard to configure correctly across different storage devices
> > - Unintuitive for users who only want memory control
> >
> Thanks for raising and reporting this use-case. Overall I am supportive
> of making memory.high more useful, but instead of adding more heuristics
> in the kernel, I would prefer to make the enforcement of memory.high
> more flexible with BPF.
>
> At the moment, Hui Zhu is working on adding BPF support for memcg, but
> it is very generic and I would prefer to start with a specific, real
> use-case. I think your use-case is real and will be beneficial to many
> other users. Can you please follow up on Hui's RFC to present your
> use-case? I will also try to push the effort from the review side.
>
> thanks,
> Shakeel
>
Hi Shakeel,
Thanks for the feedback and for pointing me to Hui's RFC.
I noticed Michal has already forwarded my patch to that thread, and
Hui has responded. I'll wait to see how that discussion evolves and
whether there's an opportunity to integrate my use-case into his
BPF framework.
You're right that my timestamp-based approach is a heuristic. It was
designed as a simple, low-overhead approximation to detect active
thrashing without the cost of flushing refault counters on every
charge. But I agree that a more flexible BPF-based solution could
be cleaner in the long term.
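For context, the recording side is just a single store on the refault
path (illustrative names, not the exact patch):

	/* Called from the refault path (e.g. workingset_refault()):
	 * stamp the memcg so the charge path can check recency with a
	 * plain read -- no mem_cgroup_flush_stats() in the hot path.
	 */
	static inline void memcg_note_refault(struct mem_cgroup *memcg)
	{
		WRITE_ONCE(memcg->last_refault_jiffies, jiffies);
	}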
I'll follow up on Hui's thread once there's more progress.
Thanks,
Jiayuan