Re: [PATCH v2] mm/memcg: scale memory.high penalty based on refault recency


From: "Michal Koutný" <mkoutny@suse.com>
To: Jiayuan Chen <jiayuan.chen@linux.dev>
Cc: linux-mm@kvack.org, Jiayuan Chen <jiayuan.chen@shopee.com>,
	 Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>,
	 Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	 Muchun Song <muchun.song@linux.dev>,
	Andrew Morton <akpm@linux-foundation.org>,
	 David Hildenbrand <david@kernel.org>,
	Qi Zheng <zhengqi.arch@bytedance.com>,
	 Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	 Yuanchu Xie <yuanchu@google.com>, Wei Xu <weixugc@google.com>,
	cgroups@vger.kernel.org,  linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2] mm/memcg: scale memory.high penalty based on refault recency
Date: Tue, 30 Dec 2025 12:37:50 +0100
Message-ID: <4txrfjc5lqkmydmsesfq3l5drmzdio6pkmtfb64sk3ld6bwkhs@w4dkn76s4dbo>
In-Reply-To: <20251229033957.296257-1-jiayuan.chen@linux.dev>


Hello Jiayuan.

On Mon, Dec 29, 2025 at 11:39:55AM +0800, Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
<snip>
> Users are forced to combine memory.high with io.max as a workaround,
> but this is:
> - The wrong abstraction level (memory policy shouldn't require IO tuning)
> - Hard to configure correctly across different storage devices
> - Unintuitive for users who only want memory control

I'd say the need for IO control is as designed, not a workaround. When
you apply control to one type of resource, it may manifest as increased
consumption of another, like in communicating vessels. (Johannes may
explain it better.)

IIUC, the injection of the extra refault penalty slows down the thrashing
task and in effect reduces the excessive IO.
Naïvely thinking, wouldn't it have the same effect if memory.high were
lowered (to start high throttling earlier)?
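
The question can be illustrated with a toy model in which the throttling
delay grows with the square of the relative overage above memory.high. The
function name, the constant and the exact shape are assumptions made for
illustration; this is not the kernel's formula and not code from the patch.

```python
# Toy model only: the delay grows with the square of the relative overage.
# The scale constant and the exact shape are assumptions, not the kernel's
# actual calculation.

def toy_high_penalty(usage_pages: int, high_pages: int, scale: float = 100.0) -> float:
    """Return a hypothetical throttling delay (arbitrary units) for a memcg."""
    if usage_pages <= high_pages:
        return 0.0  # at or below memory.high: no throttling
    overage = (usage_pages - high_pages) / high_pages  # relative overage
    return scale * overage * overage


usage = 110_000  # pages currently charged (made-up number)
for high in (100_000, 90_000):
    # Same usage, lower memory.high: larger overage, hence a larger delay.
    print(high, toy_high_penalty(usage, high))
```

In this model, lowering memory.high for an unchanged usage increases the
overage and therefore the delay, without any refault-specific term.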

<snip>
> This happens because memory.high penalty is currently based solely on
> the overage amount, not the actual impact of that overage:
> 
> 1. A memcg over memory.high reclaiming cold/unused pages
>    → minimal system impact, light penalty is appropriate
> 
> 2. A memcg over memory.high with hot pages being continuously
>    reclaimed and refaulted → severe IO pressure, needs heavy penalty
> 
> Both cases receive identical penalties today.

(If you want to avoid IO control,) the latter case indicates the memcg's
memory.high is underprovisioned given its needs, so the solution would
be to increase memory.high (this sounds more natural than the
opposite conjecture above). In theory (don't quote me on that), it
should be visible in PSI since the latter case would accumulate more
stalls than the former, so the cases could be treated accordingly.
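
One way to check this would be to sample the cgroup v2 memory.pressure file
of the memcg in both cases and compare the "some" and "full" lines. A
minimal sketch; the cgroup directory passed to read_memory_pressure() is a
placeholder and the sample numbers below are made up:

```python
from pathlib import Path


def parse_psi(text: str) -> dict[str, dict[str, float]]:
    """Parse PSI text ("some avg10=... avg60=... avg300=... total=...")."""
    result: dict[str, dict[str, float]] = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        kind, pairs = fields[0], fields[1:]  # kind is "some" or "full"
        result[kind] = {k: float(v) for k, v in (p.split("=", 1) for p in pairs)}
    return result


def read_memory_pressure(cgroup_dir: str) -> dict[str, dict[str, float]]:
    """Read memory.pressure from a cgroup v2 directory (path is a placeholder)."""
    return parse_psi((Path(cgroup_dir) / "memory.pressure").read_text())


# Made-up sample in the PSI file format, so the parser can be tried offline.
sample = (
    "some avg10=1.50 avg60=0.80 avg300=0.20 total=123456\n"
    "full avg10=0.75 avg60=0.40 avg300=0.10 total=65432\n"
)
psi = parse_psi(sample)
```

Comparing the growth of the "total" values over the same interval for a
cold-page memcg and a thrashing memcg would show whether the two cases are
distinguishable this way.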


> Solution
> --------
> Incorporate refault recency into the penalty calculation. If a refault
> occurred recently when memory.high is triggered, it indicates active
> thrashing and warrants additional throttling.

I find it a little inconsistent that IO induced by memory.high would have
this refault scaling while IO induced by memory.max, which is equivalent
in principle, could still grow without limit :-/

> 
> Why not use refault counters directly?
> - Refault statistics (WORKINGSET_REFAULT_*) are aggregated periodically,
>   not available in real-time for accurate delta calculation
> - Calling mem_cgroup_flush_stats() on every charge would be prohibitively
>   expensive in the hot path
> - Due to readahead, the same refault count can represent vastly different
>   IO loads, making counter-based estimation unreliable
> 
> The timestamp-based approach is:
> - O(1) cost: single timestamp read and comparison
> - Self-calibrating: penalty scales naturally with refault frequency

Can you explain whether this would work universally?
IIUC, you measure the frequency per memcg but the scaling is applied
per task, so I imagine there is a discrepancy for multi-task
(multi-process) workloads.
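
The concern can be sketched with a toy simulation: one timestamp is shared
by the whole memcg, while the penalty lands on whichever task charges next.
All names, the window and the boost factor are hypothetical and are not
taken from the patch.

```python
from dataclasses import dataclass


@dataclass
class ToyMemcg:
    # One "last refault" timestamp shared by every task in the memcg.
    last_refault: float = float("-inf")

    def note_refault(self, now: float) -> None:
        self.last_refault = now


def toy_penalty(memcg: ToyMemcg, now: float, base: float = 1.0,
                window: float = 1.0, boost: float = 4.0) -> float:
    """Penalty for whichever task happens to charge at time `now`."""
    recent = (now - memcg.last_refault) <= window  # refault seen recently?
    return base * boost if recent else base


memcg = ToyMemcg()
memcg.note_refault(now=10.0)                      # task A refaults a hot page
penalty_b = toy_penalty(memcg, now=10.2)          # task B charges, never refaulted
penalty_idle = toy_penalty(ToyMemcg(), now=10.2)  # same charge, no refault in memcg
print(penalty_b, penalty_idle)
```

In this model task B receives the boosted penalty only because task A
refaulted in the same memcg shortly before.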

Regards,
Michal


