linux-mm.kvack.org archive mirror
From: Joshua Hahn <joshua.hahnjy@gmail.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Gregory Price <gourry@gourry.net>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Kaiyang Zhao <kaiyang2@cs.cmu.edu>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@kernel.org>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@kernel.org>,
	Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	Muchun Song <muchun.song@linux.dev>,
	Waiman Long <longman@redhat.com>,
	Chen Ridong <chenridong@huaweicloud.com>,
	Tejun Heo <tj@kernel.org>, Michal Koutny <mkoutny@suse.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Yuanchu Xie <yuanchu@google.com>, Wei Xu <weixugc@google.com>,
	Qi Zheng <zhengqi.arch@bytedance.com>,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@meta.com
Subject: Re: [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware
Date: Thu, 26 Feb 2026 08:08:40 -0800	[thread overview]
Message-ID: <20260226160840.1220006-1-joshua.hahnjy@gmail.com> (raw)
In-Reply-To: <aZ_-m7vSUPrzDj4n@tiehlicka>

On Thu, 26 Feb 2026 09:04:43 +0100 Michal Hocko <mhocko@suse.com> wrote:

> On Tue 24-02-26 08:13:56, Joshua Hahn wrote:
> > Hello Michal,
> > 
> > I hope that you are doing well! Thank you for taking the time to review my
> > work and leaving your thoughts.
> > 
> > I wanted to note that I hope to bring this discussion to LSFMMBPF as well,
> > to discuss what the scope of the project should be, what usecases there
> > are (as I will note below), how to make this scalable and sustainable
> > for the future, etc. I'll send out a topic proposal later today. I had
> > separated the series from the proposal because I imagined that this
> > series would go through many versions, so it would be helpful to have
> > the topic as a unified place for pre-conference discussions.
> 
> yes, this is a really good topic to bring to LSFMMBPF. I will not be
> attending this year unfortunately but I will keep watching progress on
> this. I am really sure there will be people in the room that can
> help with the discussion.

Hello Michal, thank you for the encouraging words :-)
Yes, I am sure that the audience will have valuable ideas to share
as well. Hopefully I can catch you at another conference!

And by the way, I've sent out the proposal here [1] if you are interested!

[...snip...]

> > > This assumes that the active workingset size of all workloads doesn't
> > > fit into the top tier right?
> > 
> > Yes, for the scenario above, a workload that is violating its fair share
> > of toptier memory mostly hurts other workloads if the aggregate working
> > set size of all workloads exceeds the size of toptier memory.
> 
> I think it would be good to provide some more insight into how this is
> supposed to work exactly. If the real working set size doesn't fit into
> the top tier then I suspect we can expect quite a lot of disruption by
> constant promotions and demotions, right. I guess what you would like to
> achieve is to stop those from happening right? If that is the case then
> how exactly do you envision configuring the workload? Do you cap
> each workload with max/high limits? Or do you want to rely on the
> low/min limits to protect workloads you care about? Or both? How does
> that play with the promotion side of things?

Yes, thrashing is probably the biggest concern for real-world performance
if this is deployed on an actual machine. I would add that thrashing is
arguably an even bigger problem without this setup as well.

Once again on multi-tenant hosts, if we have three hot cgroups whose
workingset size consumes all of DRAM, and one cgroup whose memory is
colder than the other three cgroups, then it will constantly face
thrashing as it has to compete with the other cgroups for hotness.

So the question is whether the thrashing happens to a well-behaving
victim cgroup, or if it happens to the ones whose workingset sizes are
too big.

I also have two qualifying points to add here:
First is that the effective toptier memory limit is not visible to
users. So when they are designing their workloads, specifically how
big the workingset size can be, they have no way to tune for it. So
cgroups that appear to be well-behaved, whose total footprint is
within their memory.high threshold, would still see reclaim activity.
Maybe the solution is as simple as exposing the toptier memory limit
as a new sysfs file? But I'm hoping that there is a more clever way to
do this that doesn't add more sysfs entries to the cgroup interface ;-)

Second is that there are scenarios where, on a relatively idle machine
with just one cgroup where memory.high, memory.max << toptier capacity,
we would still see reclaim activity. I would argue that this is not
so different from having a cgroup go into reclaim on an empty host,
even when there is memory available.

But I could also see the argument that those two scenarios are different.
What do you think?

[...snip...]

> > In the original cover letter I offered an example of VM hosting services
> > that care less about maximizing host-wide throughput, but more on ensuring
> > a bottomline performance guarantee for all workloads running on the system.
> > For the users on these services, they don't care that the host their VM is
> > running on is maximizing throughput; rather, they care that their VM meets
> > the performance guarantees that their provider promised. If there is no
> > way to know or enforce which tier of memory their workload lands on, either
> > the bottomline guarantee becomes very underestimated, or users must deal
> > with a high variance in performance.
> > 
> > Here's another example: Let's say there is a host with multiple workloads,
> > each serving queries for a database. The host would like to guarantee the
> > lowest maximum latency possible, while maximizing the total throughput
> > of the system. Once again in this situation, without tier-aware memcg
> > limits the host can maximize throughput, but can only make severely
> > underestimated promises on the bottom line.
> 
> Thanks, useful examples. And it would be really great to provide an
> example of intended configuration (no specific numbers but something to
> demonstrate the intention). Because this will not be just about limits,
> right. It would require more tweaks to the system - at least numa
> balancing (promotions) to be controlled in some way AFAICS.

Definitely. Two components that make sense here would be to throttle
promotions when toptier is facing cgroup-local pressure (reaching the
limit), and to also have some background balancing between the two nodes,
maybe by kswapd. I'll be sure to include some of these along with
performance numbers in the next version.
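To illustrate the first component, here is a userspace model (not kernel
code; the function name and the 5% headroom value are invented for this
sketch) of gating promotion on a cgroup's remaining toptier headroom:

```python
def should_promote(toptier_usage, toptier_limit, headroom_ratio=0.05):
    """Allow a promotion only while the cgroup keeps some headroom
    below its toptier limit; otherwise throttle, so promotions stop
    before they start ping-ponging pages across the limit."""
    return toptier_usage + toptier_limit * headroom_ratio < toptier_limit

print(should_promote(20, 100))   # ample headroom: True
print(should_promote(96, 100))   # within 5% of the limit: False
```

The headroom margin is what keeps promotion and limit enforcement from
oscillating right at the boundary.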

[...snip...]

> > Fixed mode is what we have here -- start limiting toptier usage whenever
> > a workload goes above its fair slice of toptier.
> > Opportunistic mode would allow workloads to use more toptier memory than
> > its fair share, but only be restricted when toptier is pressured.
> > 
> > What do you think about these two options? For the stated goal of this
> > series, which is to help maximize the bottom line for workloads, fair
> > share seemed to make sense. Implementing opportunistic mode changes
> > on top of this work would most likely just be another sysctl.
> 
> To me it sounds like the distinction between max/high vs. low/min
> reclaim.

Ack. Makes sense to me.

[...snip...]

> > > You seem to be focusing only on the top tier with this interface, right?
> > > Is this really the right way to go long term? What makes you believe that
> > > we do not really hit the same issue with other tiers as well?
> > 
> > Yes, that's right. I'm not sure if this is the right way to go long-term
> > (say, past the next 5 years). My thinking was that I can stick with doing
> > this for toptier vs. non-toptier memory for now, and deal with having
> > 3+ tiers in the future, when we start to have systems with that many tiers.
> > AFAICT two-tiered systems are still ~relatively new, and I don't think
> > there are a lot of genuine usecases for enforcing mid-tier memory limits
> > as of now. Of course, I would be excited to learn about these usecases
> > and work this patchset to support them as well if anybody has them.
> 
> I guess a more fundamental question is whether this need to replicate
> all limits for tiers or whether we can get an extension that would
> control tier behavior for existing ones. In other words can we define
> which proportion of the max/high resp. low/min limits are reserved for
> each tier? Is that feasible? I do not have answer to that myself at this
> stage TBH.

In terms of feasibility, I think the easiest would be to enforce limits
based on capacity, since this would let us get by without defining
per-tier per-cgroup limits. So for a 4-tier system with 200G total
capacity split 100G : 60G : 20G : 20G across the tiers, and a cgroup
with a 50G memory.high:

tier0.ratio: 100 / 200 = 0.5		tier0.high = 50G * 0.5 = 25G
tier1.ratio: 60 / 200  = 0.3		tier1.high = 50G * 0.3 = 15G
tier2.ratio: 20 / 200  = 0.1		tier2.high = 50G * 0.1 = 5G
tier3.ratio: 20 / 200  = 0.1		tier3.high = 50G * 0.1 = 5G

The alternative would be to have 4 sysctls here to set limits, which...
doesn't sound too fun ;-) And I'm not entirely sure we want per-tier
limits anyway. For most scenarios I think it should be enough to control
how much toptier usage to protect or limit.
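The capacity-proportional split above fits in a few lines (tier names
and the helper are illustrative only, not a proposed interface):

```python
GiB = 1 << 30

def split_high_by_capacity(memory_high, tier_capacities):
    """Split a cgroup's memory.high across tiers in proportion to
    each tier's share of total system capacity."""
    total = sum(tier_capacities.values())
    return {tier: memory_high * cap // total
            for tier, cap in tier_capacities.items()}

tiers = {"tier0": 100 * GiB, "tier1": 60 * GiB,
         "tier2": 20 * GiB, "tier3": 20 * GiB}
limits = split_high_by_capacity(50 * GiB, tiers)
# tier0: 25G, tier1: 15G, tier2: 5G, tier3: 5G
```

Since the ratios are derived from system capacity, the per-cgroup state
stays a single number (memory.high), which is the point of this approach.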

> [...]
> > > What is the reasoning for the switch to be runtime sysctl rather than
> > > boot-time or cgroup mount option?
> > 
> > Good point :-) I don't think cgroup mount options are a good idea,
> > since this would mean that we can have a set of cgroups self-policing
> > their toptier usage, while another cgroup allocates memory unrestricted.
> > This would punish the self-policing cgroup and we would lose the benefit
> > of having a bottomline performance guarantee.
> 
> I do not follow. cgroup mount option would apply to all cgroups. In
> that sense whatever is achievable by sysctl should apply to kernel
> cmdline or mount option. The question is what is the best fit AFAICS.

Yup, you're right. I mixed it up in my head and got confused; in terms
of functionality I think kernel cmdline and mount option are the same.

Actually everything except the runtime toggle makes sense, since a
runtime toggle requires the system to do the additional per-tier
accounting even when the feature is disabled. With a kernel cmdline
option we can tell the system to completely ignore the per-tier
accounting and enforcement, and the user sees no effects at all
(except, well, the additional cacheline in struct page_counter?)

Anyways, thank you very much for your thoughts and encouraging words.
I hope you have a great day, Michal!
Joshua

[1] https://lore.kernel.org/all/CAN+CAwNwpjRf9QhgAEhBQZD7r7sXCzLXqAKbNrPeMEq=7bX8Jg@mail.gmail.com/


Thread overview: 13+ messages
2026-02-23 22:38 Joshua Hahn
2026-02-23 22:38 ` [RFC PATCH 1/6] mm/memory-tiers: Introduce tier-aware memcg limit sysfs Joshua Hahn
2026-02-23 22:38 ` [RFC PATCH 2/6] mm/page_counter: Introduce tiered memory awareness to page_counter Joshua Hahn
2026-02-23 22:38 ` [RFC PATCH 3/6] mm/memory-tiers, memcontrol: Introduce toptier capacity updates Joshua Hahn
2026-02-23 22:38 ` [RFC PATCH 4/6] mm/memcontrol: Charge and uncharge from toptier Joshua Hahn
2026-02-23 22:38 ` [RFC PATCH 5/6] mm/memcontrol, page_counter: Make memory.low tier-aware Joshua Hahn
2026-02-23 22:38 ` [RFC PATCH 6/6] mm/memcontrol: Make memory.high tier-aware Joshua Hahn
2026-02-24 11:27 ` [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware Michal Hocko
2026-02-24 16:13   ` Joshua Hahn
2026-02-24 18:49     ` Gregory Price
2026-02-24 20:03       ` Kaiyang Zhao
2026-02-26  8:04     ` Michal Hocko
2026-02-26 16:08       ` Joshua Hahn [this message]
