From: Michal Hocko <mhocko@suse.com>
To: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Gregory Price <gourry@gourry.net>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Kaiyang Zhao <kaiyang2@cs.cmu.edu>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@kernel.org>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@kernel.org>,
	Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	Muchun Song <muchun.song@linux.dev>,
	Waiman Long <longman@redhat.com>,
	Chen Ridong <chenridong@huaweicloud.com>,
	Tejun Heo <tj@kernel.org>, Michal Koutny <mkoutny@suse.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Yuanchu Xie <yuanchu@google.com>, Wei Xu <weixugc@google.com>,
	Qi Zheng <zhengqi.arch@bytedance.com>,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@meta.com
Subject: Re: [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware
Date: Thu, 26 Feb 2026 09:04:43 +0100
Message-ID: <aZ_-m7vSUPrzDj4n@tiehlicka>
In-Reply-To: <20260224161357.2622501-1-joshua.hahnjy@gmail.com>

On Tue 24-02-26 08:13:56, Joshua Hahn wrote:
> Hello Michal,
> 
> I hope that you are doing well! Thank you for taking the time to review my
> work and leaving your thoughts.
> 
> I wanted to note that I hope to bring this discussion to LSFMMBPF as well,
> to discuss what the scope of the project should be, what usecases there
> are (as I will note below), how to make this scalable and sustainable
> for the future, etc. I'll send out a topic proposal later today. I had
> separated the series from the proposal because I imagined that this
> series would go through many versions, so it would be helpful to have
> the topic as a unified place for pre-conference discussions.

Yes, this is a really good topic to bring to LSFMMBPF. I will not be
attending this year unfortunately, but I will keep watching progress on
this. I am sure there will be people in the room who can
help with the discussion.

> > > Memory cgroups provide an interface that allows multiple workloads on a
> > > host to co-exist, and establish both weak and strong memory isolation
> > > guarantees. For large servers and small embedded systems alike, memcgs
> > > provide an effective way to provide a baseline quality of service for
> > > protected workloads.
> > > 
> > > This works, because for the most part, all memory is equal (except for
> > > zram / zswap). Restricting a cgroup's memory footprint restricts how
> > > much it can hurt other workloads competing for memory. Likewise, setting
> > > memory.low or memory.min limits can provide weak and strong guarantees
> > > to the performance of a cgroup.
> > > 
> > > However, on systems with tiered memory (e.g. CXL / compressed memory),
> > > the quality of service guarantees that memcg limits enforce become less
> > > effective, as memcg has no awareness of the physical location of its
> > > charged memory. In other words, a workload that is well-behaved within
> > > its memcg limits may still be hurting the performance of other
> > > well-behaving workloads on the system by hogging more than its
> > > "fair share" of toptier memory.
> 
> I will split up your questions to answer them individually:
> 
> > This assumes that the active workingset size of all workloads doesn't
> > fit into the top tier right?
> 
> Yes, for the scenario above, a workload that is violating its fair share
> of toptier memory mostly hurts other workloads if the aggregate working
> set size of all workloads exceeds the size of toptier memory.

I think it would be good to provide some more insight into how this is
supposed to work exactly. If the real working set size doesn't fit into
the top tier then I suspect we can expect quite a lot of disruption from
constant promotions and demotions, right? I guess what you would like to
achieve is to stop those from happening, right? If that is the case then
how exactly do you envision configuring the workloads? Do you cap
each workload with max/high limits? Or do you want to rely on the
low/min limits to protect workloads you care about? Or both? How does
that play with the promotion side of things?

> > Otherwise promotions would make sure that we have the most active
> > memory in the top tier.
> 
> This is true. And for a lot of usecases, this is 100% the right thing to do.
> However, with this patch I want to encourage a different perspective,
> which is to think about things in a per-workload perspective, and not a
> per-system perspective.
> 
> Having hot memory in high tiers and cold memory in low tiers is only
> logical, since we increase the system's throughput and make the most
> optimal choices for latency. However, what about systems that care about
> objectives other than simply maximizing throughput?
> 
> In the original cover letter I offered an example of VM hosting services
> that care less about maximizing host-wide throughput, but more on ensuring
> a bottomline performance guarantee for all workloads running on the system.
> For the users on these services, they don't care that the host their VM is
> running on is maximizing throughput; rather, they care that their VM meets
> the performance guarantees that their provider promised. If there is no
> way to know or enforce which tier of memory their workload lands on, either
> the bottomline guarantee becomes very underestimated, or users must deal
> with a high variance in performance.
> 
> Here's another example: Let's say there is a host with multiple workloads,
> each serving queries for a database. The host would like to guarantee the
> lowest maximum latency possible, while maximizing the total throughput
> of the system. Once again in this situation, without tier-aware memcg
> limits the host can maximize throughput, but can only make severely
> underestimated promises on the bottom line.

Thanks, these are useful examples. It would be really great to also
provide an example of the intended configuration (no specific numbers but
something to demonstrate the intention), because this will not be just
about limits, right? It would require more tweaks to the system - at
least NUMA balancing (promotions) would have to be controlled in some
way AFAICS.

> > Is this typical in real life configurations?
> 
> I would say so. I think that the two examples above are realistic
> scenarios that cloud providers and hyperscalers might face on tiered systems.
> 
> > Or do you intend to limit memory consumption on particular tier even
> > without an external pressure?
> 
> This is a great question, and one that I hope to discuss at LSFMMBPF
> to see how people expect an interface like this to work.
> 
> Over the past few weeks, I have been discussing this idea during the
> Linux Memory Hotness and Promotion biweekly calls with Gregory Price [1].
> One of the proposals that we made there (but did not include in this
> series) is the idea of "fixed" vs. "opportunistic" reclaim.
> 
> Fixed mode is what we have here -- start limiting toptier usage whenever
> a workload goes above its fair slice of toptier.
> Opportunistic mode would allow workloads to use more toptier memory than
> its fair share, but only be restricted when toptier is pressured.
> 
> What do you think about these two options? For the stated goal of this
> series, which is to help maximize the bottom line for workloads, fair
> share seemed to make sense. Implementing opportunistic mode changes
> on top of this work would most likely just be another sysctl.

To me it sounds like the distinction between max/high vs. low/min
reclaim.
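
A minimal sketch of the two modes described above, written as plain
Python rather than kernel code. Everything in it is an assumption made
for illustration: the Memcg fields, the mode names as strings, and the
idea that "pressure" is a single boolean are not taken from the series.
It only shows why fixed mode resembles max/high (enforced whenever the
cgroup is over its share) while opportunistic mode resembles low/min
(relevant only once the top tier is under pressure).

```python
from dataclasses import dataclass


@dataclass
class Memcg:
    """Hypothetical per-cgroup view of top-tier usage (illustrative only)."""
    name: str
    toptier_usage: int       # pages currently charged on top-tier nodes
    toptier_fair_share: int  # assumed per-cgroup top-tier allowance


def should_limit_toptier(cg: Memcg, mode: str, toptier_pressured: bool) -> bool:
    """Decide whether a cgroup's top-tier usage should be restricted.

    "fixed":         restrict as soon as the cgroup exceeds its share.
    "opportunistic": restrict only if it exceeds its share AND the
                     top tier as a whole is under pressure.
    """
    over_share = cg.toptier_usage > cg.toptier_fair_share
    if mode == "fixed":
        return over_share
    if mode == "opportunistic":
        return over_share and toptier_pressured
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    cg = Memcg("db", toptier_usage=600, toptier_fair_share=500)
    for mode in ("fixed", "opportunistic"):
        for pressured in (False, True):
            print(mode, pressured, should_limit_toptier(cg, mode, pressured))
```

In this sketch a cgroup that stays within its share is never restricted
in either mode; the modes differ only for a cgroup that is over its
share while the top tier still has free capacity.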

[...]
> > You seem to be focusing only on the top tier with this interface, right?
> > Is this really the right way to go long term? What makes you believe that
> > we do not really hit the same issue with other tiers as well?
> 
> Yes, that's right. I'm not sure if this is the right way to go long-term
> (say, past the next 5 years). My thinking was that I can stick with doing
> this for toptier vs. non-toptier memory for now, and deal with having
> 3+ tiers in the future, when we start to have systems with that many tiers.
> AFAICT two-tiered systems are still ~relatively new, and I don't think
> there are a lot of genuine usecases for enforcing mid-tier memory limits
> as of now. Of course, I would be excited to learn about these usecases
> and work this patchset to support them as well if anybody has them.

I guess a more fundamental question is whether this needs to replicate
all limits for tiers or whether we can get an extension that would
control tier behavior for the existing ones. In other words, can we
define which proportion of the max/high resp. low/min limits is reserved
for each tier? Is that feasible? I do not have an answer to that myself
at this stage TBH.
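
To make the question concrete, here is a rough sketch of what "a
proportion of an existing limit per tier" could mean arithmetically.
This is not an interface proposal and nothing here comes from the
series: the function name, the tier names and the use of integer
percentages are all hypothetical. It takes one existing limit (say, the
value of memory.high in bytes) and derives a per-tier portion from it.

```python
def split_limit(limit: int, tier_percent: dict[str, int]) -> dict[str, int]:
    """Split one memcg limit into per-tier portions.

    limit:        an existing limit value, e.g. memory.high in bytes
    tier_percent: hypothetical mapping of tier name -> integer percentage;
                  the percentages must add up to 100

    Integer arithmetic is used throughout, and the rounding remainder is
    given to the last tier so the portions always sum to the limit.
    """
    if sum(tier_percent.values()) != 100:
        raise ValueError("tier percentages must add up to 100")

    tiers = list(tier_percent)
    portions = {}
    remaining = limit
    for tier in tiers[:-1]:
        portions[tier] = limit * tier_percent[tier] // 100
        remaining -= portions[tier]
    portions[tiers[-1]] = remaining
    return portions


if __name__ == "__main__":
    # Hypothetical: a quarter of the limit may live in the top tier.
    print(split_limit(1001, {"toptier": 25, "lowtier": 75}))
```

Whether such a proportion would be set per cgroup, derived from the
ratio of tier capacities, or expressed in some other way is exactly the
open question; the sketch only shows that a single set of limits plus
a ratio is enough to derive per-tier values.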

[...]
> > What is the reasoning for the switch to be runtime sysctl rather than
> > boot-time or cgroup mount option?
> 
> > Good point :-) I don't think cgroup mount options are a good idea,
> since this would mean that we can have a set of cgroups self-policing
> their toptier usage, while another cgroup allocates memory unrestricted.
> This would punish the self-policing cgroup and we would lose the benefit
> of having a bottomline performance guarantee.

I do not follow. A cgroup mount option would apply to all cgroups. In
that sense, whatever is achievable by a sysctl is also achievable by a
kernel cmdline or mount option. The question is what is the best fit
AFAICS.
 
> > I will likely have more questions but these are immediate ones after
> > reading the cover. Please note I haven't really looked at the
> > implementation yet. I really want to understand usecases and interface
> > first.
> 
> That sounds good to me, thank you again for reviewing this work!
> I hope you have a great day :-)
> Joshua
> 
> [1] https://lore.kernel.org/linux-mm/c8bc2dce-d4ec-c16e-8df4-2624c48cfc06@google.com/

-- 
Michal Hocko
SUSE Labs


