From: Joshua Hahn <joshua.hahnjy@gmail.com>
To: lsf-pc@lists.linux-foundation.org
Cc: Gregory Price <gourry@gourry.net>,
Johannes Weiner <hannes@cmpxchg.org>,
Shakeel Butt <shakeel.butt@linux.dev>,
Roman Gushchin <roman.gushchin@linux.dev>,
Muchun Song <muchun.song@linux.dev>,
Michal Hocko <mhocko@suse.com>, linux-mm <linux-mm@kvack.org>,
kernel-team@meta.com
Subject: [LSF/MM/BPF TOPIC] Making memcg limits tier-aware
Date: Wed, 25 Feb 2026 10:44:21 -0500
Message-ID: <CAN+CAwNwpjRf9QhgAEhBQZD7r7sXCzLXqAKbNrPeMEq=7bX8Jg@mail.gmail.com>
Preface
=======
I've sent out an RFC PATCH for this topic, which is available here [1].
The patch is kept separate from this topic thread so that there can be
a unified discussion thread even as the RFC moves forward in versions.
Introduction
============
Memory cgroups provide an interface that allows multiple workloads on a
host to co-exist, and establishes both weak and strong memory isolation
guarantees. For large servers and small embedded systems alike, memcgs
are an effective way to provide a baseline quality of service for
protected workloads.
This works because, for the most part, all memory is equal (except for
zram / zswap). Restricting a cgroup's memory footprint restricts how
much it can hurt other workloads competing for memory. Likewise, setting
memory.low or memory.min limits can provide weak and strong guarantees,
respectively, for a cgroup's performance.
However, on systems with tiered memory (e.g. CXL / compressed memory),
the quality-of-service guarantees that memcg limits enforce become less
effective, because memcg has no awareness of the physical location of
its charged memory. In other words, a workload that is well-behaved
within its memcg limits may still hurt the performance of other
well-behaved workloads on the system by hogging more than its
"fair share" of toptier memory.
Usecases
========
In [2], I list two real-life scenarios that can benefit:
VM hosting services must ensure fairness of hostwide resources and
guarantee a baseline performance. These machines benefit from maximizing
their baseline performance, rather than maximizing system throughput.
Hosts running isolated workloads with a guaranteed maximum tail latency
are in a similar situation. They want each workload to process its
work (e.g. a query) within a fixed time window, while also maximizing
the system's throughput.
In [3], Gregory Price notes a third usecase: hyperscalers deploying
hosts that run mixed workloads with different owners must also ensure
fairness across the workloads, so as not to reward memory-aggressive
workloads while punishing the less aggressive workloads by pushing
them out to lowtier memory.
Mechanism
=========
Memcg limits are made tier-aware by scaling effective memory.low/high
values to reflect the ratio of toptier:total memory available to the
cgroup. For instance, on a host where 75% of memory is toptier, a
cgroup’s effective memory.high is scaled to 75% of its value and
enforced at the toptier.
toptier_ratio = toptier_cap / total_cap
memory.toptier_{low, high} = memory.{low, high} * toptier_ratio
As an explicit example:
On a host with 3:1 toptier:lowtier, say 150G toptier, and 50G lowtier,
setting a cgroup's limits to:
memory.min: 15G
memory.low: 20G
memory.high: 40G
memory.max: 50G
will be enforced at the toptier as:
memory.min: 15G
memory.toptier_low: 15G (20 * 150/200)
memory.toptier_high: 30G (40 * 150/200)
memory.max: 50G
This prevents the (previously possible) scenario where three 50G
containers on the host above hog all of toptier, while a fourth
container is pushed out entirely to lowtier.
Topics for Discussion
=====================
1. In this implementation, we restrict a cgroup's ability to use more
than its fair share of toptier memory, even when there is no
competition. This is consistent with natural memcg limits, which do
not let memory.high/max go unenforced just because the host has free
memory. However, tier-aware and tier-agnostic memcg limits arguably
serve different purposes.
Concrete usecases for allowing a cgroup to use more toptier memory
than its fair share (while staying within its memcg limits) include
systems that keep their hosts underutilized, workloads with low
baseline memory usage and transient spikes, and hosts whose total
workingset size never exceeds the size of toptier.
The desired effect can be achieved through a protection-based system
that relies on only memory.low to protect workloads, instead of
punishing overconsumers. Whether a purely protection-based system can
adequately protect its workloads is an open question, however.
Should this difference be encapsulated in different “modes” for the
user? Or, are existing mechanisms enough to support these usecases?
(More context can be found in the Jan. 29 Linux Memory Hotness and
Promotion meeting notes [4])
2. In this implementation, we extend the limits to memory.low/high.
Are there usecases that may necessitate extending the limits to
memory.min/max as well?
3. Are there usecases (and hardware) for systems with 3+ tiers that
need per-tier enforcement, not just toptier enforcement?
4. Are there usecases for users to set their own toptier limits, instead
of relying simply on a tier-proportional limit?
5. Are there usecases for individual cgroups opting in, as opposed to
enforcing this toggle on a system-wide level? What would it mean for
a cgroup to be unrestricted in its toptier usage, while other
cgroups are punished?
[1] https://lore.kernel.org/linux-mm/20260223223830.586018-1-joshua.hahnjy@gmail.com/
[2] https://lore.kernel.org/all/20260224161357.2622501-1-joshua.hahnjy@gmail.com/
[3] https://lore.kernel.org/all/aZ3ysV-k1UisnPRG@gourry-fedora-PF4VCD3F/
[4] https://lore.kernel.org/linux-mm/c8bc2dce-d4ec-c16e-8df4-2624c48cfc06@google.com/