linux-mm.kvack.org archive mirror
* [LSF/MM/BPF TOPIC] Making memcg limits tier-aware
@ 2026-02-25 15:44 Joshua Hahn
From: Joshua Hahn @ 2026-02-25 15:44 UTC (permalink / raw)
  To: lsf-pc
  Cc: Gregory Price, Johannes Weiner, Shakeel Butt, Roman Gushchin,
	Muchun Song, Michal Hocko, linux-mm, kernel-team

Preface
=======
I’ve sent out an RFC PATCH for this topic, which is available here [1].
The goal with separating the patch and the topic thread is so that
there can be a unified discussion thread even if the RFC moves
forwards in versions.

Introduction
============
Memory cgroups provide an interface that allows multiple workloads on a
host to co-exist, and establishes both weak and strong memory isolation
guarantees. For large servers and small embedded systems alike, memcgs
offer an effective way to provide a baseline quality of service for
protected workloads.

This works because, for the most part, all memory is equal (except for
zram / zswap). Restricting a cgroup's memory footprint restricts how
much it can hurt other workloads competing for memory. Likewise, setting
memory.low or memory.min can provide weak and strong guarantees,
respectively, for the performance of a cgroup.

However, on systems with tiered memory (e.g. CXL / compressed memory),
the quality-of-service guarantees that memcg limits enforce become less
effective, as memcg has no awareness of the physical location of its
charged memory. In other words, a workload that is well-behaved within
its memcg limits may still hurt the performance of other well-behaved
workloads on the system by hogging more than its "fair share" of
toptier memory.

Usecases
========
In [2], I list two real-life scenarios that can benefit from tier-aware
limits:

VM hosting services must ensure fairness of hostwide resources and
guarantee a baseline performance. These machines benefit more from
maximizing that baseline than from maximizing system throughput.

Hosts running isolated workloads with a guaranteed maximum tail latency
are also in a similar situation. They want each workload to process its
work (e.g. a query) in a fixed time window, and they would like to
maximize the system’s throughput at the same time.

In [3], Gregory Price notes a third usecase: hyperscalers deploying
hosts that run mixed workloads with different owners must also ensure
fairness across the workloads, so as not to reward memory-aggressive
workloads while punishing less aggressive ones by pushing them out to
lowtier memory.

Mechanism
=========
Memcg limits are made tier-aware by scaling effective memory.low/high
values to reflect the ratio of toptier:total memory available to the
cgroup. For instance, on a host where 75% of memory is toptier, a
cgroup’s effective memory.high is scaled to 75% of its value and
enforced at the toptier.

toptier_ratio = toptier_cap / total_cap
memory.toptier_{low, high} = memory.{low, high} * toptier_ratio

As an explicit example:
On a host with 3:1 toptier:lowtier, say 150G toptier, and 50G lowtier,
setting a cgroup's limits to:
memory.min:  15G
memory.low:  20G
memory.high: 40G
memory.max:  50G

will be enforced at the toptier as:
memory.min:          15G
memory.toptier_low:  15G (20 * 150/200)
memory.toptier_high: 30G (40 * 150/200)
memory.max:          50G

This prevents the (previously possible) scenario where 3x50G containers
on the host above fill all 150G of toptier, while a fourth 50G container
is pushed out entirely to lowtier.
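The scaling above can be sketched numerically. This is a hypothetical
illustration of the arithmetic only, not the RFC's actual interface;
the function and variable names below are made up:

```python
# Hypothetical sketch of the tier-proportional scaling described above.
# The names (toptier_limit, toptier_cap, total_cap) follow the
# pseudo-formula in the Mechanism section and are illustrative only.

def toptier_limit(limit, toptier_cap, total_cap):
    """Scale a memcg limit by the host's toptier:total capacity ratio."""
    return limit * toptier_cap // total_cap

G = 1 << 30                                # 1 GiB
toptier_cap, total_cap = 150 * G, 200 * G  # 150G toptier, 50G lowtier host

# Reproduces the worked example: memory.low 20G -> 15G, memory.high 40G -> 30G
print(toptier_limit(20 * G, toptier_cap, total_cap) // G)  # 15
print(toptier_limit(40 * G, toptier_cap, total_cap) // G)  # 30
```

Integer (floor) division stands in for whatever rounding the kernel
would actually use; the choice is immaterial for these round numbers.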

Topics for Discussion
=====================
1. In this implementation, we restrict a cgroup's ability to use more
   than its fair share of toptier memory, even when there is no
   competition. This mirrors existing memcg behavior, which doesn't
   leave memory.high/max unenforced just because the host has free
   memory. However, tier-aware and tier-agnostic memcg limits arguably
   serve different purposes.

   Concrete usecases for allowing a cgroup to use more toptier memory
   than its fair share (while staying within its memcg limits) include
   systems that keep their hosts underutilized, workloads with low
   baseline memory usage with transient spikes in usage, and hosts whose
   total workingset size never exceeds the size of toptier.

   The desired effect can be achieved through a protection-based system
   that relies on only memory.low to protect workloads, instead of
   punishing overconsumers. Whether a purely protection-based system can
   adequately protect its workloads is an open question, however.

   Should this difference be encapsulated in different “modes” for the
   user? Or, are existing mechanisms enough to support these usecases?
   (More context can be found in the Jan. 29 Linux Memory Hotness and
   Promotion meeting notes [4])

2. In this implementation, we extend the limits to memory.low/high.
   Are there usecases that may necessitate extending the limits to
   memory.min/max as well?

3. Are there usecases (and hardware) for systems with 3+ tiers, that
   need per-tier enforcement, not just toptier enforcement?

4. Are there usecases for users to set their own toptier limits, instead
   of relying simply on a tier-proportional limit?

5. Are there usecases for individual cgroups opting in, as opposed to
   enforcing this toggle on a system-wide level? What would it mean for
   a cgroup to be unrestricted in its toptier usage, while other
   cgroups are punished?


[1] https://lore.kernel.org/linux-mm/20260223223830.586018-1-joshua.hahnjy@gmail.com/
[2] https://lore.kernel.org/all/20260224161357.2622501-1-joshua.hahnjy@gmail.com/
[3] https://lore.kernel.org/all/aZ3ysV-k1UisnPRG@gourry-fedora-PF4VCD3F/
[4] https://lore.kernel.org/linux-mm/c8bc2dce-d4ec-c16e-8df4-2624c48cfc06@google.com/

