linux-mm.kvack.org archive mirror
* [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware
@ 2026-02-23 22:38 Joshua Hahn
  2026-02-23 22:38 ` [RFC PATCH 1/6] mm/memory-tiers: Introduce tier-aware memcg limit sysfs Joshua Hahn
                   ` (6 more replies)
  0 siblings, 7 replies; 11+ messages in thread
From: Joshua Hahn @ 2026-02-23 22:38 UTC (permalink / raw)
  Cc: Gregory Price, Johannes Weiner, Kaiyang Zhao, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Waiman Long,
	Chen Ridong, Tejun Heo, Michal Koutny, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Qi Zheng, linux-mm, cgroups, linux-kernel,
	kernel-team

Memory cgroups provide an interface that allows multiple workloads on a
host to co-exist, and establish both weak and strong memory isolation
guarantees. For large servers and small embedded systems alike, memcgs
are an effective way to provide a baseline quality of service for
protected workloads.

This works because, for the most part, all memory is equal (except for
zram / zswap). Restricting a cgroup's memory footprint restricts how
much it can hurt other workloads competing for memory. Likewise, setting
memory.low or memory.min limits can provide weak and strong guarantees
for the performance of a cgroup.

However, on systems with tiered memory (e.g. CXL / compressed memory),
the quality-of-service guarantees that memcg limits enforce become less
effective, as memcg has no awareness of the physical location of its
charged memory. In other words, a workload that is well-behaved within
its memcg limits may still hurt the performance of other well-behaved
workloads on the system by hogging more than its "fair share" of
toptier memory.

Introduce tier-aware memcg limits, which scale memory.low/high to
reflect the ratio of toptier:total memory the cgroup has access to.

Take the following scenario as an example:
On a host with a 3:1 toptier:lowtier ratio, say 150G toptier and 50G
lowtier, setting a cgroup's limits to:
	memory.min:  15G
	memory.low:  20G
	memory.high: 40G
	memory.max:  50G

will be enforced at the toptier as:
	memory.min:          15G
	memory.toptier_low:  15G (20 * 150/200)
	memory.toptier_high: 30G (40 * 150/200)
	memory.max:          50G

Let's say that there are 4 such cgroups on the host. Previously, it would
be possible for 3 cgroups to completely take over all of DRAM, while one
cgroup could only access lowtier memory. From the perspective of
tier-agnostic memcg limit enforcement, the three cgroups are all
well-behaved, consuming within their memory limits.

This is not to say that the scenario above is incorrect. In fact,
letting the hottest cgroups run in DRAM while pushing colder cgroups out
to lowtier memory lets the system perform the most aggregate work.

But in other scenarios, the target might not be maximizing aggregate
work, but maximizing the minimum performance guarantee for each
individual workload (think hosts shared across different users, such as
VM hosting services).

To support both scenarios, introduce a sysfs knob, tier_aware_memcg,
which allows the host to toggle between enforcing and overlooking
toptier memcg limit breaches.

This work is inspired by and based on Kaiyang Zhao's work from 2024 [1],
where he referred to this concept as "memory tiering fairness".
The biggest difference between the implementations lies in how toptier
memory is tracked: in his implementation, an lruvec stat aggregation is
done on each usage check, while in this implementation, a new cacheline
is introduced in page_counter to keep track of toptier usage (Kaiyang
also introduces a new cacheline in page_counter, but only uses it to
cache capacity and thresholds). This implementation also extends the
memory limit enforcement to memory.high.

[1] https://lore.kernel.org/linux-mm/20240920221202.1734227-1-kaiyang2@cs.cmu.edu/

---
Joshua Hahn (6):
  mm/memory-tiers: Introduce tier-aware memcg limit sysfs
  mm/page_counter: Introduce tiered memory awareness to page_counter
  mm/memory-tiers, memcontrol: Introduce toptier capacity updates
  mm/memcontrol: Charge and uncharge from toptier
  mm/memcontrol, page_counter: Make memory.low tier-aware
  mm/memcontrol: Make memory.high tier-aware

 include/linux/memcontrol.h   |  21 ++++-
 include/linux/memory-tiers.h |  30 +++++++
 include/linux/page_counter.h |  31 ++++++-
 include/linux/swap.h         |   3 +-
 kernel/cgroup/cpuset.c       |   2 +-
 kernel/cgroup/dmem.c         |   2 +-
 mm/memcontrol-v1.c           |   6 +-
 mm/memcontrol.c              | 155 +++++++++++++++++++++++++++++++----
 mm/memory-tiers.c            |  63 ++++++++++++++
 mm/page_counter.c            |  77 ++++++++++++++++-
 mm/vmscan.c                  |  24 ++++--
 11 files changed, 376 insertions(+), 38 deletions(-)

-- 
2.47.3



^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH 1/6] mm/memory-tiers: Introduce tier-aware memcg limit sysfs
  2026-02-23 22:38 [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware Joshua Hahn
@ 2026-02-23 22:38 ` Joshua Hahn
  2026-02-23 22:38 ` [RFC PATCH 2/6] mm/page_counter: Introduce tiered memory awareness to page_counter Joshua Hahn
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Joshua Hahn @ 2026-02-23 22:38 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linux-mm, linux-kernel,
	kernel-team

Introduce a sysfs entry /sys/kernel/mm/numa/tier_aware_memcg to allow
users to toggle whether memcg limits are scaled by the system's
toptier:total capacity ratio.

Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
 include/linux/memory-tiers.h |  1 +
 mm/memory-tiers.c            | 22 ++++++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 96987d9d95a8..85440473effb 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -37,6 +37,7 @@ struct access_coordinate;
 
 #ifdef CONFIG_NUMA
 extern bool numa_demotion_enabled;
+extern bool tier_aware_memcg_limits;
 extern struct memory_dev_type *default_dram_type;
 extern nodemask_t default_dram_nodes;
 struct memory_dev_type *alloc_memory_type(int adistance);
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 545e34626df7..a88256381519 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -939,6 +939,8 @@ subsys_initcall(memory_tier_init);
 
 bool numa_demotion_enabled = false;
 
+bool tier_aware_memcg_limits;
+
 #ifdef CONFIG_MIGRATION
 #ifdef CONFIG_SYSFS
 static ssize_t demotion_enabled_show(struct kobject *kobj,
@@ -975,8 +977,28 @@ static ssize_t demotion_enabled_store(struct kobject *kobj,
 static struct kobj_attribute numa_demotion_enabled_attr =
 	__ATTR_RW(demotion_enabled);
 
+static ssize_t tier_aware_memcg_show(struct kobject *kobj,
+				     struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%s\n", str_true_false(tier_aware_memcg_limits));
+}
+
+static ssize_t tier_aware_memcg_store(struct kobject *kobj,
+				      struct kobj_attribute *attr,
+				      const char *buf, size_t count)
+{
+	if (kstrtobool(buf, &tier_aware_memcg_limits))
+		return -EINVAL;
+
+	return count;
+}
+
+static struct kobj_attribute numa_tier_aware_memcg_attr =
+	__ATTR_RW(tier_aware_memcg);
+
 static struct attribute *numa_attrs[] = {
 	&numa_demotion_enabled_attr.attr,
+	&numa_tier_aware_memcg_attr.attr,
 	NULL,
 };
 
-- 
2.47.3




* [RFC PATCH 2/6] mm/page_counter: Introduce tiered memory awareness to page_counter
  2026-02-23 22:38 [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware Joshua Hahn
  2026-02-23 22:38 ` [RFC PATCH 1/6] mm/memory-tiers: Introduce tier-aware memcg limit sysfs Joshua Hahn
@ 2026-02-23 22:38 ` Joshua Hahn
  2026-02-23 22:38 ` [RFC PATCH 3/6] mm/memory-tiers, memcontrol: Introduce toptier capacity updates Joshua Hahn
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Joshua Hahn @ 2026-02-23 22:38 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, linux-mm, cgroups,
	linux-kernel, kernel-team

On systems with tiered memory, there is currently no tracking of memory
at the tier-memcg granularity. While per-memcg lruvec stats provide a
finer granularity that could be aggregated to give the desired
per-tier-memcg accounting, relying on these lruvec stats for limit
checking would touch too many hot paths too frequently and could
introduce increased latency for other memcg users.

Instead, add a new cacheline in struct page_counter to track toptier
memcg limits and usage, as well as cached capacity values. This
cacheline is only used by the mem_cgroup->memory page_counter.

Also, introduce helpers that use these new fields to calculate
proportional toptier high and low values, based on the system's
toptier:total capacity ratio.

Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
 include/linux/page_counter.h | 22 +++++++++++++++++++++-
 mm/page_counter.c            | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index d649b6bbbc87..128c1272c88c 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -5,6 +5,7 @@
 #include <linux/atomic.h>
 #include <linux/cache.h>
 #include <linux/limits.h>
+#include <linux/nodemask.h>
 #include <asm/page.h>
 
 struct page_counter {
@@ -31,9 +32,23 @@ struct page_counter {
 	/* Latest cg2 reset watermark */
 	unsigned long local_watermark;
 
-	/* Keep all the read most fields in a separete cacheline. */
+	/* Keep all the tiered memory fields in a separate cacheline. */
 	CACHELINE_PADDING(_pad2_);
 
+	atomic_long_t toptier_usage;
+
+	/* effective toptier-proportional low protection */
+	unsigned long etoptier_low;
+	atomic_long_t toptier_low_usage;
+	atomic_long_t children_toptier_low_usage;
+
+	/* Cached toptier capacity for proportional limit calculations */
+	unsigned long toptier_capacity;
+	unsigned long total_capacity;
+
+	/* Keep all the read most fields in a separate cacheline. */
+	CACHELINE_PADDING(_pad3_);
+
 	bool protection_support;
 	bool track_failcnt;
 	unsigned long min;
@@ -61,6 +76,9 @@ static inline void page_counter_init(struct page_counter *counter,
 	counter->parent = parent;
 	counter->protection_support = protection_support;
 	counter->track_failcnt = false;
+	counter->toptier_usage = (atomic_long_t)ATOMIC_LONG_INIT(0);
+	counter->toptier_capacity = 0;
+	counter->total_capacity = 0;
 }
 
 static inline unsigned long page_counter_read(struct page_counter *counter)
@@ -103,6 +121,8 @@ static inline void page_counter_reset_watermark(struct page_counter *counter)
 void page_counter_calculate_protection(struct page_counter *root,
 				       struct page_counter *counter,
 				       bool recursive_protection);
+unsigned long page_counter_toptier_high(struct page_counter *counter);
+unsigned long page_counter_toptier_low(struct page_counter *counter);
 #else
 static inline void page_counter_calculate_protection(struct page_counter *root,
 						     struct page_counter *counter,
diff --git a/mm/page_counter.c b/mm/page_counter.c
index 661e0f2a5127..5ec97811c418 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -462,4 +462,38 @@ void page_counter_calculate_protection(struct page_counter *root,
 			atomic_long_read(&parent->children_low_usage),
 			recursive_protection));
 }
+
+unsigned long page_counter_toptier_high(struct page_counter *counter)
+{
+	unsigned long high = READ_ONCE(counter->high);
+	unsigned long toptier_cap, total_cap;
+
+	if (high == PAGE_COUNTER_MAX)
+		return PAGE_COUNTER_MAX;
+
+	toptier_cap = counter->toptier_capacity;
+	total_cap = counter->total_capacity;
+
+	if (!total_cap)
+		return PAGE_COUNTER_MAX;
+
+	return mult_frac(high, toptier_cap, total_cap);
+}
+
+unsigned long page_counter_toptier_low(struct page_counter *counter)
+{
+	unsigned long low = READ_ONCE(counter->low);
+	unsigned long toptier_cap, total_cap;
+
+	if (!low)
+		return 0;
+
+	toptier_cap = counter->toptier_capacity;
+	total_cap = counter->total_capacity;
+
+	if (!total_cap)
+		return 0;
+
+	return mult_frac(low, toptier_cap, total_cap);
+}
 #endif /* CONFIG_MEMCG || CONFIG_CGROUP_DMEM */
-- 
2.47.3




* [RFC PATCH 3/6] mm/memory-tiers, memcontrol: Introduce toptier capacity updates
  2026-02-23 22:38 [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware Joshua Hahn
  2026-02-23 22:38 ` [RFC PATCH 1/6] mm/memory-tiers: Introduce tier-aware memcg limit sysfs Joshua Hahn
  2026-02-23 22:38 ` [RFC PATCH 2/6] mm/page_counter: Introduce tiered memory awareness to page_counter Joshua Hahn
@ 2026-02-23 22:38 ` Joshua Hahn
  2026-02-23 22:38 ` [RFC PATCH 4/6] mm/memcontrol: Charge and uncharge from toptier Joshua Hahn
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Joshua Hahn @ 2026-02-23 22:38 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Waiman Long,
	Chen Ridong, Tejun Heo, Michal Koutny, linux-mm, cgroups,
	linux-kernel, kernel-team

What a memcg considers to be a valid toptier node is defined by three
criteria: (1) The node has CPUs, (2) The node has online memory,
and (3) The node is within the cgroup's cpuset.mems.

Of the three, the second and third criteria are the only ones that can
change dynamically during runtime, via memory hotplug events and
cpuset.mems changes, respectively.

Introduce functions to calculate and update toptier capacity, and call
them during cpuset.mems changes and memory hotplug events.

Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
 include/linux/memcontrol.h   |  6 ++++++
 include/linux/memory-tiers.h | 29 +++++++++++++++++++++++++
 include/linux/page_counter.h |  2 ++
 kernel/cgroup/cpuset.c       |  2 +-
 mm/memcontrol.c              | 17 +++++++++++++++
 mm/memory-tiers.c            | 41 ++++++++++++++++++++++++++++++++++++
 mm/page_counter.c            |  8 +++++++
 7 files changed, 104 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5173a9f16721..900a36112b62 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -608,6 +608,8 @@ static inline void mem_cgroup_protection(struct mem_cgroup *root,
 void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 				     struct mem_cgroup *memcg);
 
+void update_memcg_toptier_capacity(void);
+
 static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
 					  struct mem_cgroup *memcg)
 {
@@ -1116,6 +1118,10 @@ static inline void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 {
 }
 
+static inline void update_memcg_toptier_capacity(void)
+{
+}
+
 static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
 					  struct mem_cgroup *memcg)
 {
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 85440473effb..cf616885e0db 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -53,6 +53,9 @@ int mt_perf_to_adistance(struct access_coordinate *perf, int *adist);
 struct memory_dev_type *mt_find_alloc_memory_type(int adist,
 						  struct list_head *memory_types);
 void mt_put_memory_types(struct list_head *memory_types);
+void mt_get_toptier_nodemask(nodemask_t *mask, const nodemask_t *allowed);
+unsigned long mt_get_toptier_capacity(const nodemask_t *allowed);
+unsigned long mt_get_total_capacity(const nodemask_t *allowed);
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node, const nodemask_t *allowed_mask);
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
@@ -152,5 +155,31 @@ static inline struct memory_dev_type *mt_find_alloc_memory_type(int adist,
 static inline void mt_put_memory_types(struct list_head *memory_types)
 {
 }
+
+static inline void mt_get_toptier_nodemask(nodemask_t *mask,
+					   const nodemask_t *allowed)
+{
+	*mask = node_states[N_MEMORY];
+	if (allowed)
+		nodes_and(*mask, *mask, *allowed);
+}
+
+static inline unsigned long mt_get_toptier_capacity(const nodemask_t *allowed)
+{
+	int nid;
+	unsigned long capacity = 0;
+
+	for_each_node_state(nid, N_MEMORY) {
+		if (allowed && !node_isset(nid, *allowed))
+			continue;
+		capacity += NODE_DATA(nid)->node_present_pages;
+	}
+	return capacity;
+}
+
+static inline unsigned long mt_get_total_capacity(const nodemask_t *allowed)
+{
+	return mt_get_toptier_capacity(allowed);
+}
 #endif	/* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index 128c1272c88c..ada5f1dd75d4 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -121,6 +121,8 @@ static inline void page_counter_reset_watermark(struct page_counter *counter)
 void page_counter_calculate_protection(struct page_counter *root,
 				       struct page_counter *counter,
 				       bool recursive_protection);
+void page_counter_update_toptier_capacity(struct page_counter *counter,
+					  const nodemask_t *allowed);
 unsigned long page_counter_toptier_high(struct page_counter *counter);
 unsigned long page_counter_toptier_low(struct page_counter *counter);
 #else
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 7607dfe516e6..e5641dc1af88 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2620,7 +2620,6 @@ static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
 	rcu_read_lock();
 	cpuset_for_each_descendant_pre(cp, pos_css, cs) {
 		struct cpuset *parent = parent_cs(cp);
-
 		bool has_mems = nodes_and(*new_mems, cp->mems_allowed, parent->effective_mems);
 
 		/*
@@ -2701,6 +2700,7 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
 
 	/* use trialcs->mems_allowed as a temp variable */
 	update_nodemasks_hier(cs, &trialcs->mems_allowed);
+	update_memcg_toptier_capacity();
 	return 0;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0be1e823d813..f3e4a6ce7181 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -54,6 +54,7 @@
 #include <linux/seq_file.h>
 #include <linux/vmpressure.h>
 #include <linux/memremap.h>
+#include <linux/memory-tiers.h>
 #include <linux/mm_inline.h>
 #include <linux/swap_cgroup.h>
 #include <linux/cpu.h>
@@ -3906,6 +3907,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 
 		page_counter_init(&memcg->memory, &parent->memory, memcg_on_dfl);
 		page_counter_init(&memcg->swap, &parent->swap, false);
+		page_counter_update_toptier_capacity(&memcg->memory, NULL);
 #ifdef CONFIG_MEMCG_V1
 		memcg->memory.track_failcnt = !memcg_on_dfl;
 		WRITE_ONCE(memcg->oom_kill_disable, READ_ONCE(parent->oom_kill_disable));
@@ -3917,6 +3919,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 		init_memcg_events();
 		page_counter_init(&memcg->memory, NULL, true);
 		page_counter_init(&memcg->swap, NULL, false);
+		page_counter_update_toptier_capacity(&memcg->memory, NULL);
 #ifdef CONFIG_MEMCG_V1
 		page_counter_init(&memcg->kmem, NULL, false);
 		page_counter_init(&memcg->tcpmem, NULL, false);
@@ -4804,6 +4807,20 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 	page_counter_calculate_protection(&root->memory, &memcg->memory, recursive_protection);
 }
 
+void update_memcg_toptier_capacity(void)
+{
+	struct mem_cgroup *memcg;
+	nodemask_t allowed;
+
+	for_each_mem_cgroup(memcg) {
+		if (memcg == root_mem_cgroup)
+			continue;
+
+		cpuset_nodes_allowed(memcg->css.cgroup, &allowed);
+		page_counter_update_toptier_capacity(&memcg->memory, &allowed);
+	}
+}
+
 static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
 			gfp_t gfp)
 {
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index a88256381519..259caaf4be8f 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -889,6 +889,7 @@ static int __meminit memtier_hotplug_callback(struct notifier_block *self,
 		mutex_lock(&memory_tier_lock);
 		if (clear_node_memory_tier(nn->nid))
 			establish_demotion_targets();
+		update_memcg_toptier_capacity();
 		mutex_unlock(&memory_tier_lock);
 		break;
 	case NODE_ADDED_FIRST_MEMORY:
@@ -896,6 +897,7 @@ static int __meminit memtier_hotplug_callback(struct notifier_block *self,
 		memtier = set_node_memory_tier(nn->nid);
 		if (!IS_ERR(memtier))
 			establish_demotion_targets();
+		update_memcg_toptier_capacity();
 		mutex_unlock(&memory_tier_lock);
 		break;
 	}
@@ -941,6 +943,45 @@ bool numa_demotion_enabled = false;
 
 bool tier_aware_memcg_limits;
 
+void mt_get_toptier_nodemask(nodemask_t *mask, const nodemask_t *allowed)
+{
+	int nid;
+
+	*mask = NODE_MASK_NONE;
+	for_each_node_state(nid, N_MEMORY) {
+		if (node_is_toptier(nid))
+			node_set(nid, *mask);
+	}
+	if (allowed)
+		nodes_and(*mask, *mask, *allowed);
+}
+
+unsigned long mt_get_toptier_capacity(const nodemask_t *allowed)
+{
+	int nid;
+	unsigned long capacity = 0;
+	nodemask_t mask;
+
+	mt_get_toptier_nodemask(&mask, allowed);
+	for_each_node_mask(nid, mask)
+		capacity += NODE_DATA(nid)->node_present_pages;
+
+	return capacity;
+}
+
+unsigned long mt_get_total_capacity(const nodemask_t *allowed)
+{
+	int nid;
+	unsigned long capacity = 0;
+
+	for_each_node_state(nid, N_MEMORY) {
+		if (allowed && !node_isset(nid, *allowed))
+			continue;
+		capacity += NODE_DATA(nid)->node_present_pages;
+	}
+	return capacity;
+}
+
 #ifdef CONFIG_MIGRATION
 #ifdef CONFIG_SYSFS
 static ssize_t demotion_enabled_show(struct kobject *kobj,
diff --git a/mm/page_counter.c b/mm/page_counter.c
index 5ec97811c418..cf21c72bfd4e 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -11,6 +11,7 @@
 #include <linux/string.h>
 #include <linux/sched.h>
 #include <linux/bug.h>
+#include <linux/memory-tiers.h>
 #include <asm/page.h>
 
 static bool track_protection(struct page_counter *c)
@@ -463,6 +464,13 @@ void page_counter_calculate_protection(struct page_counter *root,
 			recursive_protection));
 }
 
+void page_counter_update_toptier_capacity(struct page_counter *counter,
+					  const nodemask_t *allowed)
+{
+	counter->toptier_capacity = mt_get_toptier_capacity(allowed);
+	counter->total_capacity = mt_get_total_capacity(allowed);
+}
+
 unsigned long page_counter_toptier_high(struct page_counter *counter)
 {
 	unsigned long high = READ_ONCE(counter->high);
-- 
2.47.3




* [RFC PATCH 4/6] mm/memcontrol: Charge and uncharge from toptier
  2026-02-23 22:38 [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware Joshua Hahn
                   ` (2 preceding siblings ...)
  2026-02-23 22:38 ` [RFC PATCH 3/6] mm/memory-tiers, memcontrol: Introduce toptier capacity updates Joshua Hahn
@ 2026-02-23 22:38 ` Joshua Hahn
  2026-02-23 22:38 ` [RFC PATCH 5/6] mm/memcontrol, page_counter: Make memory.low tier-aware Joshua Hahn
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Joshua Hahn @ 2026-02-23 22:38 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, linux-mm, cgroups, linux-kernel,
	kernel-team

Modify memcg charging and uncharging sites to also update toptier
statistics.

Unfortunately, try_charge_memcg is unaware of the physical folio being
charged; it only deals with nr_pages. Rather than modifying
try_charge_memcg, adjust the toptier fields inside charge_memcg, once
try_charge_memcg succeeds.

Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
 mm/memcontrol.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f3e4a6ce7181..07464f02c529 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1948,6 +1948,24 @@ static void memcg_uncharge(struct mem_cgroup *memcg, unsigned int nr_pages)
 		page_counter_uncharge(&memcg->memsw, nr_pages);
 }
 
+static void memcg_charge_toptier(struct mem_cgroup *memcg,
+				 unsigned long nr_pages)
+{
+	struct page_counter *c;
+
+	for (c = &memcg->memory; c; c = c->parent)
+		atomic_long_add(nr_pages, &c->toptier_usage);
+}
+
+static void memcg_uncharge_toptier(struct mem_cgroup *memcg,
+				   unsigned long nr_pages)
+{
+	struct page_counter *c;
+
+	for (c = &memcg->memory; c; c = c->parent)
+		atomic_long_sub(nr_pages, &c->toptier_usage);
+}
+
 /*
  * Returns stocks cached in percpu and reset cached information.
  */
@@ -4830,6 +4848,9 @@ static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
 	if (ret)
 		goto out;
 
+	if (node_is_toptier(folio_nid(folio)))
+		memcg_charge_toptier(memcg, folio_nr_pages(folio));
+
 	css_get(&memcg->css);
 	commit_charge(folio, memcg);
 	memcg1_commit_charge(folio, memcg);
@@ -4921,6 +4942,7 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
 struct uncharge_gather {
 	struct mem_cgroup *memcg;
 	unsigned long nr_memory;
+	unsigned long nr_toptier;
 	unsigned long pgpgout;
 	unsigned long nr_kmem;
 	int nid;
@@ -4941,6 +4963,8 @@ static void uncharge_batch(const struct uncharge_gather *ug)
 		}
 		memcg1_oom_recover(ug->memcg);
 	}
+	if (ug->nr_toptier)
+		memcg_uncharge_toptier(ug->memcg, ug->nr_toptier);
 
 	memcg1_uncharge_batch(ug->memcg, ug->pgpgout, ug->nr_memory, ug->nid);
 
@@ -4989,6 +5013,9 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
 
 	nr_pages = folio_nr_pages(folio);
 
+	if (node_is_toptier(folio_nid(folio)))
+		ug->nr_toptier += nr_pages;
+
 	if (folio_memcg_kmem(folio)) {
 		ug->nr_memory += nr_pages;
 		ug->nr_kmem += nr_pages;
@@ -5072,6 +5099,10 @@ void mem_cgroup_replace_folio(struct folio *old, struct folio *new)
 			page_counter_charge(&memcg->memsw, nr_pages);
 	}
 
+	/* The old folio's toptier_usage will be decremented when it is freed */
+	if (node_is_toptier(folio_nid(new)))
+		memcg_charge_toptier(memcg, nr_pages);
+
 	css_get(&memcg->css);
 	commit_charge(new, memcg);
 	memcg1_commit_charge(new, memcg);
@@ -5091,6 +5122,7 @@ void mem_cgroup_replace_folio(struct folio *old, struct folio *new)
 void mem_cgroup_migrate(struct folio *old, struct folio *new)
 {
 	struct mem_cgroup *memcg;
+	int old_toptier, new_toptier;
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(old), old);
 	VM_BUG_ON_FOLIO(!folio_test_locked(new), new);
@@ -5111,6 +5143,13 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
 	if (!memcg)
 		return;
 
+	old_toptier = node_is_toptier(folio_nid(old));
+	new_toptier = node_is_toptier(folio_nid(new));
+	if (old_toptier && !new_toptier)
+		memcg_uncharge_toptier(memcg, folio_nr_pages(old));
+	else if (!old_toptier && new_toptier)
+		memcg_charge_toptier(memcg, folio_nr_pages(old));
+
 	/* Transfer the charge and the css ref */
 	commit_charge(new, memcg);
 
-- 
2.47.3




* [RFC PATCH 5/6] mm/memcontrol, page_counter: Make memory.low tier-aware
  2026-02-23 22:38 [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware Joshua Hahn
                   ` (3 preceding siblings ...)
  2026-02-23 22:38 ` [RFC PATCH 4/6] mm/memcontrol: Charge and uncharge from toptier Joshua Hahn
@ 2026-02-23 22:38 ` Joshua Hahn
  2026-02-23 22:38 ` [RFC PATCH 6/6] mm/memcontrol: Make memory.high tier-aware Joshua Hahn
  2026-02-24 11:27 ` [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware Michal Hocko
  6 siblings, 0 replies; 11+ messages in thread
From: Joshua Hahn @ 2026-02-23 22:38 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Tejun Heo,
	Michal Koutny, Axel Rasmussen, Yuanchu Xie, Wei Xu, Qi Zheng,
	linux-mm, cgroups, linux-kernel, kernel-team

On machines serving multiple workloads whose memory is isolated via
the memory cgroup controller, it is currently impossible to enforce a
fair distribution of toptier memory among the workloads, as the only
enforceable limits concern total memory footprint, not where that
memory resides.

This makes ensuring consistent baseline performance difficult, as each
workload's performance is heavily impacted by workload-external factors
such as which other workloads are co-located on the same host, and the
order in which the workloads are started.

Extend the existing memory.low protection to be tier-aware in the
charging, enforcement, and protection calculation to provide
best-effort attempts at protecting a fair proportion of toptier memory.

Updates to protection and charging are performed in the same paths as
the standard memcontrol equivalents. Enforcement of tier-aware memcg
limits, however, is gated behind the tier_aware_memcg sysfs knob, so
that runtime enabling of tier-aware limits can account for memory
already present in the system.

Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
 include/linux/memcontrol.h   | 15 +++++++++++----
 include/linux/page_counter.h |  7 ++++---
 kernel/cgroup/dmem.c         |  2 +-
 mm/memcontrol.c              | 14 ++++++++++++--
 mm/page_counter.c            | 35 ++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                  | 13 +++++++++----
 6 files changed, 71 insertions(+), 15 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 900a36112b62..a998a1e3b8b0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -606,7 +606,9 @@ static inline void mem_cgroup_protection(struct mem_cgroup *root,
 }
 
 void mem_cgroup_calculate_protection(struct mem_cgroup *root,
-				     struct mem_cgroup *memcg);
+				     struct mem_cgroup *memcg, bool toptier);
+
+unsigned long mem_cgroup_toptier_usage(struct mem_cgroup *memcg);
 
 void update_memcg_toptier_capacity(void);
 
@@ -623,11 +625,15 @@ static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
 }
 
 static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
-					struct mem_cgroup *memcg)
+					struct mem_cgroup *memcg, bool toptier)
 {
 	if (mem_cgroup_unprotected(target, memcg))
 		return false;
 
+	if (toptier)
+		return READ_ONCE(memcg->memory.etoptier_low) >=
+				 mem_cgroup_toptier_usage(memcg);
+
 	return READ_ONCE(memcg->memory.elow) >=
 		page_counter_read(&memcg->memory);
 }
@@ -1114,7 +1120,8 @@ static inline void mem_cgroup_protection(struct mem_cgroup *root,
 }
 
 static inline void mem_cgroup_calculate_protection(struct mem_cgroup *root,
-						   struct mem_cgroup *memcg)
+						   struct mem_cgroup *memcg,
+						   bool toptier)
 {
 }
 
@@ -1128,7 +1135,7 @@ static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
 	return true;
 }
 static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
-					struct mem_cgroup *memcg)
+					struct mem_cgroup *memcg, bool toptier)
 {
 	return false;
 }
diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index ada5f1dd75d4..6635ee7b9575 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -120,15 +120,16 @@ static inline void page_counter_reset_watermark(struct page_counter *counter)
 #if IS_ENABLED(CONFIG_MEMCG) || IS_ENABLED(CONFIG_CGROUP_DMEM)
 void page_counter_calculate_protection(struct page_counter *root,
 				       struct page_counter *counter,
-				       bool recursive_protection);
+				       bool recursive_protection, bool toptier);
 void page_counter_update_toptier_capacity(struct page_counter *counter,
 					  const nodemask_t *allowed);
 unsigned long page_counter_toptier_high(struct page_counter *counter);
 unsigned long page_counter_toptier_low(struct page_counter *counter);
 #else
 static inline void page_counter_calculate_protection(struct page_counter *root,
-						     struct page_counter *counter,
-						     bool recursive_protection) {}
+						struct page_counter *counter,
+						bool recursive_protection,
+						bool toptier) {}
 #endif
 
 #endif /* _LINUX_PAGE_COUNTER_H */
diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
index 1ea6afffa985..536d43c42de8 100644
--- a/kernel/cgroup/dmem.c
+++ b/kernel/cgroup/dmem.c
@@ -277,7 +277,7 @@ dmem_cgroup_calculate_protection(struct dmem_cgroup_pool_state *limit_pool,
 			continue;
 
 		page_counter_calculate_protection(
-			climit, &found_pool->cnt, true);
+			climit, &found_pool->cnt, true, false);
 
 		if (found_pool == test_pool)
 			break;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 07464f02c529..8aa7ae361a73 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4806,12 +4806,13 @@ struct cgroup_subsys memory_cgrp_subsys = {
  * mem_cgroup_calculate_protection - check if memory consumption is in the normal range
  * @root: the top ancestor of the sub-tree being checked
  * @memcg: the memory cgroup to check
+ * @toptier: whether the caller is in a toptier node
  *
  * WARNING: This function is not stateless! It can only be used as part
  *          of a top-down tree iteration, not for isolated queries.
  */
 void mem_cgroup_calculate_protection(struct mem_cgroup *root,
-				     struct mem_cgroup *memcg)
+				     struct mem_cgroup *memcg, bool toptier)
 {
 	bool recursive_protection =
 		cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT;
@@ -4822,7 +4823,16 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 	if (!root)
 		root = root_mem_cgroup;
 
-	page_counter_calculate_protection(&root->memory, &memcg->memory, recursive_protection);
+	page_counter_calculate_protection(&root->memory, &memcg->memory,
+					  recursive_protection, toptier);
+}
+
+unsigned long mem_cgroup_toptier_usage(struct mem_cgroup *memcg)
+{
+	if (mem_cgroup_disabled() || !memcg)
+		return 0;
+
+	return atomic_long_read(&memcg->memory.toptier_usage);
 }
 
 void update_memcg_toptier_capacity(void)
diff --git a/mm/page_counter.c b/mm/page_counter.c
index cf21c72bfd4e..79d46a1c4c0c 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -410,12 +410,39 @@ static unsigned long effective_protection(unsigned long usage,
 	return ep;
 }
 
+static void calculate_protection_toptier(struct page_counter *counter,
+					 bool recursive_protection)
+{
+	struct page_counter *parent = counter->parent;
+	unsigned long toptier_low;
+	unsigned long toptier_usage, parent_toptier_usage;
+	unsigned long toptier_protected, old_toptier_protected;
+	long delta;
+
+	toptier_low = page_counter_toptier_low(counter);
+	toptier_usage = atomic_long_read(&counter->toptier_usage);
+	parent_toptier_usage = atomic_long_read(&parent->toptier_usage);
+
+	/* Propagate toptier low usage to parent for sibling distribution */
+	toptier_protected = min(toptier_usage, toptier_low);
+	old_toptier_protected = atomic_long_xchg(&counter->toptier_low_usage,
+						 toptier_protected);
+	delta = toptier_protected - old_toptier_protected;
+	atomic_long_add(delta, &parent->children_toptier_low_usage);
+
+	WRITE_ONCE(counter->etoptier_low,
+		   effective_protection(toptier_usage, parent_toptier_usage,
+		   toptier_low, READ_ONCE(parent->etoptier_low),
+		   atomic_long_read(&parent->children_toptier_low_usage),
+		   recursive_protection));
+}
 
 /**
  * page_counter_calculate_protection - check if memory consumption is in the normal range
  * @root: the top ancestor of the sub-tree being checked
  * @counter: the page_counter the counter to update
  * @recursive_protection: Whether to use memory_recursiveprot behavior.
+ * @toptier: Whether to calculate toptier-proportional protection
  *
  * Calculates elow/emin thresholds for given page_counter.
  *
@@ -424,7 +451,7 @@ static unsigned long effective_protection(unsigned long usage,
  */
 void page_counter_calculate_protection(struct page_counter *root,
 				       struct page_counter *counter,
-				       bool recursive_protection)
+				       bool recursive_protection, bool toptier)
 {
 	unsigned long usage, parent_usage;
 	struct page_counter *parent = counter->parent;
@@ -446,6 +473,9 @@ void page_counter_calculate_protection(struct page_counter *root,
 	if (parent == root) {
 		counter->emin = READ_ONCE(counter->min);
 		counter->elow = READ_ONCE(counter->low);
+		if (toptier)
+			WRITE_ONCE(counter->etoptier_low,
+				   page_counter_toptier_low(counter));
 		return;
 	}
 
@@ -462,6 +492,9 @@ void page_counter_calculate_protection(struct page_counter *root,
 			READ_ONCE(parent->elow),
 			atomic_long_read(&parent->children_low_usage),
 			recursive_protection));
+
+	if (toptier)
+		calculate_protection_toptier(counter, recursive_protection);
 }
 
 void page_counter_update_toptier_capacity(struct page_counter *counter,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6a87ac7be43c..5b4cb030a477 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4144,6 +4144,7 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 	struct mem_cgroup *memcg;
 	unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl);
 	bool reclaimable = !min_ttl;
+	bool toptier = node_is_toptier(pgdat->node_id);
 
 	VM_WARN_ON_ONCE(!current_is_kswapd());
 
@@ -4153,7 +4154,7 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 	do {
 		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
 
-		mem_cgroup_calculate_protection(NULL, memcg);
+		mem_cgroup_calculate_protection(NULL, memcg, toptier);
 
 		if (!reclaimable)
 			reclaimable = lruvec_is_reclaimable(lruvec, sc, min_ttl);
@@ -4905,12 +4906,14 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
 	unsigned long reclaimed = sc->nr_reclaimed;
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+	bool toptier = tier_aware_memcg_limits &&
+		       node_is_toptier(pgdat->node_id);
 
 	/* lru_gen_age_node() called mem_cgroup_calculate_protection() */
 	if (mem_cgroup_below_min(NULL, memcg))
 		return MEMCG_LRU_YOUNG;
 
-	if (mem_cgroup_below_low(NULL, memcg)) {
+	if (mem_cgroup_below_low(NULL, memcg, toptier)) {
 		/* see the comment on MEMCG_NR_GENS */
 		if (READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_TAIL)
 			return MEMCG_LRU_TAIL;
@@ -5960,6 +5963,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 	};
 	struct mem_cgroup_reclaim_cookie *partial = &reclaim;
 	struct mem_cgroup *memcg;
+	bool toptier = node_is_toptier(pgdat->node_id);
 
 	/*
 	 * In most cases, direct reclaimers can do partial walks
@@ -5987,7 +5991,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 		 */
 		cond_resched();
 
-		mem_cgroup_calculate_protection(target_memcg, memcg);
+		mem_cgroup_calculate_protection(target_memcg, memcg, toptier);
 
 		if (mem_cgroup_below_min(target_memcg, memcg)) {
 			/*
@@ -5995,7 +5999,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 			 * If there is no reclaimable memory, OOM.
 			 */
 			continue;
-		} else if (mem_cgroup_below_low(target_memcg, memcg)) {
+		} else if (mem_cgroup_below_low(target_memcg, memcg,
+					tier_aware_memcg_limits && toptier)) {
 			/*
 			 * Soft protection.
 			 * Respect the protection only as long as
-- 
2.47.3



^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH 6/6] mm/memcontrol: Make memory.high tier-aware
  2026-02-23 22:38 [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware Joshua Hahn
                   ` (4 preceding siblings ...)
  2026-02-23 22:38 ` [RFC PATCH 5/6] mm/memcontrol, page_counter: Make memory.low tier-aware Joshua Hahn
@ 2026-02-23 22:38 ` Joshua Hahn
  2026-02-24 11:27 ` [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware Michal Hocko
  6 siblings, 0 replies; 11+ messages in thread
From: Joshua Hahn @ 2026-02-23 22:38 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Qi Zheng, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	linux-mm, cgroups, linux-kernel, kernel-team

On machines serving multiple workloads whose memory is isolated via the
memory cgroup controller, it is currently impossible to enforce a fair
distribution of toptier memory among the workloads, as the only
enforceable limits concern total memory footprint, not where that
memory resides.

This makes it difficult to ensure a consistent baseline performance, as
each workload's performance is heavily impacted by workload-external
factors such as which other workloads are co-located on the same host,
and the order in which different workloads are started.

Extend the existing memory.high protection to be tier-aware in both
charging and enforcement, to limit toptier-hogging by workloads.

Also, add a new nodemask parameter to try_to_free_mem_cgroup_pages(),
which can be used to selectively reclaim memory at the intersection of
a cgroup and a memory tier.
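Conceptually, the breach handling added to reclaim_high() below picks a
reclaim target as follows. This is an illustrative sketch with made-up
names, not code from the series:

```c
#include <assert.h>

/*
 * Illustrative sketch only (names are hypothetical): reclaim_high()
 * restricts reclaim to toptier nodes when only the toptier high
 * watermark is breached, and reclaims from all nodes (NULL nodemask)
 * when the overall memory.high is breached.
 */
enum reclaim_target { RECLAIM_NONE, RECLAIM_TOPTIER_ONLY, RECLAIM_ALL };

static enum reclaim_target pick_reclaim_target(int mem_high_ok,
					       int toptier_high_ok)
{
	if (mem_high_ok && toptier_high_ok)
		return RECLAIM_NONE;		/* both within limits */
	if (mem_high_ok)
		return RECLAIM_TOPTIER_ONLY;	/* only toptier breached */
	return RECLAIM_ALL;			/* total high breached */
}
```
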

Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
 include/linux/swap.h |  3 +-
 mm/memcontrol-v1.c   |  6 ++--
 mm/memcontrol.c      | 85 +++++++++++++++++++++++++++++++++++++-------
 mm/vmscan.c          | 11 +++---
 4 files changed, 84 insertions(+), 21 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0effe3cc50f5..c6037ac7bf6e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -368,7 +368,8 @@ extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
 						  unsigned int reclaim_options,
-						  int *swappiness);
+						  int *swappiness,
+						  nodemask_t *allowed);
 extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						pg_data_t *pgdat,
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 0b39ba608109..29630c7f3567 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -1497,7 +1497,8 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
 		}
 
 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-				memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) {
+				memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP,
+				NULL, NULL)) {
 			ret = -EBUSY;
 			break;
 		}
@@ -1529,7 +1530,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 			return -EINTR;
 
 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-						  MEMCG_RECLAIM_MAY_SWAP, NULL))
+						  MEMCG_RECLAIM_MAY_SWAP,
+						  NULL, NULL))
 			nr_retries--;
 	}
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8aa7ae361a73..ebd4a1b73c51 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2184,18 +2184,30 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
 
 	do {
 		unsigned long pflags;
-
-		if (page_counter_read(&memcg->memory) <=
-		    READ_ONCE(memcg->memory.high))
+		nodemask_t toptier_nodes, *reclaim_nodes;
+		bool mem_high_ok, toptier_high_ok;
+
+		mt_get_toptier_nodemask(&toptier_nodes, NULL);
+		mem_high_ok = page_counter_read(&memcg->memory) <=
+			      READ_ONCE(memcg->memory.high);
+		toptier_high_ok = !(tier_aware_memcg_limits &&
+				    mem_cgroup_toptier_usage(memcg) >
+				    page_counter_toptier_high(&memcg->memory));
+		if (mem_high_ok && toptier_high_ok)
 			continue;
 
+		if (mem_high_ok && !toptier_high_ok)
+			reclaim_nodes = &toptier_nodes;
+		else
+			reclaim_nodes = NULL;
+
 		memcg_memory_event(memcg, MEMCG_HIGH);
 
 		psi_memstall_enter(&pflags);
 		nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
 							gfp_mask,
 							MEMCG_RECLAIM_MAY_SWAP,
-							NULL);
+							NULL, reclaim_nodes);
 		psi_memstall_leave(&pflags);
 	} while ((memcg = parent_mem_cgroup(memcg)) &&
 		 !mem_cgroup_is_root(memcg));
@@ -2296,6 +2308,24 @@ static u64 mem_find_max_overage(struct mem_cgroup *memcg)
 	return max_overage;
 }
 
+static u64 toptier_find_max_overage(struct mem_cgroup *memcg)
+{
+	u64 overage, max_overage = 0;
+
+	if (!tier_aware_memcg_limits)
+		return 0;
+
+	do {
+		unsigned long usage = mem_cgroup_toptier_usage(memcg);
+		unsigned long high = page_counter_toptier_high(&memcg->memory);
+
+		overage = calculate_overage(usage, high);
+		max_overage = max(overage, max_overage);
+	} while ((memcg = parent_mem_cgroup(memcg)) &&
+		  !mem_cgroup_is_root(memcg));
+
+	return max_overage;
+}
 static u64 swap_find_max_overage(struct mem_cgroup *memcg)
 {
 	u64 overage, max_overage = 0;
@@ -2401,6 +2431,14 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	penalty_jiffies += calculate_high_delay(memcg, nr_pages,
 						swap_find_max_overage(memcg));
 
+	/*
+	 * Don't double-penalize for toptier high overage if system-wide
+	 * memory.high has already been breached.
+	 */
+	if (!penalty_jiffies)
+		penalty_jiffies += calculate_high_delay(memcg, nr_pages,
+					toptier_find_max_overage(memcg));
+
 	/*
 	 * Clamp the max delay per usermode return so as to still keep the
 	 * application moving forwards and also permit diagnostics, albeit
@@ -2503,7 +2541,8 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 	psi_memstall_enter(&pflags);
 	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
-						    gfp_mask, reclaim_options, NULL);
+						    gfp_mask, reclaim_options,
+						    NULL, NULL);
 	psi_memstall_leave(&pflags);
 
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
@@ -2592,23 +2631,26 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * reclaim, the cost of mismatch is negligible.
 	 */
 	do {
-		bool mem_high, swap_high;
+		bool mem_high, swap_high, toptier_high = false;
 
 		mem_high = page_counter_read(&memcg->memory) >
 			READ_ONCE(memcg->memory.high);
 		swap_high = page_counter_read(&memcg->swap) >
 			READ_ONCE(memcg->swap.high);
+		toptier_high = tier_aware_memcg_limits &&
+			       (mem_cgroup_toptier_usage(memcg) >
+				page_counter_toptier_high(&memcg->memory));
 
 		/* Don't bother a random interrupted task */
 		if (!in_task()) {
-			if (mem_high) {
+			if (mem_high || toptier_high) {
 				schedule_work(&memcg->high_work);
 				break;
 			}
 			continue;
 		}
 
-		if (mem_high || swap_high) {
+		if (mem_high || swap_high || toptier_high) {
 			/*
 			 * The allocating tasks in this cgroup will need to do
 			 * reclaim or be throttled to prevent further growth
@@ -4476,7 +4518,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
 	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
 	bool drained = false;
-	unsigned long high;
+	unsigned long high, toptier_high;
 	int err;
 
 	buf = strstrip(buf);
@@ -4485,15 +4527,22 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 		return err;
 
 	page_counter_set_high(&memcg->memory, high);
+	toptier_high = page_counter_toptier_high(&memcg->memory);
 
 	if (of->file->f_flags & O_NONBLOCK)
 		goto out;
 
 	for (;;) {
 		unsigned long nr_pages = page_counter_read(&memcg->memory);
+		unsigned long toptier_pages = mem_cgroup_toptier_usage(memcg);
 		unsigned long reclaimed;
+		unsigned long to_free;
+		nodemask_t toptier_nodes, *reclaim_nodes;
+		bool mem_high_ok = nr_pages <= high;
+		bool toptier_high_ok = !(tier_aware_memcg_limits &&
+					 toptier_pages > toptier_high);
 
-		if (nr_pages <= high)
+		if (mem_high_ok && toptier_high_ok)
 			break;
 
 		if (signal_pending(current))
@@ -4505,8 +4554,17 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 			continue;
 		}
 
-		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL);
+		mt_get_toptier_nodemask(&toptier_nodes, NULL);
+		if (mem_high_ok && !toptier_high_ok) {
+			reclaim_nodes = &toptier_nodes;
+			to_free = toptier_pages - toptier_high;
+		} else {
+			reclaim_nodes = NULL;
+			to_free = nr_pages - high;
+		}
+		reclaimed = try_to_free_mem_cgroup_pages(memcg, to_free,
+					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
+					NULL, reclaim_nodes);
 
 		if (!reclaimed && !nr_retries--)
 			break;
@@ -4558,7 +4616,8 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 
 		if (nr_reclaims) {
 			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL))
+					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
+					NULL, NULL))
 				nr_reclaims--;
 			continue;
 		}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5b4cb030a477..94498734b4f5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6652,7 +6652,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					   unsigned long nr_pages,
 					   gfp_t gfp_mask,
 					   unsigned int reclaim_options,
-					   int *swappiness)
+					   int *swappiness, nodemask_t *allowed)
 {
 	unsigned long nr_reclaimed;
 	unsigned int noreclaim_flag;
@@ -6668,6 +6668,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 		.may_unmap = 1,
 		.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
 		.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
+		.nodemask = allowed,
 	};
 	/*
 	 * Traverse the ZONELIST_FALLBACK zonelist of the current node to put
@@ -6693,7 +6694,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					   unsigned long nr_pages,
 					   gfp_t gfp_mask,
 					   unsigned int reclaim_options,
-					   int *swappiness)
+					   int *swappiness, nodemask_t *allowed)
 {
 	return 0;
 }
@@ -7806,9 +7807,9 @@ int user_proactive_reclaim(char *buf,
 			reclaim_options = MEMCG_RECLAIM_MAY_SWAP |
 					  MEMCG_RECLAIM_PROACTIVE;
 			reclaimed = try_to_free_mem_cgroup_pages(memcg,
-						 batch_size, gfp_mask,
-						 reclaim_options,
-						 swappiness == -1 ? NULL : &swappiness);
+					batch_size, gfp_mask, reclaim_options,
+					swappiness == -1 ? NULL : &swappiness,
+					NULL);
 		} else {
 			struct scan_control sc = {
 				.gfp_mask = current_gfp_context(gfp_mask),
-- 
2.47.3




* Re: [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware
  2026-02-23 22:38 [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware Joshua Hahn
                   ` (5 preceding siblings ...)
  2026-02-23 22:38 ` [RFC PATCH 6/6] mm/memcontrol: Make memory.high tier-aware Joshua Hahn
@ 2026-02-24 11:27 ` Michal Hocko
  2026-02-24 16:13   ` Joshua Hahn
  6 siblings, 1 reply; 11+ messages in thread
From: Michal Hocko @ 2026-02-24 11:27 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Gregory Price, Johannes Weiner, Kaiyang Zhao, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Roman Gushchin, Shakeel Butt, Muchun Song, Waiman Long,
	Chen Ridong, Tejun Heo, Michal Koutny, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Qi Zheng, linux-mm, cgroups, linux-kernel,
	kernel-team

On Mon 23-02-26 14:38:23, Joshua Hahn wrote:
> Memory cgroups provide an interface that allow multiple workloads on a
> host to co-exist, and establish both weak and strong memory isolation
> guarantees. For large servers and small embedded systems alike, memcgs
> provide an effective way to provide a baseline quality of service for
> protected workloads.
> 
> This works, because for the most part, all memory is equal (except for
> zram / zswap). Restricting a cgroup's memory footprint restricts how
> much it can hurt other workloads competing for memory. Likewise, setting
> memory.low or memory.min limits can provide weak and strong guarantees
> to the performance of a cgroup.
> 
> However, on systems with tiered memory (e.g. CXL / compressed memory),
> the quality of service guarantees that memcg limits enforced become less
> effective, as memcg has no awareness of the physical location of its
> charged memory. In other words, a workload that is well-behaved within
> its memcg limits may still be hurting the performance of other
> well-behaving workloads on the system by hogging more than its
> "fair share" of toptier memory.

This assumes that the active workingset size of all workloads doesn't
fit into the top tier, right? Otherwise promotions would make sure that
we have the most active memory in the top tier. Is this typical in
real life configurations?

Or do you intend to limit memory consumption on a particular tier even
without external pressure?

> Introduce tier-aware memcg limits, which scale memory.low/high to
> reflect the ratio of toptier:total memory the cgroup has access.
> 
> Take the following scenario as an example:
> On a host with 3:1 toptier:lowtier, say 150G toptier, and 50Glowtier,
> setting a cgroup's limits to:
> 	memory.min:  15G
> 	memory.low:  20G
> 	memory.high: 40G
> 	memory.max:  50G
> 
> Will be enforced at the toptier as:
> 	memory.min:          15G
> 	memory.toptier_low:  15G (20 * 150/200)
> 	memory.toptier_high: 30G (40 * 150/200)
> 	memory.max:          50G

Let's spend some more time with the interface first. You seem to be
focusing only on the top tier with this interface, right? Is this really the
right way to go long term? What makes you believe that we do not really
hit the same issue with other tiers as well? Also do we want/need to
duplicate all the limits for each/top tier? What is the reasoning for
the switch to be runtime sysctl rather than boot-time or cgroup mount
option?

I will likely have more questions but these are immediate ones after
reading the cover. Please note I haven't really looked at the
implementation yet. I really want to understand usecases and interface
first.
-- 
Michal Hocko
SUSE Labs



* Re: [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware
  2026-02-24 11:27 ` [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware Michal Hocko
@ 2026-02-24 16:13   ` Joshua Hahn
  2026-02-24 18:49     ` Gregory Price
  0 siblings, 1 reply; 11+ messages in thread
From: Joshua Hahn @ 2026-02-24 16:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Gregory Price, Johannes Weiner, Kaiyang Zhao, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Roman Gushchin, Shakeel Butt, Muchun Song, Waiman Long,
	Chen Ridong, Tejun Heo, Michal Koutny, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Qi Zheng, linux-mm, cgroups, linux-kernel,
	kernel-team

Hello Michal,

I hope that you are doing well! Thank you for taking the time to review my
work and leaving your thoughts.

I wanted to note that I hope to bring this discussion to LSFMMBPF as well,
to discuss what the scope of the project should be, what usecases there
are (as I will note below), how to make this scalable and sustainable
for the future, etc. I'll send out a topic proposal later today. I had
separated the series from the proposal because I imagined that this
series would go through many versions, so it would be helpful to have
the topic as a unified place for pre-conference discussions.

> > Memory cgroups provide an interface that allow multiple workloads on a
> > host to co-exist, and establish both weak and strong memory isolation
> > guarantees. For large servers and small embedded systems alike, memcgs
> > provide an effective way to provide a baseline quality of service for
> > protected workloads.
> > 
> > This works, because for the most part, all memory is equal (except for
> > zram / zswap). Restricting a cgroup's memory footprint restricts how
> > much it can hurt other workloads competing for memory. Likewise, setting
> > memory.low or memory.min limits can provide weak and strong guarantees
> > to the performance of a cgroup.
> > 
> > However, on systems with tiered memory (e.g. CXL / compressed memory),
> > the quality of service guarantees that memcg limits enforced become less
> > effective, as memcg has no awareness of the physical location of its
> > charged memory. In other words, a workload that is well-behaved within
> > its memcg limits may still be hurting the performance of other
> > well-behaving workloads on the system by hogging more than its
> > "fair share" of toptier memory.

I will split up your questions to answer them individually:

> This assumes that the active workingset size of all workloads doesn't
> fit into the top tier right?

Yes, for the scenario above, a workload that is violating its fair share
of toptier memory mostly hurts other workloads if the aggregate working
set size of all workloads exceeds the size of toptier memory.

> Otherwise promotions would make sure that we have the most active
> memory in the top tier.

This is true. And for a lot of usecases, this is 100% the right thing to do.
However, with this patch I want to encourage a different perspective,
which is to think about things in a per-workload perspective, and not a
per-system perspective.

Having hot memory in high tiers and cold memory in low tiers is only
logical, since it increases the system's throughput and makes optimal
choices for latency. However, what about systems that care about
objectives other than simply maximizing throughput?

In the original cover letter I offered an example of VM hosting services
that care less about maximizing host-wide throughput, but more on ensuring
a bottomline performance guarantee for all workloads running on the system.
For the users on these services, they don't care that the host their VM is
running on is maximizing throughput; rather, they care that their VM meets
the performance guarantees that their provider promised. If there is no
way to know or enforce which tier of memory their workload lands on, either
the bottomline guarantee becomes very underestimated, or users must deal
with a high variance in performance.

Here's another example: Let's say there is a host with multiple workloads,
each serving queries for a database. The host would like to guarantee the
lowest maximum latency possible, while maximizing the total throughput
of the system. Once again in this situation, without tier-aware memcg
limits the host can maximize throughput, but can only make severely
underestimated promises on the bottom line.

> Is this typical in real life configurations?

I would say so. I think that the two examples above are realistic
scenarios that cloud providers and hyperscalers might face on tiered systems.

> Or do you intend to limit memory consumption on a particular tier even
> without external pressure?

This is a great question, and one that I hope to discuss at LSFMMBPF
to see how people expect an interface like this to work.

Over the past few weeks, I have been discussing this idea during the
Linux Memory Hotness and Promotion biweekly calls with Gregory Price [1].
One of the proposals that we made there (but did not include in this
series) is the idea of "fixed" vs. "opportunistic" reclaim.

Fixed mode is what we have here -- start limiting toptier usage whenever
a workload goes above its fair slice of toptier.
Opportunistic mode would allow a workload to use more toptier memory
than its fair share, restricting it only when toptier is pressured.

What do you think about these two options? For the stated goal of this
series, which is to help maximize the bottom line for workloads, fair
share seemed to make sense. Implementing opportunistic mode changes
on top of this work would most likely just be another sysctl.

> > Introduce tier-aware memcg limits, which scale memory.low/high to
> > reflect the ratio of toptier:total memory the cgroup has access.
> > 
> > Take the following scenario as an example:
> > On a host with 3:1 toptier:lowtier, say 150G toptier, and 50Glowtier,
> > setting a cgroup's limits to:
> > 	memory.min:  15G
> > 	memory.low:  20G
> > 	memory.high: 40G
> > 	memory.max:  50G
> > 
> > Will be enforced at the toptier as:
> > 	memory.min:          15G
> > 	memory.toptier_low:  15G (20 * 150/200)
> > 	memory.toptier_high: 30G (40 * 150/200)
> > 	memory.max:          50G
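To make the scaling quoted above concrete: each soft limit is multiplied
by the host's toptier:total capacity ratio. A standalone sketch (the
helper name and unit handling are hypothetical, not code from the series):

```c
#include <assert.h>

/*
 * Hypothetical illustration only -- not code from this series.  Scale
 * a cgroup limit by the host's toptier:total capacity ratio, as in the
 * quoted example (units are arbitrary as long as they are consistent).
 */
static unsigned long scale_to_toptier(unsigned long limit,
				      unsigned long toptier,
				      unsigned long total)
{
	/* Multiply first so integer division does not lose precision. */
	return limit * toptier / total;
}
```

With the quoted 150G:50G host, scale_to_toptier(20, 150, 200) gives 15
and scale_to_toptier(40, 150, 200) gives 30, matching the table above.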

I will split up the following points to answer them individually as well:

> Let's spend some more time with the interface first.

That sounds good to me; my goal was to bring this out as an RFC patchset
so folks could look at the code and understand the motivation, and then send
out the LSFMMBPF topic proposal. In retrospect I think I should have done
it in the opposite order. I'm sorry if this caused any confusion.

> You seem to be focusing only on the top tier with this interface, right?
> Is this really the right way to go long term? What makes you believe that
> we do not really hit the same issue with other tiers as well?

Yes, that's right. I'm not sure if this is the right way to go long-term
(say, past the next 5 years). My thinking was that I can stick with doing
this for toptier vs. non-toptier memory for now, and deal with having
3+ tiers in the future, when we start to have systems with that many tiers.
AFAICT two-tiered systems are still ~relatively new, and I don't think
there are a lot of genuine usecases for enforcing mid-tier memory limits
as of now. Of course, I would be excited to learn about these usecases
and work this patchset to support them as well if anybody has them.

> Also do we want/need to duplicate all the limits for each/top tier?

Sorry, I'm not sure that I completely understood this question. Are you
referring to the case where we have multiple nodes in the toptier?
If so, then all of those nodes are treated the same, and don't have
unique limits. Or are you referring to the case where we have multiple
tiers in the toptier? If so, I hope the answer above can answer this too.

> What is the reasoning for the switch to be runtime sysctl rather than
> boot-time or cgroup mount option?

Good point : -) I don't think cgroup mount options are a good idea,
since this would mean that we can have a set of cgroups self-policing
their toptier usage, while another cgroup allocates memory unrestricted.
This would punish the self-policing cgroup and we would lose the benefit
of having a bottomline performance guarantee.

> I will likely have more questions but these are immediate ones after
> reading the cover. Please note I haven't really looked at the
> implementation yet. I really want to understand usecases and interface
> first.

That sounds good to me, thank you again for reviewing this work!
I hope you have a great day : -)
Joshua

[1] https://lore.kernel.org/linux-mm/c8bc2dce-d4ec-c16e-8df4-2624c48cfc06@google.com/



* Re: [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware
  2026-02-24 16:13   ` Joshua Hahn
@ 2026-02-24 18:49     ` Gregory Price
  2026-02-24 20:03       ` Kaiyang Zhao
  0 siblings, 1 reply; 11+ messages in thread
From: Gregory Price @ 2026-02-24 18:49 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Michal Hocko, Johannes Weiner, Kaiyang Zhao, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Roman Gushchin, Shakeel Butt, Muchun Song, Waiman Long,
	Chen Ridong, Tejun Heo, Michal Koutny, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Qi Zheng, linux-mm, cgroups, linux-kernel,
	kernel-team

On Tue, Feb 24, 2026 at 08:13:56AM -0800, Joshua Hahn wrote:
> ... snip ...

Just injecting a few points here
(disclosure: I have been in the development loop for this feature)

> 
> > Otherwise promotions would make sure that we have the most active
> > memory in the top tier.
> 

Yes and no.  This assumes that promoting the hottest memory to the top
tier is always what you want.

Barring a minimum Quality of Service mechanism (as Joshua explains),
this reduces the usefulness of a secondary tier of memory.

Services will just prefer not to be deployed to these kinds of
machines because the performance variance is too high.

> 
> > Is this typical in real life configurations?
> 
> I would say so. I think that the two examples above are realistic
> scenarios that cloud providers and hyperscalers might face on tiered systems.
> 

The answer is unequivocally yes.

Lacking tier-awareness is actually a huge blocker for deploying mixed
workloads on large, dense memory systems with multiple tiers (2+).

Technically we're already at 4-ish tiers: DDR, CXL, ZSWAP, SWAP.

We have zswap/swap controls in cgroups already, we just lack that same
control for coherent memory tiers.  This tries to use the existing knobs
(max/high/low/min) to do what they already do - just proportionally.
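To make "proportionally" concrete, here is a userspace sketch of the
scaling the cover letter describes (memory.low scaled by the
toptier:total ratio). All numbers are made up for illustration; the
actual scaling lives in the kernel, and the real interface is whatever
the sysfs patch in this series defines.

```shell
# Illustrative only: derive a tier-aware memory.low from the
# toptier:total capacity ratio, as the cover letter describes.

MEMORY_LOW=$((16 * 1024 * 1024 * 1024))  # admin sets memory.low = 16G
TOPTIER_GB=64                            # host has 64G of DDR (toptier)
TOTAL_GB=256                             # 256G total (DDR + CXL)

# Scale the protection by toptier/total: 16G * 64/256 = 4G of toptier
# memory would be protected for this cgroup.
TOPTIER_LOW=$((MEMORY_LOW * TOPTIER_GB / TOTAL_GB))
echo "$TOPTIER_LOW"
```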

~Gregory



* Re: [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware
  2026-02-24 18:49     ` Gregory Price
@ 2026-02-24 20:03       ` Kaiyang Zhao
  0 siblings, 0 replies; 11+ messages in thread
From: Kaiyang Zhao @ 2026-02-24 20:03 UTC (permalink / raw)
  To: Gregory Price
  Cc: Joshua Hahn, Michal Hocko, Johannes Weiner, Andrew Morton,
	David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Roman Gushchin, Shakeel Butt, Muchun Song, Waiman Long,
	Chen Ridong, Tejun Heo, Michal Koutny, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Qi Zheng, linux-mm, cgroups, linux-kernel,
	kernel-team

On Tue, Feb 24, 2026 at 01:49:21PM -0500, Gregory Price wrote:
> 
> > 
> > > Is this typical in real life configurations?
> > 
> > I would say so. I think that the two examples above are realistic
> > scenarios that cloud providers and hyperscalers might face on tiered systems.
> > 
> 
> The answer is unequivocally yes.
> 
> Lacking tier-awareness is actually a huge blocker for deploying mixed
> workloads on large, dense memory systems with multiple tiers (2+).

Hello! I'm the author of the RFC in 2024. Just want to add that we've
recently released a preprint paper on arXiv [1] that includes case
studies with a few of Meta's production workloads using a prototype
version of the patches.

The results confirmed that co-located workloads can have working set
sizes exceeding the limited top-tier memory capacity, given today's
server memory shapes and workload stacking settings, causing contention
for top-tier memory. Workloads see significant variation in tail
latency and throughput depending on the share of top-tier memory
they get, which this patch set will alleviate.

Best,
Kaiyang

[1] https://arxiv.org/pdf/2602.08800



end of thread, other threads:[~2026-02-24 20:04 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-23 22:38 [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware Joshua Hahn
2026-02-23 22:38 ` [RFC PATCH 1/6] mm/memory-tiers: Introduce tier-aware memcg limit sysfs Joshua Hahn
2026-02-23 22:38 ` [RFC PATCH 2/6] mm/page_counter: Introduce tiered memory awareness to page_counter Joshua Hahn
2026-02-23 22:38 ` [RFC PATCH 3/6] mm/memory-tiers, memcontrol: Introduce toptier capacity updates Joshua Hahn
2026-02-23 22:38 ` [RFC PATCH 4/6] mm/memcontrol: Charge and uncharge from toptier Joshua Hahn
2026-02-23 22:38 ` [RFC PATCH 5/6] mm/memcontrol, page_counter: Make memory.low tier-aware Joshua Hahn
2026-02-23 22:38 ` [RFC PATCH 6/6] mm/memcontrol: Make memory.high tier-aware Joshua Hahn
2026-02-24 11:27 ` [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware Michal Hocko
2026-02-24 16:13   ` Joshua Hahn
2026-02-24 18:49     ` Gregory Price
2026-02-24 20:03       ` Kaiyang Zhao
