* [RFC PATCH -next 0/2] mm/mglru: remove memcg lru
@ 2025-12-04 12:31 Chen Ridong
2025-12-04 12:31 ` [RFC PATCH -next 1/2] mm/mglru: use mem_cgroup_iter for global reclaim Chen Ridong
2025-12-04 12:31 ` [RFC PATCH -next 2/2] mm/mglru: remove memcg lru Chen Ridong
0 siblings, 2 replies; 10+ messages in thread
From: Chen Ridong @ 2025-12-04 12:31 UTC (permalink / raw)
To: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet, hannes,
roman.gushchin, shakeel.butt, muchun.song, yuzhao, zhengqi.arch
Cc: linux-mm, linux-doc, linux-kernel, cgroups, lujialin4, chenridong
From: Chen Ridong <chenridong@huawei.com>
The memcg LRU was introduced for global reclaim to improve scalability,
but its implementation has grown complex. Moreover, it can cause
performance regressions when creating a large number of memory cgroups [1].
This series implements mem_cgroup_iter with a reclaim cookie in
shrink_many() for global reclaim, following the pattern already established
in shrink_node_memcgs(), an approach suggested by Johannes [1]. The new
approach provides good fairness across cgroups by preserving iteration
state between reclaim passes.
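For reviewers who want the shape of the change up front, below is a minimal
sketch of the cookie-based iteration, condensed from the shrink_many() rework
in patch 1. Memcg protection checks, rescheduling and the kswapd/full-walk
cases (which skip the cookie) are omitted, and shrink_many_sketch() is only an
illustrative name:

/*
 * Minimal sketch, condensed from the shrink_many() rework in patch 1.
 * The cookie lets mem_cgroup_iter() resume where the previous reclaim
 * pass on this node left off, which is what preserves fairness across
 * cgroups over successive passes.
 */
static void shrink_many_sketch(struct pglist_data *pgdat, struct scan_control *sc)
{
        struct mem_cgroup *target = sc->target_mem_cgroup;
        struct mem_cgroup_reclaim_cookie reclaim = { .pgdat = pgdat };
        struct mem_cgroup *memcg;

        memcg = mem_cgroup_iter(target, NULL, &reclaim);
        while (memcg) {
                struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);

                /* reclaim folios and slab for this memcg on this node */
                shrink_one(lruvec, sc);

                /*
                 * Bail out once the request is satisfied; the cookie
                 * remembers where the next pass should continue.
                 */
                if (sc->nr_reclaimed >= sc->nr_to_reclaim) {
                        mem_cgroup_iter_break(target, memcg);
                        break;
                }

                memcg = mem_cgroup_iter(target, memcg, &reclaim);
        }
}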
Testing was performed using the original stress test from Yu Zhao [2] on a
1 TB, 4-node NUMA system. The results show:

                                  before     after
stddev(pgsteal) / mean(pgsteal)    91.2%     75.7%
sum(pgsteal) / sum(requested)     216.4%    230.5%
The new implementation reduces the standard deviation relative to the mean
by 15.5 percentage points, indicating improved fairness in memory reclaim
distribution. The total pages reclaimed increased from 85,086,871 to
90,633,890 (6.5% increase), resulting in a higher ratio of actual to
requested reclaim.
To simplify review:
- Patch 1 uses mem_cgroup_iter with reclaim cookie in shrink_many()
- Patch 2 removes the now-unused memcg LRU code
[1] https://lore.kernel.org/r/20251126171513.GC135004@cmpxchg.org
[2] https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com
Chen Ridong (2):
mm/mglru: use mem_cgroup_iter for global reclaim
mm/mglru: remove memcg lru
Documentation/mm/multigen_lru.rst | 30 ----
include/linux/mmzone.h | 89 ----------
mm/memcontrol-v1.c | 6 -
mm/memcontrol.c | 4 -
mm/mm_init.c | 1 -
mm/vmscan.c | 270 ++++--------------------------
6 files changed, 37 insertions(+), 363 deletions(-)
--
2.34.1
* [RFC PATCH -next 1/2] mm/mglru: use mem_cgroup_iter for global reclaim
2025-12-04 12:31 [RFC PATCH -next 0/2] mm/mglru: remove memcg lru Chen Ridong
@ 2025-12-04 12:31 ` Chen Ridong
2025-12-04 18:28 ` Johannes Weiner
2025-12-04 22:29 ` Shakeel Butt
2025-12-04 12:31 ` [RFC PATCH -next 2/2] mm/mglru: remove memcg lru Chen Ridong
1 sibling, 2 replies; 10+ messages in thread
From: Chen Ridong @ 2025-12-04 12:31 UTC (permalink / raw)
To: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet, hannes,
roman.gushchin, shakeel.butt, muchun.song, yuzhao, zhengqi.arch
Cc: linux-mm, linux-doc, linux-kernel, cgroups, lujialin4, chenridong
From: Chen Ridong <chenridong@huawei.com>
The memcg LRU was originally introduced for global reclaim to enhance
scalability. However, its implementation complexity has led to performance
regressions when dealing with a large number of memory cgroups [1].
As suggested by Johannes [1], this patch adopts mem_cgroup_iter with
cookie-based iteration for global reclaim, aligning with the approach
already used in shrink_node_memcgs. This simplification removes the
dedicated memcg LRU tracking while maintaining the core functionality.
A stress test was performed based on Yu Zhao's methodology [2] on a
1 TB, 4-node NUMA system. The results are summarized below:

                                 memcg LRU    memcg iter
stddev(pgsteal) / mean(pgsteal)      91.2%         75.7%
sum(pgsteal) / sum(requested)       216.4%        230.5%

The new implementation demonstrates a significant improvement in
fairness, reducing the standard deviation relative to the mean by
15.5 percentage points, while total pages reclaimed show a slight
increase in overscan (from 85,086,871 to 90,633,890, a 6.5% increase).
The primary benefits of this change are:
1. Simplified codebase by removing custom memcg LRU infrastructure
2. Improved fairness in memory reclaim across multiple cgroups
3. Better performance when creating many memory cgroups
[1] https://lore.kernel.org/r/20251126171513.GC135004@cmpxchg.org
[2] https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
mm/vmscan.c | 117 ++++++++++++++++------------------------------------
1 file changed, 36 insertions(+), 81 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fddd168a9737..70b0e7e5393c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4895,27 +4895,14 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
return nr_to_scan < 0;
}
-static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
+static void shrink_one(struct lruvec *lruvec, struct scan_control *sc)
{
- bool success;
unsigned long scanned = sc->nr_scanned;
unsigned long reclaimed = sc->nr_reclaimed;
- struct mem_cgroup *memcg = lruvec_memcg(lruvec);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
- /* lru_gen_age_node() called mem_cgroup_calculate_protection() */
- if (mem_cgroup_below_min(NULL, memcg))
- return MEMCG_LRU_YOUNG;
-
- if (mem_cgroup_below_low(NULL, memcg)) {
- /* see the comment on MEMCG_NR_GENS */
- if (READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_TAIL)
- return MEMCG_LRU_TAIL;
-
- memcg_memory_event(memcg, MEMCG_LOW);
- }
-
- success = try_to_shrink_lruvec(lruvec, sc);
+ try_to_shrink_lruvec(lruvec, sc);
shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
@@ -4924,86 +4911,55 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
sc->nr_reclaimed - reclaimed);
flush_reclaim_state(sc);
-
- if (success && mem_cgroup_online(memcg))
- return MEMCG_LRU_YOUNG;
-
- if (!success && lruvec_is_sizable(lruvec, sc))
- return 0;
-
- /* one retry if offlined or too small */
- return READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_TAIL ?
- MEMCG_LRU_TAIL : MEMCG_LRU_YOUNG;
}
static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
{
- int op;
- int gen;
- int bin;
- int first_bin;
- struct lruvec *lruvec;
- struct lru_gen_folio *lrugen;
+ struct mem_cgroup *target = sc->target_mem_cgroup;
+ struct mem_cgroup_reclaim_cookie reclaim = {
+ .pgdat = pgdat,
+ };
+ struct mem_cgroup_reclaim_cookie *cookie = &reclaim;
struct mem_cgroup *memcg;
- struct hlist_nulls_node *pos;
- gen = get_memcg_gen(READ_ONCE(pgdat->memcg_lru.seq));
- bin = first_bin = get_random_u32_below(MEMCG_NR_BINS);
-restart:
- op = 0;
- memcg = NULL;
-
- rcu_read_lock();
+ if (current_is_kswapd() || sc->memcg_full_walk)
+ cookie = NULL;
- hlist_nulls_for_each_entry_rcu(lrugen, pos, &pgdat->memcg_lru.fifo[gen][bin], list) {
- if (op) {
- lru_gen_rotate_memcg(lruvec, op);
- op = 0;
- }
+ memcg = mem_cgroup_iter(target, NULL, cookie);
+ while (memcg) {
+ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
- mem_cgroup_put(memcg);
- memcg = NULL;
+ cond_resched();
- if (gen != READ_ONCE(lrugen->gen))
- continue;
+ mem_cgroup_calculate_protection(target, memcg);
- lruvec = container_of(lrugen, struct lruvec, lrugen);
- memcg = lruvec_memcg(lruvec);
+ if (mem_cgroup_below_min(target, memcg))
+ goto next;
- if (!mem_cgroup_tryget(memcg)) {
- lru_gen_release_memcg(memcg);
- memcg = NULL;
- continue;
+ if (mem_cgroup_below_low(target, memcg)) {
+ if (!sc->memcg_low_reclaim) {
+ sc->memcg_low_skipped = 1;
+ goto next;
+ }
+ memcg_memory_event(memcg, MEMCG_LOW);
}
- rcu_read_unlock();
+ shrink_one(lruvec, sc);
- op = shrink_one(lruvec, sc);
-
- rcu_read_lock();
-
- if (should_abort_scan(lruvec, sc))
+ if (should_abort_scan(lruvec, sc)) {
+ if (cookie)
+ mem_cgroup_iter_break(target, memcg);
break;
- }
-
- rcu_read_unlock();
-
- if (op)
- lru_gen_rotate_memcg(lruvec, op);
-
- mem_cgroup_put(memcg);
-
- if (!is_a_nulls(pos))
- return;
+ }
- /* restart if raced with lru_gen_rotate_memcg() */
- if (gen != get_nulls_value(pos))
- goto restart;
+next:
+ if (cookie && sc->nr_reclaimed >= sc->nr_to_reclaim) {
+ mem_cgroup_iter_break(target, memcg);
+ break;
+ }
- /* try the rest of the bins of the current generation */
- bin = get_memcg_bin(bin + 1);
- if (bin != first_bin)
- goto restart;
+ memcg = mem_cgroup_iter(target, memcg, cookie);
+ }
}
static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
@@ -5019,8 +4975,7 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
set_mm_walk(NULL, sc->proactive);
- if (try_to_shrink_lruvec(lruvec, sc))
- lru_gen_rotate_memcg(lruvec, MEMCG_LRU_YOUNG);
+ try_to_shrink_lruvec(lruvec, sc);
clear_mm_walk();
--
2.34.1
* [RFC PATCH -next 2/2] mm/mglru: remove memcg lru
2025-12-04 12:31 [RFC PATCH -next 0/2] mm/mglru: remove memcg lru Chen Ridong
2025-12-04 12:31 ` [RFC PATCH -next 1/2] mm/mglru: use mem_cgroup_iter for global reclaim Chen Ridong
@ 2025-12-04 12:31 ` Chen Ridong
2025-12-04 18:34 ` Johannes Weiner
1 sibling, 1 reply; 10+ messages in thread
From: Chen Ridong @ 2025-12-04 12:31 UTC (permalink / raw)
To: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet, hannes,
roman.gushchin, shakeel.butt, muchun.song, yuzhao, zhengqi.arch
Cc: linux-mm, linux-doc, linux-kernel, cgroups, lujialin4, chenridong
From: Chen Ridong <chenridong@huawei.com>
Now that the previous patch has switched global reclaim to use
mem_cgroup_iter, the specialized memcg LRU infrastructure is no longer
needed. This patch removes all related code.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
Documentation/mm/multigen_lru.rst | 30 ------
include/linux/mmzone.h | 89 -----------------
mm/memcontrol-v1.c | 6 --
mm/memcontrol.c | 4 -
mm/mm_init.c | 1 -
mm/vmscan.c | 153 +-----------------------------
6 files changed, 1 insertion(+), 282 deletions(-)
diff --git a/Documentation/mm/multigen_lru.rst b/Documentation/mm/multigen_lru.rst
index 52ed5092022f..bf8547e2f592 100644
--- a/Documentation/mm/multigen_lru.rst
+++ b/Documentation/mm/multigen_lru.rst
@@ -220,36 +220,6 @@ time domain because a CPU can scan pages at different rates under
varying memory pressure. It calculates a moving average for each new
generation to avoid being permanently locked in a suboptimal state.
-Memcg LRU
----------
-An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
-since each node and memcg combination has an LRU of folios (see
-``mem_cgroup_lruvec()``). Its goal is to improve the scalability of
-global reclaim, which is critical to system-wide memory overcommit in
-data centers. Note that memcg LRU only applies to global reclaim.
-
-The basic structure of an memcg LRU can be understood by an analogy to
-the active/inactive LRU (of folios):
-
-1. It has the young and the old (generations), i.e., the counterparts
- to the active and the inactive;
-2. The increment of ``max_seq`` triggers promotion, i.e., the
- counterpart to activation;
-3. Other events trigger similar operations, e.g., offlining an memcg
- triggers demotion, i.e., the counterpart to deactivation.
-
-In terms of global reclaim, it has two distinct features:
-
-1. Sharding, which allows each thread to start at a random memcg (in
- the old generation) and improves parallelism;
-2. Eventual fairness, which allows direct reclaim to bail out at will
- and reduces latency without affecting fairness over some time.
-
-In terms of traversing memcgs during global reclaim, it improves the
-best-case complexity from O(n) to O(1) and does not affect the
-worst-case complexity O(n). Therefore, on average, it has a sublinear
-complexity.
-
Summary
-------
The multi-gen LRU (of folios) can be disassembled into the following
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 75ef7c9f9307..49952301ff3b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -509,12 +509,6 @@ struct lru_gen_folio {
atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
/* whether the multi-gen LRU is enabled */
bool enabled;
- /* the memcg generation this lru_gen_folio belongs to */
- u8 gen;
- /* the list segment this lru_gen_folio belongs to */
- u8 seg;
- /* per-node lru_gen_folio list for global reclaim */
- struct hlist_nulls_node list;
};
enum {
@@ -558,79 +552,14 @@ struct lru_gen_mm_walk {
bool force_scan;
};
-/*
- * For each node, memcgs are divided into two generations: the old and the
- * young. For each generation, memcgs are randomly sharded into multiple bins
- * to improve scalability. For each bin, the hlist_nulls is virtually divided
- * into three segments: the head, the tail and the default.
- *
- * An onlining memcg is added to the tail of a random bin in the old generation.
- * The eviction starts at the head of a random bin in the old generation. The
- * per-node memcg generation counter, whose reminder (mod MEMCG_NR_GENS) indexes
- * the old generation, is incremented when all its bins become empty.
- *
- * There are four operations:
- * 1. MEMCG_LRU_HEAD, which moves a memcg to the head of a random bin in its
- * current generation (old or young) and updates its "seg" to "head";
- * 2. MEMCG_LRU_TAIL, which moves a memcg to the tail of a random bin in its
- * current generation (old or young) and updates its "seg" to "tail";
- * 3. MEMCG_LRU_OLD, which moves a memcg to the head of a random bin in the old
- * generation, updates its "gen" to "old" and resets its "seg" to "default";
- * 4. MEMCG_LRU_YOUNG, which moves a memcg to the tail of a random bin in the
- * young generation, updates its "gen" to "young" and resets its "seg" to
- * "default".
- *
- * The events that trigger the above operations are:
- * 1. Exceeding the soft limit, which triggers MEMCG_LRU_HEAD;
- * 2. The first attempt to reclaim a memcg below low, which triggers
- * MEMCG_LRU_TAIL;
- * 3. The first attempt to reclaim a memcg offlined or below reclaimable size
- * threshold, which triggers MEMCG_LRU_TAIL;
- * 4. The second attempt to reclaim a memcg offlined or below reclaimable size
- * threshold, which triggers MEMCG_LRU_YOUNG;
- * 5. Attempting to reclaim a memcg below min, which triggers MEMCG_LRU_YOUNG;
- * 6. Finishing the aging on the eviction path, which triggers MEMCG_LRU_YOUNG;
- * 7. Offlining a memcg, which triggers MEMCG_LRU_OLD.
- *
- * Notes:
- * 1. Memcg LRU only applies to global reclaim, and the round-robin incrementing
- * of their max_seq counters ensures the eventual fairness to all eligible
- * memcgs. For memcg reclaim, it still relies on mem_cgroup_iter().
- * 2. There are only two valid generations: old (seq) and young (seq+1).
- * MEMCG_NR_GENS is set to three so that when reading the generation counter
- * locklessly, a stale value (seq-1) does not wraparound to young.
- */
-#define MEMCG_NR_GENS 3
-#define MEMCG_NR_BINS 8
-
-struct lru_gen_memcg {
- /* the per-node memcg generation counter */
- unsigned long seq;
- /* each memcg has one lru_gen_folio per node */
- unsigned long nr_memcgs[MEMCG_NR_GENS];
- /* per-node lru_gen_folio list for global reclaim */
- struct hlist_nulls_head fifo[MEMCG_NR_GENS][MEMCG_NR_BINS];
- /* protects the above */
- spinlock_t lock;
-};
-
-void lru_gen_init_pgdat(struct pglist_data *pgdat);
void lru_gen_init_lruvec(struct lruvec *lruvec);
bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
void lru_gen_init_memcg(struct mem_cgroup *memcg);
void lru_gen_exit_memcg(struct mem_cgroup *memcg);
-void lru_gen_online_memcg(struct mem_cgroup *memcg);
-void lru_gen_offline_memcg(struct mem_cgroup *memcg);
-void lru_gen_release_memcg(struct mem_cgroup *memcg);
-void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid);
#else /* !CONFIG_LRU_GEN */
-static inline void lru_gen_init_pgdat(struct pglist_data *pgdat)
-{
-}
-
static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
{
}
@@ -648,22 +577,6 @@ static inline void lru_gen_exit_memcg(struct mem_cgroup *memcg)
{
}
-static inline void lru_gen_online_memcg(struct mem_cgroup *memcg)
-{
-}
-
-static inline void lru_gen_offline_memcg(struct mem_cgroup *memcg)
-{
-}
-
-static inline void lru_gen_release_memcg(struct mem_cgroup *memcg)
-{
-}
-
-static inline void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid)
-{
-}
-
#endif /* CONFIG_LRU_GEN */
struct lruvec {
@@ -1503,8 +1416,6 @@ typedef struct pglist_data {
#ifdef CONFIG_LRU_GEN
/* kswap mm walk data */
struct lru_gen_mm_walk mm_walk;
- /* lru_gen_folio list */
- struct lru_gen_memcg memcg_lru;
#endif
CACHELINE_PADDING(_pad2_);
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 6eed14bff742..8f41e72ae7f0 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -182,12 +182,6 @@ static void memcg1_update_tree(struct mem_cgroup *memcg, int nid)
struct mem_cgroup_per_node *mz;
struct mem_cgroup_tree_per_node *mctz;
- if (lru_gen_enabled()) {
- if (soft_limit_excess(memcg))
- lru_gen_soft_reclaim(memcg, nid);
- return;
- }
-
mctz = soft_limit_tree.rb_tree_per_node[nid];
if (!mctz)
return;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index be810c1fbfc3..ab3ebecb5ec7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3874,8 +3874,6 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
if (unlikely(mem_cgroup_is_root(memcg)) && !mem_cgroup_disabled())
queue_delayed_work(system_unbound_wq, &stats_flush_dwork,
FLUSH_TIME);
- lru_gen_online_memcg(memcg);
-
/* Online state pins memcg ID, memcg ID pins CSS */
refcount_set(&memcg->id.ref, 1);
css_get(css);
@@ -3915,7 +3913,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
reparent_deferred_split_queue(memcg);
reparent_shrinker_deferred(memcg);
wb_memcg_offline(memcg);
- lru_gen_offline_memcg(memcg);
drain_all_stock(memcg);
@@ -3927,7 +3924,6 @@ static void mem_cgroup_css_released(struct cgroup_subsys_state *css)
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
invalidate_reclaim_iterators(memcg);
- lru_gen_release_memcg(memcg);
}
static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index fc2a6f1e518f..6e5e1fe6ff31 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1745,7 +1745,6 @@ static void __init free_area_init_node(int nid)
pgdat_set_deferred_range(pgdat);
free_area_init_core(pgdat);
- lru_gen_init_pgdat(pgdat);
}
/* Any regular or high memory on that node ? */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 70b0e7e5393c..584f41eb4c14 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2698,9 +2698,6 @@ static bool should_clear_pmd_young(void)
#define for_each_evictable_type(type, swappiness) \
for ((type) = min_type(swappiness); (type) <= max_type(swappiness); (type)++)
-#define get_memcg_gen(seq) ((seq) % MEMCG_NR_GENS)
-#define get_memcg_bin(bin) ((bin) % MEMCG_NR_BINS)
-
static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
{
struct pglist_data *pgdat = NODE_DATA(nid);
@@ -4287,140 +4284,6 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
return true;
}
-/******************************************************************************
- * memcg LRU
- ******************************************************************************/
-
-/* see the comment on MEMCG_NR_GENS */
-enum {
- MEMCG_LRU_NOP,
- MEMCG_LRU_HEAD,
- MEMCG_LRU_TAIL,
- MEMCG_LRU_OLD,
- MEMCG_LRU_YOUNG,
-};
-
-static void lru_gen_rotate_memcg(struct lruvec *lruvec, int op)
-{
- int seg;
- int old, new;
- unsigned long flags;
- int bin = get_random_u32_below(MEMCG_NR_BINS);
- struct pglist_data *pgdat = lruvec_pgdat(lruvec);
-
- spin_lock_irqsave(&pgdat->memcg_lru.lock, flags);
-
- VM_WARN_ON_ONCE(hlist_nulls_unhashed(&lruvec->lrugen.list));
-
- seg = 0;
- new = old = lruvec->lrugen.gen;
-
- /* see the comment on MEMCG_NR_GENS */
- if (op == MEMCG_LRU_HEAD)
- seg = MEMCG_LRU_HEAD;
- else if (op == MEMCG_LRU_TAIL)
- seg = MEMCG_LRU_TAIL;
- else if (op == MEMCG_LRU_OLD)
- new = get_memcg_gen(pgdat->memcg_lru.seq);
- else if (op == MEMCG_LRU_YOUNG)
- new = get_memcg_gen(pgdat->memcg_lru.seq + 1);
- else
- VM_WARN_ON_ONCE(true);
-
- WRITE_ONCE(lruvec->lrugen.seg, seg);
- WRITE_ONCE(lruvec->lrugen.gen, new);
-
- hlist_nulls_del_rcu(&lruvec->lrugen.list);
-
- if (op == MEMCG_LRU_HEAD || op == MEMCG_LRU_OLD)
- hlist_nulls_add_head_rcu(&lruvec->lrugen.list, &pgdat->memcg_lru.fifo[new][bin]);
- else
- hlist_nulls_add_tail_rcu(&lruvec->lrugen.list, &pgdat->memcg_lru.fifo[new][bin]);
-
- pgdat->memcg_lru.nr_memcgs[old]--;
- pgdat->memcg_lru.nr_memcgs[new]++;
-
- if (!pgdat->memcg_lru.nr_memcgs[old] && old == get_memcg_gen(pgdat->memcg_lru.seq))
- WRITE_ONCE(pgdat->memcg_lru.seq, pgdat->memcg_lru.seq + 1);
-
- spin_unlock_irqrestore(&pgdat->memcg_lru.lock, flags);
-}
-
-#ifdef CONFIG_MEMCG
-
-void lru_gen_online_memcg(struct mem_cgroup *memcg)
-{
- int gen;
- int nid;
- int bin = get_random_u32_below(MEMCG_NR_BINS);
-
- for_each_node(nid) {
- struct pglist_data *pgdat = NODE_DATA(nid);
- struct lruvec *lruvec = get_lruvec(memcg, nid);
-
- spin_lock_irq(&pgdat->memcg_lru.lock);
-
- VM_WARN_ON_ONCE(!hlist_nulls_unhashed(&lruvec->lrugen.list));
-
- gen = get_memcg_gen(pgdat->memcg_lru.seq);
-
- lruvec->lrugen.gen = gen;
-
- hlist_nulls_add_tail_rcu(&lruvec->lrugen.list, &pgdat->memcg_lru.fifo[gen][bin]);
- pgdat->memcg_lru.nr_memcgs[gen]++;
-
- spin_unlock_irq(&pgdat->memcg_lru.lock);
- }
-}
-
-void lru_gen_offline_memcg(struct mem_cgroup *memcg)
-{
- int nid;
-
- for_each_node(nid) {
- struct lruvec *lruvec = get_lruvec(memcg, nid);
-
- lru_gen_rotate_memcg(lruvec, MEMCG_LRU_OLD);
- }
-}
-
-void lru_gen_release_memcg(struct mem_cgroup *memcg)
-{
- int gen;
- int nid;
-
- for_each_node(nid) {
- struct pglist_data *pgdat = NODE_DATA(nid);
- struct lruvec *lruvec = get_lruvec(memcg, nid);
-
- spin_lock_irq(&pgdat->memcg_lru.lock);
-
- if (hlist_nulls_unhashed(&lruvec->lrugen.list))
- goto unlock;
-
- gen = lruvec->lrugen.gen;
-
- hlist_nulls_del_init_rcu(&lruvec->lrugen.list);
- pgdat->memcg_lru.nr_memcgs[gen]--;
-
- if (!pgdat->memcg_lru.nr_memcgs[gen] && gen == get_memcg_gen(pgdat->memcg_lru.seq))
- WRITE_ONCE(pgdat->memcg_lru.seq, pgdat->memcg_lru.seq + 1);
-unlock:
- spin_unlock_irq(&pgdat->memcg_lru.lock);
- }
-}
-
-void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid)
-{
- struct lruvec *lruvec = get_lruvec(memcg, nid);
-
- /* see the comment on MEMCG_NR_GENS */
- if (READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_HEAD)
- lru_gen_rotate_memcg(lruvec, MEMCG_LRU_HEAD);
-}
-
-#endif /* CONFIG_MEMCG */
-
/******************************************************************************
* the eviction
******************************************************************************/
@@ -5613,18 +5476,6 @@ static const struct file_operations lru_gen_ro_fops = {
* initialization
******************************************************************************/
-void lru_gen_init_pgdat(struct pglist_data *pgdat)
-{
- int i, j;
-
- spin_lock_init(&pgdat->memcg_lru.lock);
-
- for (i = 0; i < MEMCG_NR_GENS; i++) {
- for (j = 0; j < MEMCG_NR_BINS; j++)
- INIT_HLIST_NULLS_HEAD(&pgdat->memcg_lru.fifo[i][j], i);
- }
-}
-
void lru_gen_init_lruvec(struct lruvec *lruvec)
{
int i;
@@ -5671,9 +5522,7 @@ void lru_gen_exit_memcg(struct mem_cgroup *memcg)
struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
- sizeof(lruvec->lrugen.nr_pages)));
-
- lruvec->lrugen.list.next = LIST_POISON1;
+ sizeof(lruvec->lrugen.nr_pages)));
if (!mm_state)
continue;
--
2.34.1
* Re: [RFC PATCH -next 1/2] mm/mglru: use mem_cgroup_iter for global reclaim
2025-12-04 12:31 ` [RFC PATCH -next 1/2] mm/mglru: use mem_cgroup_iter for global reclaim Chen Ridong
@ 2025-12-04 18:28 ` Johannes Weiner
2025-12-04 22:29 ` Shakeel Butt
1 sibling, 0 replies; 10+ messages in thread
From: Johannes Weiner @ 2025-12-04 18:28 UTC (permalink / raw)
To: Chen Ridong
Cc: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet,
roman.gushchin, shakeel.butt, muchun.song, yuzhao, zhengqi.arch,
linux-mm, linux-doc, linux-kernel, cgroups, lujialin4,
chenridong
On Thu, Dec 04, 2025 at 12:31:23PM +0000, Chen Ridong wrote:
> From: Chen Ridong <chenridong@huawei.com>
>
> The memcg LRU was originally introduced for global reclaim to enhance
> scalability. However, its implementation complexity has led to performance
> regressions when dealing with a large number of memory cgroups [1].
>
> As suggested by Johannes [1], this patch adopts mem_cgroup_iter with
> cookie-based iteration for global reclaim, aligning with the approach
> already used in shrink_node_memcgs. This simplification removes the
> dedicated memcg LRU tracking while maintaining the core functionality.
>
> A stress test was performed based on Yu Zhao's methodology [2] on a
> 1 TB, 4-node NUMA system. The results are summarized below:
>
> memcg LRU memcg iter
> stddev(pgsteal) / mean(pgsteal) 91.2% 75.7%
> sum(pgsteal) / sum(requested) 216.4% 230.5%
>
> The new implementation demonstrates a significant improvement in
> fairness, reducing the standard deviation relative to the mean by
> 15.5 percentage points, while total pages reclaimed show a slight
> increase in overscan (from 85,086,871 to 90,633,890, a 6.5% increase).
>
> The primary benefits of this change are:
> 1. Simplified codebase by removing custom memcg LRU infrastructure
> 2. Improved fairness in memory reclaim across multiple cgroups
> 3. Better performance when creating many memory cgroups
>
> [1] https://lore.kernel.org/r/20251126171513.GC135004@cmpxchg.org
> [2] https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
The diff and the test results look good to me. Comparing the resulting
shrink_many() with shrink_node_memcgs(), this also looks like a great
step towards maintainability and unification.
Thanks!
* Re: [RFC PATCH -next 2/2] mm/mglru: remove memcg lru
2025-12-04 12:31 ` [RFC PATCH -next 2/2] mm/mglru: remove memcg lru Chen Ridong
@ 2025-12-04 18:34 ` Johannes Weiner
2025-12-05 2:57 ` [RFC PATCH -next 1/2] mm/mglru: use mem_cgroup_iter for global reclaim zhongjinji
0 siblings, 1 reply; 10+ messages in thread
From: Johannes Weiner @ 2025-12-04 18:34 UTC (permalink / raw)
To: Chen Ridong
Cc: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet,
roman.gushchin, shakeel.butt, muchun.song, yuzhao, zhengqi.arch,
linux-mm, linux-doc, linux-kernel, cgroups, lujialin4,
chenridong
On Thu, Dec 04, 2025 at 12:31:24PM +0000, Chen Ridong wrote:
> From: Chen Ridong <chenridong@huawei.com>
>
> Now that the previous patch has switched global reclaim to use
> mem_cgroup_iter, the specialized memcg LRU infrastructure is no longer
> needed. This patch removes all related code.
>
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
Looks good to me!
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
> Documentation/mm/multigen_lru.rst | 30 ------
> include/linux/mmzone.h | 89 -----------------
> mm/memcontrol-v1.c | 6 --
> mm/memcontrol.c | 4 -
> mm/mm_init.c | 1 -
> mm/vmscan.c | 153 +-----------------------------
> 6 files changed, 1 insertion(+), 282 deletions(-)
* Re: [RFC PATCH -next 1/2] mm/mglru: use mem_cgroup_iter for global reclaim
2025-12-04 12:31 ` [RFC PATCH -next 1/2] mm/mglru: use mem_cgroup_iter for global reclaim Chen Ridong
2025-12-04 18:28 ` Johannes Weiner
@ 2025-12-04 22:29 ` Shakeel Butt
2025-12-08 2:26 ` Chen Ridong
2025-12-08 3:10 ` Chen Ridong
1 sibling, 2 replies; 10+ messages in thread
From: Shakeel Butt @ 2025-12-04 22:29 UTC (permalink / raw)
To: Chen Ridong
Cc: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet, hannes,
roman.gushchin, muchun.song, yuzhao, zhengqi.arch, linux-mm,
linux-doc, linux-kernel, cgroups, lujialin4, chenridong
Hi Chen,
On Thu, Dec 04, 2025 at 12:31:23PM +0000, Chen Ridong wrote:
> From: Chen Ridong <chenridong@huawei.com>
>
> The memcg LRU was originally introduced for global reclaim to enhance
> scalability. However, its implementation complexity has led to performance
> regressions when dealing with a large number of memory cgroups [1].
>
> As suggested by Johannes [1], this patch adopts mem_cgroup_iter with
> cookie-based iteration for global reclaim, aligning with the approach
> already used in shrink_node_memcgs. This simplification removes the
> dedicated memcg LRU tracking while maintaining the core functionality.
>
> A stress test was performed based on Yu Zhao's methodology [2] on a
> 1 TB, 4-node NUMA system. The results are summarized below:
>
> memcg LRU memcg iter
> stddev(pgsteal) / mean(pgsteal) 91.2% 75.7%
> sum(pgsteal) / sum(requested) 216.4% 230.5%
>
> The new implementation demonstrates a significant improvement in
> fairness, reducing the standard deviation relative to the mean by
> 15.5 percentage points, while total pages reclaimed show a slight
> increase in overscan (from 85,086,871 to 90,633,890, a 6.5% increase).
>
> The primary benefits of this change are:
> 1. Simplified codebase by removing custom memcg LRU infrastructure
> 2. Improved fairness in memory reclaim across multiple cgroups
> 3. Better performance when creating many memory cgroups
>
> [1] https://lore.kernel.org/r/20251126171513.GC135004@cmpxchg.org
> [2] https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
Thanks a lot for this awesome work.
> ---
> mm/vmscan.c | 117 ++++++++++++++++------------------------------------
> 1 file changed, 36 insertions(+), 81 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index fddd168a9737..70b0e7e5393c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4895,27 +4895,14 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> return nr_to_scan < 0;
> }
>
> -static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
> +static void shrink_one(struct lruvec *lruvec, struct scan_control *sc)
> {
> - bool success;
> unsigned long scanned = sc->nr_scanned;
> unsigned long reclaimed = sc->nr_reclaimed;
> - struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>
> - /* lru_gen_age_node() called mem_cgroup_calculate_protection() */
> - if (mem_cgroup_below_min(NULL, memcg))
> - return MEMCG_LRU_YOUNG;
> -
> - if (mem_cgroup_below_low(NULL, memcg)) {
> - /* see the comment on MEMCG_NR_GENS */
> - if (READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_TAIL)
> - return MEMCG_LRU_TAIL;
> -
> - memcg_memory_event(memcg, MEMCG_LOW);
> - }
> -
> - success = try_to_shrink_lruvec(lruvec, sc);
> + try_to_shrink_lruvec(lruvec, sc);
>
> shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
>
> @@ -4924,86 +4911,55 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
> sc->nr_reclaimed - reclaimed);
>
> flush_reclaim_state(sc);
Unrelated to your patch, but why is this flush_reclaim_state() at a
different place from the non-MGLRU code path?
> -
> - if (success && mem_cgroup_online(memcg))
> - return MEMCG_LRU_YOUNG;
> -
> - if (!success && lruvec_is_sizable(lruvec, sc))
> - return 0;
> -
> - /* one retry if offlined or too small */
> - return READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_TAIL ?
> - MEMCG_LRU_TAIL : MEMCG_LRU_YOUNG;
> }
>
> static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
This function has become very similar to shrink_node_memcgs(), other
than shrink_one() vs shrink_lruvec(). Can you try to combine them and
see if the result doesn't look ugly? Otherwise the code looks good to me.
* Re: [RFC PATCH -next 1/2] mm/mglru: use mem_cgroup_iter for global reclaim
2025-12-04 18:34 ` Johannes Weiner
@ 2025-12-05 2:57 ` zhongjinji
2025-12-08 2:35 ` Chen Ridong
0 siblings, 1 reply; 10+ messages in thread
From: zhongjinji @ 2025-12-05 2:57 UTC (permalink / raw)
To: hannes
Cc: Liam.Howlett, akpm, axelrasmussen, cgroups, chenridong,
chenridong, corbet, david, linux-doc, linux-kernel, linux-mm,
lorenzo.stoakes, lujialin4, mhocko, muchun.song, roman.gushchin,
rppt, shakeel.butt, surenb, vbabka, weixugc, yuanchu, yuzhao,
zhengqi.arch
> From: Chen Ridong <chenridong@huawei.com>
>
> The memcg LRU was originally introduced for global reclaim to enhance
> scalability. However, its implementation complexity has led to performance
> regressions when dealing with a large number of memory cgroups [1].
>
> As suggested by Johannes [1], this patch adopts mem_cgroup_iter with
> cookie-based iteration for global reclaim, aligning with the approach
> already used in shrink_node_memcgs. This simplification removes the
> dedicated memcg LRU tracking while maintaining the core functionality.
>
> A stress test was performed based on Yu Zhao's methodology [2] on a
> 1 TB, 4-node NUMA system. The results are summarized below:
>
> memcg LRU memcg iter
> stddev(pgsteal) / mean(pgsteal) 91.2% 75.7%
> sum(pgsteal) / sum(requested) 216.4% 230.5%
Are there more data available? For example, the load of kswapd or the refault values.
I am concerned about these two data points because Yu Zhao's implementation controls
the fairness of aging through memcg gen (get_memcg_gen). This helps reduce excessive
aging for certain cgroups, which is beneficial for kswapd's power consumption.
At the same time, pages that age earlier can be considered colder pages (in the entire system),
so reclaiming them should also help with the refault values.
> The new implementation demonstrates a significant improvement in
> fairness, reducing the standard deviation relative to the mean by
> 15.5 percentage points, while total pages reclaimed show a slight
> increase in overscan (from 85,086,871 to 90,633,890, a 6.5% increase).
>
> The primary benefits of this change are:
> 1. Simplified codebase by removing custom memcg LRU infrastructure
> 2. Improved fairness in memory reclaim across multiple cgroups
> 3. Better performance when creating many memory cgroups
>
> [1] https://lore.kernel.org/r/20251126171513.GC135004@cmpxchg.org
> [2] https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
* Re: [RFC PATCH -next 1/2] mm/mglru: use mem_cgroup_iter for global reclaim
2025-12-04 22:29 ` Shakeel Butt
@ 2025-12-08 2:26 ` Chen Ridong
2025-12-08 3:10 ` Chen Ridong
1 sibling, 0 replies; 10+ messages in thread
From: Chen Ridong @ 2025-12-08 2:26 UTC (permalink / raw)
To: Shakeel Butt, Johannes Weiner
Cc: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet, hannes,
roman.gushchin, muchun.song, yuzhao, zhengqi.arch, linux-mm,
linux-doc, linux-kernel, cgroups, lujialin4, chenridong
On 2025/12/5 6:29, Shakeel Butt wrote:
> Hi Chen,
>
> On Thu, Dec 04, 2025 at 12:31:23PM +0000, Chen Ridong wrote:
>> From: Chen Ridong <chenridong@huawei.com>
>>
>> The memcg LRU was originally introduced for global reclaim to enhance
>> scalability. However, its implementation complexity has led to performance
>> regressions when dealing with a large number of memory cgroups [1].
>>
>> As suggested by Johannes [1], this patch adopts mem_cgroup_iter with
>> cookie-based iteration for global reclaim, aligning with the approach
>> already used in shrink_node_memcgs. This simplification removes the
>> dedicated memcg LRU tracking while maintaining the core functionality.
>>
>> A stress test was performed based on Yu Zhao's methodology [2] on a
>> 1 TB, 4-node NUMA system. The results are summarized below:
>>
>> memcg LRU memcg iter
>> stddev(pgsteal) / mean(pgsteal) 91.2% 75.7%
>> sum(pgsteal) / sum(requested) 216.4% 230.5%
>>
>> The new implementation demonstrates a significant improvement in
>> fairness, reducing the standard deviation relative to the mean by
>> 15.5 percentage points, while total pages reclaimed show a slight
>> increase in overscan (from 85,086,871 to 90,633,890, a 6.5% increase).
>>
>> The primary benefits of this change are:
>> 1. Simplified codebase by removing custom memcg LRU infrastructure
>> 2. Improved fairness in memory reclaim across multiple cgroups
>> 3. Better performance when creating many memory cgroups
>>
>> [1] https://lore.kernel.org/r/20251126171513.GC135004@cmpxchg.org
>> [2] https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com
>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
>
> Thanks a lot for this awesome work.
>
Hello Shakeel and Johannes,
I apologize for the incorrect results I provided earlier. I initially used an AI tool to process the
data (I admit that was lazy of me; please forget that). When I re-ran the test to re-extract the
refault data and processed it again, I found that the AI tool had given me the wrong output.
I have now processed the data manually in Excel, and the correct results are:
pgsteal:
                                  memcg LRU    memcg iter
stddev(pgsteal) / mean(pgsteal)     106.03%        93.20%
sum(pgsteal) / sum(requested)        98.10%        99.28%

workingset_refault_anon:
                                  memcg LRU    memcg iter
stddev(refault) / mean(refault)     193.97%       134.67%
sum(refault)                        1963229       2027567
I believe these final results are much better than the previous incorrect ones, especially since the
pgsteal ratio is now close to 100%, indicating we are not over-scanning. Additionally, refaults
increased by 64,238 (a 3.2% rise).
Let me know if you have any questions.
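In case it helps review, the summary statistics above can be re-derived from
the raw data below with a small standalone helper along the lines of the
sketch here (user-space C, not kernel code; the reclaim-stats.c name and the
output format are only for illustration). It computes the sample standard
deviation, i.e. the same quantity as the spreadsheet STDEV() used above:

/* reclaim-stats.c: hypothetical user-space helper, not kernel code.
 * Recomputes sum, mean and sample stddev from lines of the form
 * "pgsteal <N> workingset_refault_anon <M>" fed on stdin.
 * Build: gcc -O2 -o reclaim-stats reclaim-stats.c -lm
 * Use:   ./reclaim-stats < raw-data.txt
 */
#include <math.h>
#include <stdio.h>

#define MAX_SAMPLES 4096

int main(void)
{
        static double steal[MAX_SAMPLES], refault[MAX_SAMPLES];
        double ssum = 0.0, rsum = 0.0;
        int i, n = 0;

        while (n < MAX_SAMPLES &&
               scanf(" pgsteal %lf workingset_refault_anon %lf",
                     &steal[n], &refault[n]) == 2) {
                ssum += steal[n];
                rsum += refault[n];
                n++;
        }
        if (n < 2) {
                fprintf(stderr, "need at least two samples\n");
                return 1;
        }

        double smean = ssum / n, rmean = rsum / n;
        double svar = 0.0, rvar = 0.0;

        for (i = 0; i < n; i++) {
                svar += (steal[i] - smean) * (steal[i] - smean);
                rvar += (refault[i] - rmean) * (refault[i] - rmean);
        }
        /* sample stddev (divide by n - 1), matching spreadsheet STDEV() */
        double sstd = sqrt(svar / (n - 1));
        double rstd = sqrt(rvar / (n - 1));

        printf("samples: %d\n", n);
        printf("pgsteal: sum=%.0f mean=%.2f stddev=%.2f stddev/mean=%.2f%%\n",
               ssum, smean, sstd, 100.0 * sstd / smean);
        printf("refault: sum=%.0f mean=%.2f stddev=%.2f stddev/mean=%.2f%%\n",
               rsum, rmean, rstd, 100.0 * rstd / rmean);
        return 0;
}

Feeding one data set at a time on stdin should reproduce the SUM/AVERAGE/STDEV
lines shown below.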
----------------------------------------------------------------------
The original data memcg LRU:
pgsteal:
SUM: 38572704 AVERAGE: 301349.25 STDEV: 319518.5965
refault:
SUM: 1963229 AVERAGE: 15337.72656 STDEV: 29750.03391
pgsteal 655392 workingset_refault_anon 17131
pgsteal 657308 workingset_refault_anon 24841
pgsteal 103777 workingset_refault_anon 430
pgsteal 103134 workingset_refault_anon 884
pgsteal 964772 workingset_refault_anon 117159
pgsteal 103462 workingset_refault_anon 539
pgsteal 102878 workingset_refault_anon 25
pgsteal 707851 workingset_refault_anon 30634
pgsteal 103925 workingset_refault_anon 497
pgsteal 103913 workingset_refault_anon 953
pgsteal 103020 workingset_refault_anon 110
pgsteal 102871 workingset_refault_anon 607
pgsteal 697775 workingset_refault_anon 21529
pgsteal 102944 workingset_refault_anon 57
pgsteal 103090 workingset_refault_anon 819
pgsteal 102988 workingset_refault_anon 583
pgsteal 102987 workingset_refault_anon 108
pgsteal 103093 workingset_refault_anon 17
pgsteal 778016 workingset_refault_anon 79000
pgsteal 102920 workingset_refault_anon 14
pgsteal 655447 workingset_refault_anon 9069
pgsteal 102869 workingset_refault_anon 6
pgsteal 699920 workingset_refault_anon 34409
pgsteal 103127 workingset_refault_anon 223
pgsteal 102876 workingset_refault_anon 646
pgsteal 103642 workingset_refault_anon 439
pgsteal 102881 workingset_refault_anon 110
pgsteal 863202 workingset_refault_anon 77605
pgsteal 651786 workingset_refault_anon 8322
pgsteal 102981 workingset_refault_anon 51
pgsteal 103380 workingset_refault_anon 877
pgsteal 706377 workingset_refault_anon 27729
pgsteal 103436 workingset_refault_anon 682
pgsteal 103839 workingset_refault_anon 336
pgsteal 103012 workingset_refault_anon 23
pgsteal 103476 workingset_refault_anon 729
pgsteal 102867 workingset_refault_anon 12
pgsteal 102914 workingset_refault_anon 122
pgsteal 102886 workingset_refault_anon 627
pgsteal 103736 workingset_refault_anon 514
pgsteal 102879 workingset_refault_anon 618
pgsteal 102860 workingset_refault_anon 3
pgsteal 102877 workingset_refault_anon 27
pgsteal 103255 workingset_refault_anon 384
pgsteal 982183 workingset_refault_anon 85362
pgsteal 102947 workingset_refault_anon 158
pgsteal 102880 workingset_refault_anon 651
pgsteal 973764 workingset_refault_anon 81542
pgsteal 923711 workingset_refault_anon 94596
pgsteal 102938 workingset_refault_anon 660
pgsteal 888882 workingset_refault_anon 69549
pgsteal 102868 workingset_refault_anon 14
pgsteal 103130 workingset_refault_anon 166
pgsteal 103388 workingset_refault_anon 467
pgsteal 102965 workingset_refault_anon 197
pgsteal 964699 workingset_refault_anon 74903
pgsteal 103263 workingset_refault_anon 373
pgsteal 103614 workingset_refault_anon 781
pgsteal 962228 workingset_refault_anon 72108
pgsteal 672174 workingset_refault_anon 19739
pgsteal 102920 workingset_refault_anon 19
pgsteal 670248 workingset_refault_anon 18411
pgsteal 102877 workingset_refault_anon 581
pgsteal 103758 workingset_refault_anon 871
pgsteal 102874 workingset_refault_anon 609
pgsteal 103075 workingset_refault_anon 274
pgsteal 103550 workingset_refault_anon 102
pgsteal 755180 workingset_refault_anon 44303
pgsteal 951252 workingset_refault_anon 84566
pgsteal 929144 workingset_refault_anon 99081
pgsteal 103207 workingset_refault_anon 30
pgsteal 103292 workingset_refault_anon 427
pgsteal 103271 workingset_refault_anon 332
pgsteal 102865 workingset_refault_anon 4
pgsteal 923280 workingset_refault_anon 72715
pgsteal 104682 workingset_refault_anon 372
pgsteal 102870 workingset_refault_anon 7
pgsteal 102902 workingset_refault_anon 661
pgsteal 103053 workingset_refault_anon 40
pgsteal 103685 workingset_refault_anon 540
pgsteal 103857 workingset_refault_anon 970
pgsteal 109210 workingset_refault_anon 2806
pgsteal 103627 workingset_refault_anon 319
pgsteal 104029 workingset_refault_anon 42
pgsteal 918361 workingset_refault_anon 90387
pgsteal 103489 workingset_refault_anon 626
pgsteal 103188 workingset_refault_anon 801
pgsteal 102875 workingset_refault_anon 11
pgsteal 102994 workingset_refault_anon 79
pgsteal 102910 workingset_refault_anon 43
pgsteal 102922 workingset_refault_anon 687
pgsteal 103941 workingset_refault_anon 1219
pgsteal 903622 workingset_refault_anon 113751
pgsteal 664357 workingset_refault_anon 27959
pgsteal 104947 workingset_refault_anon 11
pgsteal 701084 workingset_refault_anon 30665
pgsteal 650719 workingset_refault_anon 20810
pgsteal 641924 workingset_refault_anon 17137
pgsteal 933870 workingset_refault_anon 98393
pgsteal 633231 workingset_refault_anon 15924
pgsteal 102936 workingset_refault_anon 34
pgsteal 104020 workingset_refault_anon 781
pgsteal 104274 workingset_refault_anon 1841
pgsteal 621672 workingset_refault_anon 5891
pgsteal 103307 workingset_refault_anon 474
pgsteal 103386 workingset_refault_anon 27
pgsteal 103266 workingset_refault_anon 243
pgsteal 102896 workingset_refault_anon 15
pgsteal 103905 workingset_refault_anon 988
pgsteal 103104 workingset_refault_anon 304
pgsteal 104277 workingset_refault_anon 285
pgsteal 696374 workingset_refault_anon 24971
pgsteal 103009 workingset_refault_anon 775
pgsteal 103849 workingset_refault_anon 747
pgsteal 102867 workingset_refault_anon 9
pgsteal 700211 workingset_refault_anon 35289
pgsteal 102923 workingset_refault_anon 88
pgsteal 104139 workingset_refault_anon 789
pgsteal 105152 workingset_refault_anon 1257
pgsteal 102945 workingset_refault_anon 76
pgsteal 103227 workingset_refault_anon 343
pgsteal 102880 workingset_refault_anon 95
pgsteal 102967 workingset_refault_anon 101
pgsteal 989176 workingset_refault_anon 89597
pgsteal 694181 workingset_refault_anon 22499
pgsteal 784354 workingset_refault_anon 68311
pgsteal 102882 workingset_refault_anon 24
pgsteal 103108 workingset_refault_anon 24
-------------------------------------------------------------------
The original data memcg iter:
pgsteal:
SUM: 39036863 AVERAGE: 304975.4922 STDEV: 284226.526
refault:
SUM: 2027567 AVERAGE: 15840.36719 STDEV: 21332.00262
pgsteal 103167 workingset_refault_anon 203
pgsteal 714044 workingset_refault_anon 42633
pgsteal 103209 workingset_refault_anon 581
pgsteal 103605 workingset_refault_anon 240
pgsteal 740909 workingset_refault_anon 53177
pgsteal 103089 workingset_refault_anon 141
pgsteal 726760 workingset_refault_anon 32624
pgsteal 104039 workingset_refault_anon 397
pgsteal 754667 workingset_refault_anon 56144
pgsteal 713916 workingset_refault_anon 41813
pgsteal 104104 workingset_refault_anon 307
pgsteal 109567 workingset_refault_anon 244
pgsteal 714194 workingset_refault_anon 47076
pgsteal 711693 workingset_refault_anon 35616
pgsteal 105026 workingset_refault_anon 2221
pgsteal 103442 workingset_refault_anon 269
pgsteal 112773 workingset_refault_anon 5086
pgsteal 715969 workingset_refault_anon 32457
pgsteal 127828 workingset_refault_anon 9579
pgsteal 102885 workingset_refault_anon 109
pgsteal 112156 workingset_refault_anon 2974
pgsteal 104242 workingset_refault_anon 948
pgsteal 701184 workingset_refault_anon 47940
pgsteal 104080 workingset_refault_anon 836
pgsteal 106606 workingset_refault_anon 2420
pgsteal 103666 workingset_refault_anon 129
pgsteal 103330 workingset_refault_anon 532
pgsteal 103639 workingset_refault_anon 275
pgsteal 108494 workingset_refault_anon 3814
pgsteal 103626 workingset_refault_anon 412
pgsteal 103697 workingset_refault_anon 577
pgsteal 103736 workingset_refault_anon 582
pgsteal 103360 workingset_refault_anon 281
pgsteal 116733 workingset_refault_anon 6674
pgsteal 102978 workingset_refault_anon 5
pgsteal 108945 workingset_refault_anon 3141
pgsteal 706630 workingset_refault_anon 33241
pgsteal 103426 workingset_refault_anon 134
pgsteal 715070 workingset_refault_anon 33575
pgsteal 102871 workingset_refault_anon 12
pgsteal 103617 workingset_refault_anon 776
pgsteal 767084 workingset_refault_anon 64710
pgsteal 104197 workingset_refault_anon 176
pgsteal 104488 workingset_refault_anon 1469
pgsteal 103253 workingset_refault_anon 228
pgsteal 702800 workingset_refault_anon 26424
pgsteal 107469 workingset_refault_anon 2838
pgsteal 104441 workingset_refault_anon 1562
pgsteal 123013 workingset_refault_anon 13117
pgsteal 737817 workingset_refault_anon 53330
pgsteal 103939 workingset_refault_anon 759
pgsteal 103568 workingset_refault_anon 783
pgsteal 122707 workingset_refault_anon 11944
pgsteal 103690 workingset_refault_anon 885
pgsteal 103456 workingset_refault_anon 145
pgsteal 104068 workingset_refault_anon 632
pgsteal 319368 workingset_refault_anon 12579
pgsteal 103912 workingset_refault_anon 304
pgsteal 119416 workingset_refault_anon 3350
pgsteal 717107 workingset_refault_anon 34764
pgsteal 107163 workingset_refault_anon 535
pgsteal 103299 workingset_refault_anon 142
pgsteal 103825 workingset_refault_anon 176
pgsteal 408564 workingset_refault_anon 14606
pgsteal 115785 workingset_refault_anon 4622
pgsteal 119234 workingset_refault_anon 9225
pgsteal 729060 workingset_refault_anon 54309
pgsteal 107149 workingset_refault_anon 536
pgsteal 708839 workingset_refault_anon 43133
pgsteal 695961 workingset_refault_anon 40182
pgsteal 723303 workingset_refault_anon 32298
pgsteal 103581 workingset_refault_anon 1305
pgsteal 699646 workingset_refault_anon 49924
pgsteal 717867 workingset_refault_anon 39229
pgsteal 104148 workingset_refault_anon 1318
pgsteal 104127 workingset_refault_anon 568
pgsteal 103168 workingset_refault_anon 322
pgsteal 103477 workingset_refault_anon 538
pgsteal 103022 workingset_refault_anon 60
pgsteal 103305 workingset_refault_anon 323
pgsteal 103812 workingset_refault_anon 1324
pgsteal 103139 workingset_refault_anon 126
pgsteal 723251 workingset_refault_anon 34206
pgsteal 103068 workingset_refault_anon 861
pgsteal 742515 workingset_refault_anon 54439
pgsteal 762161 workingset_refault_anon 52654
pgsteal 103934 workingset_refault_anon 889
pgsteal 104065 workingset_refault_anon 315
pgsteal 383893 workingset_refault_anon 25036
pgsteal 107929 workingset_refault_anon 2367
pgsteal 726127 workingset_refault_anon 45809
pgsteal 675291 workingset_refault_anon 66534
pgsteal 105585 workingset_refault_anon 2323
pgsteal 105098 workingset_refault_anon 1625
pgsteal 104264 workingset_refault_anon 718
pgsteal 741873 workingset_refault_anon 47045
pgsteal 103466 workingset_refault_anon 70
pgsteal 723870 workingset_refault_anon 58780
pgsteal 104740 workingset_refault_anon 521
pgsteal 740739 workingset_refault_anon 45099
pgsteal 752994 workingset_refault_anon 53713
pgsteal 110164 workingset_refault_anon 2572
pgsteal 711304 workingset_refault_anon 41135
pgsteal 746870 workingset_refault_anon 60298
pgsteal 729166 workingset_refault_anon 42594
pgsteal 110138 workingset_refault_anon 1511
pgsteal 103836 workingset_refault_anon 675
pgsteal 116821 workingset_refault_anon 3952
pgsteal 104967 workingset_refault_anon 2035
pgsteal 711362 workingset_refault_anon 31458
pgsteal 103835 workingset_refault_anon 507
pgsteal 113846 workingset_refault_anon 2997
pgsteal 104406 workingset_refault_anon 1724
pgsteal 103551 workingset_refault_anon 1293
pgsteal 705340 workingset_refault_anon 44234
pgsteal 728076 workingset_refault_anon 29849
pgsteal 103829 workingset_refault_anon 254
pgsteal 103700 workingset_refault_anon 712
pgsteal 103382 workingset_refault_anon 506
pgsteal 728881 workingset_refault_anon 60152
pgsteal 614645 workingset_refault_anon 43956
pgsteal 107672 workingset_refault_anon 2768
pgsteal 123550 workingset_refault_anon 11937
pgsteal 103747 workingset_refault_anon 899
pgsteal 747657 workingset_refault_anon 50264
pgsteal 110949 workingset_refault_anon 1422
pgsteal 103596 workingset_refault_anon 278
pgsteal 742471 workingset_refault_anon 69586
--
Best regards,
Ridong
* Re: [RFC PATCH -next 1/2] mm/mglru: use mem_cgroup_iter for global reclaim
2025-12-05 2:57 ` [RFC PATCH -next 1/2] mm/mglru: use mem_cgroup_iter for global reclaim zhongjinji
@ 2025-12-08 2:35 ` Chen Ridong
0 siblings, 0 replies; 10+ messages in thread
From: Chen Ridong @ 2025-12-08 2:35 UTC (permalink / raw)
To: zhongjinji, hannes
Cc: Liam.Howlett, akpm, axelrasmussen, cgroups, chenridong, corbet,
david, linux-doc, linux-kernel, linux-mm, lorenzo.stoakes,
lujialin4, mhocko, muchun.song, roman.gushchin, rppt,
shakeel.butt, surenb, vbabka, weixugc, yuanchu, yuzhao,
zhengqi.arch
On 2025/12/5 10:57, zhongjinji wrote:
>> From: Chen Ridong <chenridong@huawei.com>
>>
>> The memcg LRU was originally introduced for global reclaim to enhance
>> scalability. However, its implementation complexity has led to performance
>> regressions when dealing with a large number of memory cgroups [1].
>>
>> As suggested by Johannes [1], this patch adopts mem_cgroup_iter with
>> cookie-based iteration for global reclaim, aligning with the approach
>> already used in shrink_node_memcgs. This simplification removes the
>> dedicated memcg LRU tracking while maintaining the core functionality.
>>
>> A stress test was performed based on Yu Zhao's methodology [2] on a
>> 1 TB, 4-node NUMA system. The results are summarized below:
>>
>> memcg LRU memcg iter
>> stddev(pgsteal) / mean(pgsteal) 91.2% 75.7%
>> sum(pgsteal) / sum(requested) 216.4% 230.5%
>
> Are there more data available? For example, the load of kswapd or the refault values.
>
> I am concerned about these two data points because Yu Zhao's implementation controls
> the fairness of aging through memcg gen (get_memcg_gen). This helps reduce excessive
> aging for certain cgroups, which is beneficial for kswapd's power consumption.
>
> At the same time, pages that age earlier can be considered colder pages (in the entire system),
> so reclaiming them should also help with the refault values.
>
I re-ran the test and observed a 3.2% increase in refaults. Does this address the concern you
raised?
The complete data set is available in my earlier email:
https://lore.kernel.org/all/e657d5ac-6f92-4dbb-bf32-76084988d024@huaweicloud.com/
>> The new implementation demonstrates a significant improvement in
>> fairness, reducing the standard deviation relative to the mean by
>> 15.5 percentage points, while total pages reclaimed show a slight
>> increase in overscan (from 85,086,871 to 90,633,890, a 6.5% increase).
>>
>> The primary benefits of this change are:
>> 1. Simplified codebase by removing custom memcg LRU infrastructure
>> 2. Improved fairness in memory reclaim across multiple cgroups
>> 3. Better performance when creating many memory cgroups
>>
>> [1] https://lore.kernel.org/r/20251126171513.GC135004@cmpxchg.org
>> [2] https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com
>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
--
Best regards,
Ridong
* Re: [RFC PATCH -next 1/2] mm/mglru: use mem_cgroup_iter for global reclaim
2025-12-04 22:29 ` Shakeel Butt
2025-12-08 2:26 ` Chen Ridong
@ 2025-12-08 3:10 ` Chen Ridong
1 sibling, 0 replies; 10+ messages in thread
From: Chen Ridong @ 2025-12-08 3:10 UTC (permalink / raw)
To: Shakeel Butt
Cc: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet, hannes,
roman.gushchin, muchun.song, yuzhao, zhengqi.arch, linux-mm,
linux-doc, linux-kernel, cgroups, lujialin4, chenridong
On 2025/12/5 6:29, Shakeel Butt wrote:
> Hi Chen,
>
> On Thu, Dec 04, 2025 at 12:31:23PM +0000, Chen Ridong wrote:
>> From: Chen Ridong <chenridong@huawei.com>
>>
>> The memcg LRU was originally introduced for global reclaim to enhance
>> scalability. However, its implementation complexity has led to performance
>> regressions when dealing with a large number of memory cgroups [1].
>>
>> As suggested by Johannes [1], this patch adopts mem_cgroup_iter with
>> cookie-based iteration for global reclaim, aligning with the approach
>> already used in shrink_node_memcgs. This simplification removes the
>> dedicated memcg LRU tracking while maintaining the core functionality.
>>
>> A stress test was performed based on Yu Zhao's methodology [2] on a
>> 1 TB, 4-node NUMA system. The results are summarized below:
>>
>> memcg LRU memcg iter
>> stddev(pgsteal) / mean(pgsteal) 91.2% 75.7%
>> sum(pgsteal) / sum(requested) 216.4% 230.5%
>>
>> The new implementation demonstrates a significant improvement in
>> fairness, reducing the standard deviation relative to the mean by
>> 15.5 percentage points, while total pages reclaimed show a slight
>> increase in overscan (from 85,086,871 to 90,633,890, a 6.5% increase).
>>
>> The primary benefits of this change are:
>> 1. Simplified codebase by removing custom memcg LRU infrastructure
>> 2. Improved fairness in memory reclaim across multiple cgroups
>> 3. Better performance when creating many memory cgroups
>>
>> [1] https://lore.kernel.org/r/20251126171513.GC135004@cmpxchg.org
>> [2] https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com
>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
>
> Thanks a lot for this awesome work.
>
>> ---
>> mm/vmscan.c | 117 ++++++++++++++++------------------------------------
>> 1 file changed, 36 insertions(+), 81 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index fddd168a9737..70b0e7e5393c 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -4895,27 +4895,14 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>> return nr_to_scan < 0;
>> }
>>
>> -static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
>> +static void shrink_one(struct lruvec *lruvec, struct scan_control *sc)
>> {
>> - bool success;
>> unsigned long scanned = sc->nr_scanned;
>> unsigned long reclaimed = sc->nr_reclaimed;
>> - struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>> struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>> + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>
>> - /* lru_gen_age_node() called mem_cgroup_calculate_protection() */
>> - if (mem_cgroup_below_min(NULL, memcg))
>> - return MEMCG_LRU_YOUNG;
>> -
>> - if (mem_cgroup_below_low(NULL, memcg)) {
>> - /* see the comment on MEMCG_NR_GENS */
>> - if (READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_TAIL)
>> - return MEMCG_LRU_TAIL;
>> -
>> - memcg_memory_event(memcg, MEMCG_LOW);
>> - }
>> -
>> - success = try_to_shrink_lruvec(lruvec, sc);
>> + try_to_shrink_lruvec(lruvec, sc);
>>
>> shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
>>
>> @@ -4924,86 +4911,55 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
>> sc->nr_reclaimed - reclaimed);
>>
>> flush_reclaim_state(sc);
>
> Unrelated to your patch, but why is this flush_reclaim_state() at a
> different place from the non-MGLRU code path?
>
Thank you, Shakeel, for your reply.
IIUC, I think adding flush_reclaim_state here makes sense. Currently, shrink_one is only used for
root-level reclaim in gen-LRU, and flush_reclaim_state is only relevant during root reclaim.
Flushing after each lruvec is shrunk could help the reclaim loop terminate earlier, as
sc->nr_reclaimed += current->reclaim_state->reclaimed; may reach nr_to_reclaim sooner.
That said, I'm also wondering whether we should apply flush_reclaim_state for every iteration in
non-MGLRU reclaim as well. For non-root reclaim, it should be negligible since it effectively does
nothing. But for root-level reclaim under non-MGLRU, it might similarly help stop the iteration earlier.
>> -
>> - if (success && mem_cgroup_online(memcg))
>> - return MEMCG_LRU_YOUNG;
>> -
>> - if (!success && lruvec_is_sizable(lruvec, sc))
>> - return 0;
>> -
>> - /* one retry if offlined or too small */
>> - return READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_TAIL ?
>> - MEMCG_LRU_TAIL : MEMCG_LRU_YOUNG;
>> }
>>
>> static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
>
> This function has become very similar to shrink_node_memcgs(), other
> than shrink_one() vs shrink_lruvec(). Can you try to combine them and
> see if the result doesn't look ugly? Otherwise the code looks good to me.
>
Will try to.
--
Best regards,
Ridong