* [RFC PATCH -next 0/2] mm/mglru: remove memcg lru
@ 2025-12-04 12:31 Chen Ridong
2025-12-04 12:31 ` [RFC PATCH -next 1/2] mm/mglru: use mem_cgroup_iter for global reclaim Chen Ridong
2025-12-04 12:31 ` [RFC PATCH -next 2/2] mm/mglru: remove memcg lru Chen Ridong
0 siblings, 2 replies; 10+ messages in thread
From: Chen Ridong @ 2025-12-04 12:31 UTC (permalink / raw)
To: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet, hannes,
roman.gushchin, shakeel.butt, muchun.song, yuzhao, zhengqi.arch
Cc: linux-mm, linux-doc, linux-kernel, cgroups, lujialin4, chenridong
From: Chen Ridong <chenridong@huawei.com>
The memcg LRU was introduced for global reclaim to improve scalability,
but its implementation has grown complex. Moreover, it can cause
performance regressions when creating a large number of memory cgroups [1].
This series implements mem_cgroup_iter with a reclaim cookie in
shrink_many() for global reclaim, following the pattern already established
in shrink_node_memcgs(), an approach suggested by Johannes [1]. The new
approach provides good fairness across cgroups by preserving iteration
state between reclaim passes.
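For reviewers who want the shape of the change up front, below is a minimal
sketch of the cookie-based iteration, condensed from the shrink_many() rework
in patch 1. Memcg protection checks, rescheduling and the kswapd/full-walk
cases (which skip the cookie) are omitted, and shrink_many_sketch() is only an
illustrative name:

/*
 * Minimal sketch, condensed from the shrink_many() rework in patch 1.
 * The cookie lets mem_cgroup_iter() resume where the previous reclaim
 * pass on this node left off, which is what preserves fairness across
 * cgroups over successive passes.
 */
static void shrink_many_sketch(struct pglist_data *pgdat, struct scan_control *sc)
{
        struct mem_cgroup *target = sc->target_mem_cgroup;
        struct mem_cgroup_reclaim_cookie reclaim = { .pgdat = pgdat };
        struct mem_cgroup *memcg;

        memcg = mem_cgroup_iter(target, NULL, &reclaim);
        while (memcg) {
                struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);

                /* reclaim folios and slab for this memcg on this node */
                shrink_one(lruvec, sc);

                /*
                 * Bail out once the request is satisfied; the cookie
                 * remembers where the next pass should continue.
                 */
                if (sc->nr_reclaimed >= sc->nr_to_reclaim) {
                        mem_cgroup_iter_break(target, memcg);
                        break;
                }

                memcg = mem_cgroup_iter(target, memcg, &reclaim);
        }
}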
Testing was performed using the original stress test from Yu Zhao [2] on a
1 TB, 4-node NUMA system. The results show:

                                  before     after
stddev(pgsteal) / mean(pgsteal)    91.2%     75.7%
sum(pgsteal) / sum(requested)     216.4%    230.5%
The new implementation reduces the standard deviation relative to the mean
by 15.5 percentage points, indicating improved fairness in memory reclaim
distribution. The total pages reclaimed increased from 85,086,871 to
90,633,890 (6.5% increase), resulting in a higher ratio of actual to
requested reclaim.
To simplify review:
- Patch 1 uses mem_cgroup_iter with reclaim cookie in shrink_many()
- Patch 2 removes the now-unused memcg LRU code
[1] https://lore.kernel.org/r/20251126171513.GC135004@cmpxchg.org
[2] https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com
Chen Ridong (2):
mm/mglru: use mem_cgroup_iter for global reclaim
mm/mglru: remove memcg lru
Documentation/mm/multigen_lru.rst | 30 ----
include/linux/mmzone.h | 89 ----------
mm/memcontrol-v1.c | 6 -
mm/memcontrol.c | 4 -
mm/mm_init.c | 1 -
mm/vmscan.c | 270 ++++--------------------------
6 files changed, 37 insertions(+), 363 deletions(-)
--
2.34.1
* [RFC PATCH -next 1/2] mm/mglru: use mem_cgroup_iter for global reclaim
2025-12-04 12:31 [RFC PATCH -next 0/2] mm/mglru: remove memcg lru Chen Ridong
@ 2025-12-04 12:31 ` Chen Ridong
2025-12-04 18:28 ` Johannes Weiner
2025-12-04 22:29 ` Shakeel Butt
2025-12-04 12:31 ` [RFC PATCH -next 2/2] mm/mglru: remove memcg lru Chen Ridong
1 sibling, 2 replies; 10+ messages in thread
From: Chen Ridong @ 2025-12-04 12:31 UTC (permalink / raw)
To: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet, hannes,
roman.gushchin, shakeel.butt, muchun.song, yuzhao, zhengqi.arch
Cc: linux-mm, linux-doc, linux-kernel, cgroups, lujialin4, chenridong
From: Chen Ridong <chenridong@huawei.com>
The memcg LRU was originally introduced for global reclaim to enhance
scalability. However, its implementation complexity has led to performance
regressions when dealing with a large number of memory cgroups [1].
As suggested by Johannes [1], this patch adopts mem_cgroup_iter with
cookie-based iteration for global reclaim, aligning with the approach
already used in shrink_node_memcgs. This simplification removes the
dedicated memcg LRU tracking while maintaining the core functionality.
A stress test was performed based on Yu Zhao's methodology [2] on a
1 TB, 4-node NUMA system. The results are summarized below:

                                 memcg LRU    memcg iter
stddev(pgsteal) / mean(pgsteal)      91.2%         75.7%
sum(pgsteal) / sum(requested)       216.4%        230.5%

The new implementation demonstrates a significant improvement in
fairness, reducing the standard deviation relative to the mean by
15.5 percentage points, while total pages reclaimed show a slight
increase in overscan (from 85,086,871 to 90,633,890, a 6.5% increase).
The primary benefits of this change are:
1. Simplified codebase by removing custom memcg LRU infrastructure
2. Improved fairness in memory reclaim across multiple cgroups
3. Better performance when creating many memory cgroups
[1] https://lore.kernel.org/r/20251126171513.GC135004@cmpxchg.org
[2] https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
mm/vmscan.c | 117 ++++++++++++++++------------------------------------
1 file changed, 36 insertions(+), 81 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fddd168a9737..70b0e7e5393c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4895,27 +4895,14 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
return nr_to_scan < 0;
}
-static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
+static void shrink_one(struct lruvec *lruvec, struct scan_control *sc)
{
- bool success;
unsigned long scanned = sc->nr_scanned;
unsigned long reclaimed = sc->nr_reclaimed;
- struct mem_cgroup *memcg = lruvec_memcg(lruvec);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
- /* lru_gen_age_node() called mem_cgroup_calculate_protection() */
- if (mem_cgroup_below_min(NULL, memcg))
- return MEMCG_LRU_YOUNG;
-
- if (mem_cgroup_below_low(NULL, memcg)) {
- /* see the comment on MEMCG_NR_GENS */
- if (READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_TAIL)
- return MEMCG_LRU_TAIL;
-
- memcg_memory_event(memcg, MEMCG_LOW);
- }
-
- success = try_to_shrink_lruvec(lruvec, sc);
+ try_to_shrink_lruvec(lruvec, sc);
shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
@@ -4924,86 +4911,55 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
sc->nr_reclaimed - reclaimed);
flush_reclaim_state(sc);
-
- if (success && mem_cgroup_online(memcg))
- return MEMCG_LRU_YOUNG;
-
- if (!success && lruvec_is_sizable(lruvec, sc))
- return 0;
-
- /* one retry if offlined or too small */
- return READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_TAIL ?
- MEMCG_LRU_TAIL : MEMCG_LRU_YOUNG;
}
static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
{
- int op;
- int gen;
- int bin;
- int first_bin;
- struct lruvec *lruvec;
- struct lru_gen_folio *lrugen;
+ struct mem_cgroup *target = sc->target_mem_cgroup;
+ struct mem_cgroup_reclaim_cookie reclaim = {
+ .pgdat = pgdat,
+ };
+ struct mem_cgroup_reclaim_cookie *cookie = &reclaim;
struct mem_cgroup *memcg;
- struct hlist_nulls_node *pos;
- gen = get_memcg_gen(READ_ONCE(pgdat->memcg_lru.seq));
- bin = first_bin = get_random_u32_below(MEMCG_NR_BINS);
-restart:
- op = 0;
- memcg = NULL;
-
- rcu_read_lock();
+ if (current_is_kswapd() || sc->memcg_full_walk)
+ cookie = NULL;
- hlist_nulls_for_each_entry_rcu(lrugen, pos, &pgdat->memcg_lru.fifo[gen][bin], list) {
- if (op) {
- lru_gen_rotate_memcg(lruvec, op);
- op = 0;
- }
+ memcg = mem_cgroup_iter(target, NULL, cookie);
+ while (memcg) {
+ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
- mem_cgroup_put(memcg);
- memcg = NULL;
+ cond_resched();
- if (gen != READ_ONCE(lrugen->gen))
- continue;
+ mem_cgroup_calculate_protection(target, memcg);
- lruvec = container_of(lrugen, struct lruvec, lrugen);
- memcg = lruvec_memcg(lruvec);
+ if (mem_cgroup_below_min(target, memcg))
+ goto next;
- if (!mem_cgroup_tryget(memcg)) {
- lru_gen_release_memcg(memcg);
- memcg = NULL;
- continue;
+ if (mem_cgroup_below_low(target, memcg)) {
+ if (!sc->memcg_low_reclaim) {
+ sc->memcg_low_skipped = 1;
+ goto next;
+ }
+ memcg_memory_event(memcg, MEMCG_LOW);
}
- rcu_read_unlock();
+ shrink_one(lruvec, sc);
- op = shrink_one(lruvec, sc);
-
- rcu_read_lock();
-
- if (should_abort_scan(lruvec, sc))
+ if (should_abort_scan(lruvec, sc)) {
+ if (cookie)
+ mem_cgroup_iter_break(target, memcg);
break;
- }
-
- rcu_read_unlock();
-
- if (op)
- lru_gen_rotate_memcg(lruvec, op);
-
- mem_cgroup_put(memcg);
-
- if (!is_a_nulls(pos))
- return;
+ }
- /* restart if raced with lru_gen_rotate_memcg() */
- if (gen != get_nulls_value(pos))
- goto restart;
+next:
+ if (cookie && sc->nr_reclaimed >= sc->nr_to_reclaim) {
+ mem_cgroup_iter_break(target, memcg);
+ break;
+ }
- /* try the rest of the bins of the current generation */
- bin = get_memcg_bin(bin + 1);
- if (bin != first_bin)
- goto restart;
+ memcg = mem_cgroup_iter(target, memcg, cookie);
+ }
}
static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
@@ -5019,8 +4975,7 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
set_mm_walk(NULL, sc->proactive);
- if (try_to_shrink_lruvec(lruvec, sc))
- lru_gen_rotate_memcg(lruvec, MEMCG_LRU_YOUNG);
+ try_to_shrink_lruvec(lruvec, sc);
clear_mm_walk();
--
2.34.1
* [RFC PATCH -next 2/2] mm/mglru: remove memcg lru
2025-12-04 12:31 [RFC PATCH -next 0/2] mm/mglru: remove memcg lru Chen Ridong
2025-12-04 12:31 ` [RFC PATCH -next 1/2] mm/mglru: use mem_cgroup_iter for global reclaim Chen Ridong
@ 2025-12-04 12:31 ` Chen Ridong
2025-12-04 18:34 ` Johannes Weiner
1 sibling, 1 reply; 10+ messages in thread
From: Chen Ridong @ 2025-12-04 12:31 UTC (permalink / raw)
To: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet, hannes,
roman.gushchin, shakeel.butt, muchun.song, yuzhao, zhengqi.arch
Cc: linux-mm, linux-doc, linux-kernel, cgroups, lujialin4, chenridong
From: Chen Ridong <chenridong@huawei.com>
Now that the previous patch has switched global reclaim to use
mem_cgroup_iter, the specialized memcg LRU infrastructure is no longer
needed. This patch removes all related code.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
Documentation/mm/multigen_lru.rst | 30 ------
include/linux/mmzone.h | 89 -----------------
mm/memcontrol-v1.c | 6 --
mm/memcontrol.c | 4 -
mm/mm_init.c | 1 -
mm/vmscan.c | 153 +-----------------------------
6 files changed, 1 insertion(+), 282 deletions(-)
diff --git a/Documentation/mm/multigen_lru.rst b/Documentation/mm/multigen_lru.rst
index 52ed5092022f..bf8547e2f592 100644
--- a/Documentation/mm/multigen_lru.rst
+++ b/Documentation/mm/multigen_lru.rst
@@ -220,36 +220,6 @@ time domain because a CPU can scan pages at different rates under
varying memory pressure. It calculates a moving average for each new
generation to avoid being permanently locked in a suboptimal state.
-Memcg LRU
----------
-An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
-since each node and memcg combination has an LRU of folios (see
-``mem_cgroup_lruvec()``). Its goal is to improve the scalability of
-global reclaim, which is critical to system-wide memory overcommit in
-data centers. Note that memcg LRU only applies to global reclaim.
-
-The basic structure of an memcg LRU can be understood by an analogy to
-the active/inactive LRU (of folios):
-
-1. It has the young and the old (generations), i.e., the counterparts
- to the active and the inactive;
-2. The increment of ``max_seq`` triggers promotion, i.e., the
- counterpart to activation;
-3. Other events trigger similar operations, e.g., offlining an memcg
- triggers demotion, i.e., the counterpart to deactivation.
-
-In terms of global reclaim, it has two distinct features:
-
-1. Sharding, which allows each thread to start at a random memcg (in
- the old generation) and improves parallelism;
-2. Eventual fairness, which allows direct reclaim to bail out at will
- and reduces latency without affecting fairness over some time.
-
-In terms of traversing memcgs during global reclaim, it improves the
-best-case complexity from O(n) to O(1) and does not affect the
-worst-case complexity O(n). Therefore, on average, it has a sublinear
-complexity.
-
Summary
-------
The multi-gen LRU (of folios) can be disassembled into the following
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 75ef7c9f9307..49952301ff3b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -509,12 +509,6 @@ struct lru_gen_folio {
atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
/* whether the multi-gen LRU is enabled */
bool enabled;
- /* the memcg generation this lru_gen_folio belongs to */
- u8 gen;
- /* the list segment this lru_gen_folio belongs to */
- u8 seg;
- /* per-node lru_gen_folio list for global reclaim */
- struct hlist_nulls_node list;
};
enum {
@@ -558,79 +552,14 @@ struct lru_gen_mm_walk {
bool force_scan;
};
-/*
- * For each node, memcgs are divided into two generations: the old and the
- * young. For each generation, memcgs are randomly sharded into multiple bins
- * to improve scalability. For each bin, the hlist_nulls is virtually divided
- * into three segments: the head, the tail and the default.
- *
- * An onlining memcg is added to the tail of a random bin in the old generation.
- * The eviction starts at the head of a random bin in the old generation. The
- * per-node memcg generation counter, whose reminder (mod MEMCG_NR_GENS) indexes
- * the old generation, is incremented when all its bins become empty.
- *
- * There are four operations:
- * 1. MEMCG_LRU_HEAD, which moves a memcg to the head of a random bin in its
- * current generation (old or young) and updates its "seg" to "head";
- * 2. MEMCG_LRU_TAIL, which moves a memcg to the tail of a random bin in its
- * current generation (old or young) and updates its "seg" to "tail";
- * 3. MEMCG_LRU_OLD, which moves a memcg to the head of a random bin in the old
- * generation, updates its "gen" to "old" and resets its "seg" to "default";
- * 4. MEMCG_LRU_YOUNG, which moves a memcg to the tail of a random bin in the
- * young generation, updates its "gen" to "young" and resets its "seg" to
- * "default".
- *
- * The events that trigger the above operations are:
- * 1. Exceeding the soft limit, which triggers MEMCG_LRU_HEAD;
- * 2. The first attempt to reclaim a memcg below low, which triggers
- * MEMCG_LRU_TAIL;
- * 3. The first attempt to reclaim a memcg offlined or below reclaimable size
- * threshold, which triggers MEMCG_LRU_TAIL;
- * 4. The second attempt to reclaim a memcg offlined or below reclaimable size
- * threshold, which triggers MEMCG_LRU_YOUNG;
- * 5. Attempting to reclaim a memcg below min, which triggers MEMCG_LRU_YOUNG;
- * 6. Finishing the aging on the eviction path, which triggers MEMCG_LRU_YOUNG;
- * 7. Offlining a memcg, which triggers MEMCG_LRU_OLD.
- *
- * Notes:
- * 1. Memcg LRU only applies to global reclaim, and the round-robin incrementing
- * of their max_seq counters ensures the eventual fairness to all eligible
- * memcgs. For memcg reclaim, it still relies on mem_cgroup_iter().
- * 2. There are only two valid generations: old (seq) and young (seq+1).
- * MEMCG_NR_GENS is set to three so that when reading the generation counter
- * locklessly, a stale value (seq-1) does not wraparound to young.
- */
-#define MEMCG_NR_GENS 3
-#define MEMCG_NR_BINS 8
-
-struct lru_gen_memcg {
- /* the per-node memcg generation counter */
- unsigned long seq;
- /* each memcg has one lru_gen_folio per node */
- unsigned long nr_memcgs[MEMCG_NR_GENS];
- /* per-node lru_gen_folio list for global reclaim */
- struct hlist_nulls_head fifo[MEMCG_NR_GENS][MEMCG_NR_BINS];
- /* protects the above */
- spinlock_t lock;
-};
-
-void lru_gen_init_pgdat(struct pglist_data *pgdat);
void lru_gen_init_lruvec(struct lruvec *lruvec);
bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
void lru_gen_init_memcg(struct mem_cgroup *memcg);
void lru_gen_exit_memcg(struct mem_cgroup *memcg);
-void lru_gen_online_memcg(struct mem_cgroup *memcg);
-void lru_gen_offline_memcg(struct mem_cgroup *memcg);
-void lru_gen_release_memcg(struct mem_cgroup *memcg);
-void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid);
#else /* !CONFIG_LRU_GEN */
-static inline void lru_gen_init_pgdat(struct pglist_data *pgdat)
-{
-}
-
static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
{
}
@@ -648,22 +577,6 @@ static inline void lru_gen_exit_memcg(struct mem_cgroup *memcg)
{
}
-static inline void lru_gen_online_memcg(struct mem_cgroup *memcg)
-{
-}
-
-static inline void lru_gen_offline_memcg(struct mem_cgroup *memcg)
-{
-}
-
-static inline void lru_gen_release_memcg(struct mem_cgroup *memcg)
-{
-}
-
-static inline void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid)
-{
-}
-
#endif /* CONFIG_LRU_GEN */
struct lruvec {
@@ -1503,8 +1416,6 @@ typedef struct pglist_data {
#ifdef CONFIG_LRU_GEN
/* kswap mm walk data */
struct lru_gen_mm_walk mm_walk;
- /* lru_gen_folio list */
- struct lru_gen_memcg memcg_lru;
#endif
CACHELINE_PADDING(_pad2_);
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 6eed14bff742..8f41e72ae7f0 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -182,12 +182,6 @@ static void memcg1_update_tree(struct mem_cgroup *memcg, int nid)
struct mem_cgroup_per_node *mz;
struct mem_cgroup_tree_per_node *mctz;
- if (lru_gen_enabled()) {
- if (soft_limit_excess(memcg))
- lru_gen_soft_reclaim(memcg, nid);
- return;
- }
-
mctz = soft_limit_tree.rb_tree_per_node[nid];
if (!mctz)
return;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index be810c1fbfc3..ab3ebecb5ec7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3874,8 +3874,6 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
if (unlikely(mem_cgroup_is_root(memcg)) && !mem_cgroup_disabled())
queue_delayed_work(system_unbound_wq, &stats_flush_dwork,
FLUSH_TIME);
- lru_gen_online_memcg(memcg);
-
/* Online state pins memcg ID, memcg ID pins CSS */
refcount_set(&memcg->id.ref, 1);
css_get(css);
@@ -3915,7 +3913,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
reparent_deferred_split_queue(memcg);
reparent_shrinker_deferred(memcg);
wb_memcg_offline(memcg);
- lru_gen_offline_memcg(memcg);
drain_all_stock(memcg);
@@ -3927,7 +3924,6 @@ static void mem_cgroup_css_released(struct cgroup_subsys_state *css)
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
invalidate_reclaim_iterators(memcg);
- lru_gen_release_memcg(memcg);
}
static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index fc2a6f1e518f..6e5e1fe6ff31 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1745,7 +1745,6 @@ static void __init free_area_init_node(int nid)
pgdat_set_deferred_range(pgdat);
free_area_init_core(pgdat);
- lru_gen_init_pgdat(pgdat);
}
/* Any regular or high memory on that node ? */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 70b0e7e5393c..584f41eb4c14 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2698,9 +2698,6 @@ static bool should_clear_pmd_young(void)
#define for_each_evictable_type(type, swappiness) \
for ((type) = min_type(swappiness); (type) <= max_type(swappiness); (type)++)
-#define get_memcg_gen(seq) ((seq) % MEMCG_NR_GENS)
-#define get_memcg_bin(bin) ((bin) % MEMCG_NR_BINS)
-
static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
{
struct pglist_data *pgdat = NODE_DATA(nid);
@@ -4287,140 +4284,6 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
return true;
}
-/******************************************************************************
- * memcg LRU
- ******************************************************************************/
-
-/* see the comment on MEMCG_NR_GENS */
-enum {
- MEMCG_LRU_NOP,
- MEMCG_LRU_HEAD,
- MEMCG_LRU_TAIL,
- MEMCG_LRU_OLD,
- MEMCG_LRU_YOUNG,
-};
-
-static void lru_gen_rotate_memcg(struct lruvec *lruvec, int op)
-{
- int seg;
- int old, new;
- unsigned long flags;
- int bin = get_random_u32_below(MEMCG_NR_BINS);
- struct pglist_data *pgdat = lruvec_pgdat(lruvec);
-
- spin_lock_irqsave(&pgdat->memcg_lru.lock, flags);
-
- VM_WARN_ON_ONCE(hlist_nulls_unhashed(&lruvec->lrugen.list));
-
- seg = 0;
- new = old = lruvec->lrugen.gen;
-
- /* see the comment on MEMCG_NR_GENS */
- if (op == MEMCG_LRU_HEAD)
- seg = MEMCG_LRU_HEAD;
- else if (op == MEMCG_LRU_TAIL)
- seg = MEMCG_LRU_TAIL;
- else if (op == MEMCG_LRU_OLD)
- new = get_memcg_gen(pgdat->memcg_lru.seq);
- else if (op == MEMCG_LRU_YOUNG)
- new = get_memcg_gen(pgdat->memcg_lru.seq + 1);
- else
- VM_WARN_ON_ONCE(true);
-
- WRITE_ONCE(lruvec->lrugen.seg, seg);
- WRITE_ONCE(lruvec->lrugen.gen, new);
-
- hlist_nulls_del_rcu(&lruvec->lrugen.list);
-
- if (op == MEMCG_LRU_HEAD || op == MEMCG_LRU_OLD)
- hlist_nulls_add_head_rcu(&lruvec->lrugen.list, &pgdat->memcg_lru.fifo[new][bin]);
- else
- hlist_nulls_add_tail_rcu(&lruvec->lrugen.list, &pgdat->memcg_lru.fifo[new][bin]);
-
- pgdat->memcg_lru.nr_memcgs[old]--;
- pgdat->memcg_lru.nr_memcgs[new]++;
-
- if (!pgdat->memcg_lru.nr_memcgs[old] && old == get_memcg_gen(pgdat->memcg_lru.seq))
- WRITE_ONCE(pgdat->memcg_lru.seq, pgdat->memcg_lru.seq + 1);
-
- spin_unlock_irqrestore(&pgdat->memcg_lru.lock, flags);
-}
-
-#ifdef CONFIG_MEMCG
-
-void lru_gen_online_memcg(struct mem_cgroup *memcg)
-{
- int gen;
- int nid;
- int bin = get_random_u32_below(MEMCG_NR_BINS);
-
- for_each_node(nid) {
- struct pglist_data *pgdat = NODE_DATA(nid);
- struct lruvec *lruvec = get_lruvec(memcg, nid);
-
- spin_lock_irq(&pgdat->memcg_lru.lock);
-
- VM_WARN_ON_ONCE(!hlist_nulls_unhashed(&lruvec->lrugen.list));
-
- gen = get_memcg_gen(pgdat->memcg_lru.seq);
-
- lruvec->lrugen.gen = gen;
-
- hlist_nulls_add_tail_rcu(&lruvec->lrugen.list, &pgdat->memcg_lru.fifo[gen][bin]);
- pgdat->memcg_lru.nr_memcgs[gen]++;
-
- spin_unlock_irq(&pgdat->memcg_lru.lock);
- }
-}
-
-void lru_gen_offline_memcg(struct mem_cgroup *memcg)
-{
- int nid;
-
- for_each_node(nid) {
- struct lruvec *lruvec = get_lruvec(memcg, nid);
-
- lru_gen_rotate_memcg(lruvec, MEMCG_LRU_OLD);
- }
-}
-
-void lru_gen_release_memcg(struct mem_cgroup *memcg)
-{
- int gen;
- int nid;
-
- for_each_node(nid) {
- struct pglist_data *pgdat = NODE_DATA(nid);
- struct lruvec *lruvec = get_lruvec(memcg, nid);
-
- spin_lock_irq(&pgdat->memcg_lru.lock);
-
- if (hlist_nulls_unhashed(&lruvec->lrugen.list))
- goto unlock;
-
- gen = lruvec->lrugen.gen;
-
- hlist_nulls_del_init_rcu(&lruvec->lrugen.list);
- pgdat->memcg_lru.nr_memcgs[gen]--;
-
- if (!pgdat->memcg_lru.nr_memcgs[gen] && gen == get_memcg_gen(pgdat->memcg_lru.seq))
- WRITE_ONCE(pgdat->memcg_lru.seq, pgdat->memcg_lru.seq + 1);
-unlock:
- spin_unlock_irq(&pgdat->memcg_lru.lock);
- }
-}
-
-void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid)
-{
- struct lruvec *lruvec = get_lruvec(memcg, nid);
-
- /* see the comment on MEMCG_NR_GENS */
- if (READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_HEAD)
- lru_gen_rotate_memcg(lruvec, MEMCG_LRU_HEAD);
-}
-
-#endif /* CONFIG_MEMCG */
-
/******************************************************************************
* the eviction
******************************************************************************/
@@ -5613,18 +5476,6 @@ static const struct file_operations lru_gen_ro_fops = {
* initialization
******************************************************************************/
-void lru_gen_init_pgdat(struct pglist_data *pgdat)
-{
- int i, j;
-
- spin_lock_init(&pgdat->memcg_lru.lock);
-
- for (i = 0; i < MEMCG_NR_GENS; i++) {
- for (j = 0; j < MEMCG_NR_BINS; j++)
- INIT_HLIST_NULLS_HEAD(&pgdat->memcg_lru.fifo[i][j], i);
- }
-}
-
void lru_gen_init_lruvec(struct lruvec *lruvec)
{
int i;
@@ -5671,9 +5522,7 @@ void lru_gen_exit_memcg(struct mem_cgroup *memcg)
struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
- sizeof(lruvec->lrugen.nr_pages)));
-
- lruvec->lrugen.list.next = LIST_POISON1;
+ sizeof(lruvec->lrugen.nr_pages)));
if (!mm_state)
continue;
--
2.34.1
* Re: [RFC PATCH -next 1/2] mm/mglru: use mem_cgroup_iter for global reclaim
2025-12-04 12:31 ` [RFC PATCH -next 1/2] mm/mglru: use mem_cgroup_iter for global reclaim Chen Ridong
@ 2025-12-04 18:28 ` Johannes Weiner
2025-12-04 22:29 ` Shakeel Butt
1 sibling, 0 replies; 10+ messages in thread
From: Johannes Weiner @ 2025-12-04 18:28 UTC (permalink / raw)
To: Chen Ridong
Cc: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet,
roman.gushchin, shakeel.butt, muchun.song, yuzhao, zhengqi.arch,
linux-mm, linux-doc, linux-kernel, cgroups, lujialin4,
chenridong
On Thu, Dec 04, 2025 at 12:31:23PM +0000, Chen Ridong wrote:
> From: Chen Ridong <chenridong@huawei.com>
>
> The memcg LRU was originally introduced for global reclaim to enhance
> scalability. However, its implementation complexity has led to performance
> regressions when dealing with a large number of memory cgroups [1].
>
> As suggested by Johannes [1], this patch adopts mem_cgroup_iter with
> cookie-based iteration for global reclaim, aligning with the approach
> already used in shrink_node_memcgs. This simplification removes the
> dedicated memcg LRU tracking while maintaining the core functionality.
>
> A stress test was performed based on Yu Zhao's methodology [2] on a
> 1 TB, 4-node NUMA system. The results are summarized below:
>
> memcg LRU memcg iter
> stddev(pgsteal) / mean(pgsteal) 91.2% 75.7%
> sum(pgsteal) / sum(requested) 216.4% 230.5%
>
> The new implementation demonstrates a significant improvement in
> fairness, reducing the standard deviation relative to the mean by
> 15.5 percentage points, while total pages reclaimed show a slight
> increase in overscan (from 85,086,871 to 90,633,890, a 6.5% increase).
>
> The primary benefits of this change are:
> 1. Simplified codebase by removing custom memcg LRU infrastructure
> 2. Improved fairness in memory reclaim across multiple cgroups
> 3. Better performance when creating many memory cgroups
>
> [1] https://lore.kernel.org/r/20251126171513.GC135004@cmpxchg.org
> [2] https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
The diff and the test results look good to me. Comparing the resulting
shrink_many() with shrink_node_memcgs(), this also looks like a great
step towards maintainability and unification.
Thanks!
* Re: [RFC PATCH -next 2/2] mm/mglru: remove memcg lru
2025-12-04 12:31 ` [RFC PATCH -next 2/2] mm/mglru: remove memcg lru Chen Ridong
@ 2025-12-04 18:34 ` Johannes Weiner
2025-12-05 2:57 ` [RFC PATCH -next 1/2] mm/mglru: use mem_cgroup_iter for global reclaim zhongjinji
0 siblings, 1 reply; 10+ messages in thread
From: Johannes Weiner @ 2025-12-04 18:34 UTC (permalink / raw)
To: Chen Ridong
Cc: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet,
roman.gushchin, shakeel.butt, muchun.song, yuzhao, zhengqi.arch,
linux-mm, linux-doc, linux-kernel, cgroups, lujialin4,
chenridong
On Thu, Dec 04, 2025 at 12:31:24PM +0000, Chen Ridong wrote:
> From: Chen Ridong <chenridong@huawei.com>
>
> Now that the previous patch has switched global reclaim to use
> mem_cgroup_iter, the specialized memcg LRU infrastructure is no longer
> needed. This patch removes all related code.
>
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
Looks good to me!
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
> Documentation/mm/multigen_lru.rst | 30 ------
> include/linux/mmzone.h | 89 -----------------
> mm/memcontrol-v1.c | 6 --
> mm/memcontrol.c | 4 -
> mm/mm_init.c | 1 -
> mm/vmscan.c | 153 +-----------------------------
> 6 files changed, 1 insertion(+), 282 deletions(-)
* Re: [RFC PATCH -next 1/2] mm/mglru: use mem_cgroup_iter for global reclaim
2025-12-04 12:31 ` [RFC PATCH -next 1/2] mm/mglru: use mem_cgroup_iter for global reclaim Chen Ridong
2025-12-04 18:28 ` Johannes Weiner
@ 2025-12-04 22:29 ` Shakeel Butt
2025-12-08 2:26 ` Chen Ridong
2025-12-08 3:10 ` Chen Ridong
1 sibling, 2 replies; 10+ messages in thread
From: Shakeel Butt @ 2025-12-04 22:29 UTC (permalink / raw)
To: Chen Ridong
Cc: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet, hannes,
roman.gushchin, muchun.song, yuzhao, zhengqi.arch, linux-mm,
linux-doc, linux-kernel, cgroups, lujialin4, chenridong
Hi Chen,
On Thu, Dec 04, 2025 at 12:31:23PM +0000, Chen Ridong wrote:
> From: Chen Ridong <chenridong@huawei.com>
>
> The memcg LRU was originally introduced for global reclaim to enhance
> scalability. However, its implementation complexity has led to performance
> regressions when dealing with a large number of memory cgroups [1].
>
> As suggested by Johannes [1], this patch adopts mem_cgroup_iter with
> cookie-based iteration for global reclaim, aligning with the approach
> already used in shrink_node_memcgs. This simplification removes the
> dedicated memcg LRU tracking while maintaining the core functionality.
>
> A stress test was performed based on Yu Zhao's methodology [2] on a
> 1 TB, 4-node NUMA system. The results are summarized below:
>
> memcg LRU memcg iter
> stddev(pgsteal) / mean(pgsteal) 91.2% 75.7%
> sum(pgsteal) / sum(requested) 216.4% 230.5%
>
> The new implementation demonstrates a significant improvement in
> fairness, reducing the standard deviation relative to the mean by
> 15.5 percentage points, while total pages reclaimed show a slight
> increase in overscan (from 85,086,871 to 90,633,890, a 6.5% increase).
>
> The primary benefits of this change are:
> 1. Simplified codebase by removing custom memcg LRU infrastructure
> 2. Improved fairness in memory reclaim across multiple cgroups
> 3. Better performance when creating many memory cgroups
>
> [1] https://lore.kernel.org/r/20251126171513.GC135004@cmpxchg.org
> [2] https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
Thanks a lot for this awesome work.
> ---
> mm/vmscan.c | 117 ++++++++++++++++------------------------------------
> 1 file changed, 36 insertions(+), 81 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index fddd168a9737..70b0e7e5393c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4895,27 +4895,14 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> return nr_to_scan < 0;
> }
>
> -static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
> +static void shrink_one(struct lruvec *lruvec, struct scan_control *sc)
> {
> - bool success;
> unsigned long scanned = sc->nr_scanned;
> unsigned long reclaimed = sc->nr_reclaimed;
> - struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>
> - /* lru_gen_age_node() called mem_cgroup_calculate_protection() */
> - if (mem_cgroup_below_min(NULL, memcg))
> - return MEMCG_LRU_YOUNG;
> -
> - if (mem_cgroup_below_low(NULL, memcg)) {
> - /* see the comment on MEMCG_NR_GENS */
> - if (READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_TAIL)
> - return MEMCG_LRU_TAIL;
> -
> - memcg_memory_event(memcg, MEMCG_LOW);
> - }
> -
> - success = try_to_shrink_lruvec(lruvec, sc);
> + try_to_shrink_lruvec(lruvec, sc);
>
> shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
>
> @@ -4924,86 +4911,55 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
> sc->nr_reclaimed - reclaimed);
>
> flush_reclaim_state(sc);
Unrelated to your patch, but why is this flush_reclaim_state() at a
different place from the non-MGLRU code path?
> -
> - if (success && mem_cgroup_online(memcg))
> - return MEMCG_LRU_YOUNG;
> -
> - if (!success && lruvec_is_sizable(lruvec, sc))
> - return 0;
> -
> - /* one retry if offlined or too small */
> - return READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_TAIL ?
> - MEMCG_LRU_TAIL : MEMCG_LRU_YOUNG;
> }
>
> static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
This function has become very similar to shrink_node_memcgs(), other
than shrink_one() vs shrink_lruvec(). Can you try to combine them and
see if the result doesn't look ugly? Otherwise the code looks good to me.
* Re: [RFC PATCH -next 1/2] mm/mglru: use mem_cgroup_iter for global reclaim
2025-12-04 18:34 ` Johannes Weiner
@ 2025-12-05 2:57 ` zhongjinji
2025-12-08 2:35 ` Chen Ridong
0 siblings, 1 reply; 10+ messages in thread
From: zhongjinji @ 2025-12-05 2:57 UTC (permalink / raw)
To: hannes
Cc: Liam.Howlett, akpm, axelrasmussen, cgroups, chenridong,
chenridong, corbet, david, linux-doc, linux-kernel, linux-mm,
lorenzo.stoakes, lujialin4, mhocko, muchun.song, roman.gushchin,
rppt, shakeel.butt, surenb, vbabka, weixugc, yuanchu, yuzhao,
zhengqi.arch
> From: Chen Ridong <chenridong@huawei.com>
>
> The memcg LRU was originally introduced for global reclaim to enhance
> scalability. However, its implementation complexity has led to performance
> regressions when dealing with a large number of memory cgroups [1].
>
> As suggested by Johannes [1], this patch adopts mem_cgroup_iter with
> cookie-based iteration for global reclaim, aligning with the approach
> already used in shrink_node_memcgs. This simplification removes the
> dedicated memcg LRU tracking while maintaining the core functionality.
>
> A stress test was performed based on Yu Zhao's methodology [2] on a
> 1 TB, 4-node NUMA system. The results are summarized below:
>
> memcg LRU memcg iter
> stddev(pgsteal) / mean(pgsteal) 91.2% 75.7%
> sum(pgsteal) / sum(requested) 216.4% 230.5%
Are there more data available? For example, the load of kswapd or the refault values.
I am concerned about these two data points because Yu Zhao's implementation controls
the fairness of aging through memcg gen (get_memcg_gen). This helps reduce excessive
aging for certain cgroups, which is beneficial for kswapd's power consumption.
At the same time, pages that age earlier can be considered colder pages (in the entire system),
so reclaiming them should also help with the refault values.
> The new implementation demonstrates a significant improvement in
> fairness, reducing the standard deviation relative to the mean by
> 15.5 percentage points, while total pages reclaimed show a slight
> increase in overscan (from 85,086,871 to 90,633,890, a 6.5% increase).
>
> The primary benefits of this change are:
> 1. Simplified codebase by removing custom memcg LRU infrastructure
> 2. Improved fairness in memory reclaim across multiple cgroups
> 3. Better performance when creating many memory cgroups
>
> [1] https://lore.kernel.org/r/20251126171513.GC135004@cmpxchg.org
> [2] https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
* Re: [RFC PATCH -next 1/2] mm/mglru: use mem_cgroup_iter for global reclaim
2025-12-04 22:29 ` Shakeel Butt
@ 2025-12-08 2:26 ` Chen Ridong
2025-12-08 3:10 ` Chen Ridong
1 sibling, 0 replies; 10+ messages in thread
From: Chen Ridong @ 2025-12-08 2:26 UTC (permalink / raw)
To: Shakeel Butt, Johannes Weiner
Cc: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet, hannes,
roman.gushchin, muchun.song, yuzhao, zhengqi.arch, linux-mm,
linux-doc, linux-kernel, cgroups, lujialin4, chenridong
On 2025/12/5 6:29, Shakeel Butt wrote:
> Hi Chen,
>
> On Thu, Dec 04, 2025 at 12:31:23PM +0000, Chen Ridong wrote:
>> From: Chen Ridong <chenridong@huawei.com>
>>
>> The memcg LRU was originally introduced for global reclaim to enhance
>> scalability. However, its implementation complexity has led to performance
>> regressions when dealing with a large number of memory cgroups [1].
>>
>> As suggested by Johannes [1], this patch adopts mem_cgroup_iter with
>> cookie-based iteration for global reclaim, aligning with the approach
>> already used in shrink_node_memcgs. This simplification removes the
>> dedicated memcg LRU tracking while maintaining the core functionality.
>>
>> A stress test was performed based on Yu Zhao's methodology [2] on a
>> 1 TB, 4-node NUMA system. The results are summarized below:
>>
>> memcg LRU memcg iter
>> stddev(pgsteal) / mean(pgsteal) 91.2% 75.7%
>> sum(pgsteal) / sum(requested) 216.4% 230.5%
>>
>> The new implementation demonstrates a significant improvement in
>> fairness, reducing the standard deviation relative to the mean by
>> 15.5 percentage points, while total pages reclaimed show a slight
>> increase in overscan (from 85,086,871 to 90,633,890, a 6.5% increase).
>>
>> The primary benefits of this change are:
>> 1. Simplified codebase by removing custom memcg LRU infrastructure
>> 2. Improved fairness in memory reclaim across multiple cgroups
>> 3. Better performance when creating many memory cgroups
>>
>> [1] https://lore.kernel.org/r/20251126171513.GC135004@cmpxchg.org
>> [2] https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com
>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
>
> Thanks a lot for this awesome work.
>
Hello Shakeel and Johannes,
I apologize for the incorrect results I provided earlier. I initially used an AI tool to process the
data (I admit that was lazy of me; please forget that). When I re-ran the test to re-extract the
refault data and processed it again, I found that the AI tool had given me the wrong output.
I have now processed the data manually in Excel, and the correct results are:
pgsteal:
                                  memcg LRU    memcg iter
stddev(pgsteal) / mean(pgsteal)     106.03%        93.20%
sum(pgsteal) / sum(requested)        98.10%        99.28%

workingset_refault_anon:
                                  memcg LRU    memcg iter
stddev(refault) / mean(refault)     193.97%       134.67%
sum(refault)                        1963229       2027567
I believe these final results are much better than the previous incorrect ones, especially since the
pgsteal ratio is now close to 100%, indicating we are not over-scanning. Additionally, refaults
increased by 64,238 (a 3.2% rise).
Let me know if you have any questions.
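In case it helps review, the summary statistics above can be re-derived from
the raw data below with a small standalone helper along the lines of the
sketch here (user-space C, not kernel code; the reclaim-stats.c name and the
output format are only for illustration). It computes the sample standard
deviation, i.e. the same quantity as the spreadsheet STDEV() used above:

/* reclaim-stats.c: hypothetical user-space helper, not kernel code.
 * Recomputes sum, mean and sample stddev from lines of the form
 * "pgsteal <N> workingset_refault_anon <M>" fed on stdin.
 * Build: gcc -O2 -o reclaim-stats reclaim-stats.c -lm
 * Use:   ./reclaim-stats < raw-data.txt
 */
#include <math.h>
#include <stdio.h>

#define MAX_SAMPLES 4096

int main(void)
{
        static double steal[MAX_SAMPLES], refault[MAX_SAMPLES];
        double ssum = 0.0, rsum = 0.0;
        int i, n = 0;

        while (n < MAX_SAMPLES &&
               scanf(" pgsteal %lf workingset_refault_anon %lf",
                     &steal[n], &refault[n]) == 2) {
                ssum += steal[n];
                rsum += refault[n];
                n++;
        }
        if (n < 2) {
                fprintf(stderr, "need at least two samples\n");
                return 1;
        }

        double smean = ssum / n, rmean = rsum / n;
        double svar = 0.0, rvar = 0.0;

        for (i = 0; i < n; i++) {
                svar += (steal[i] - smean) * (steal[i] - smean);
                rvar += (refault[i] - rmean) * (refault[i] - rmean);
        }
        /* sample stddev (divide by n - 1), matching spreadsheet STDEV() */
        double sstd = sqrt(svar / (n - 1));
        double rstd = sqrt(rvar / (n - 1));

        printf("samples: %d\n", n);
        printf("pgsteal: sum=%.0f mean=%.2f stddev=%.2f stddev/mean=%.2f%%\n",
               ssum, smean, sstd, 100.0 * sstd / smean);
        printf("refault: sum=%.0f mean=%.2f stddev=%.2f stddev/mean=%.2f%%\n",
               rsum, rmean, rstd, 100.0 * rstd / rmean);
        return 0;
}

Feeding one data set at a time on stdin should reproduce the SUM/AVERAGE/STDEV
lines shown below.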
----------------------------------------------------------------------
The original data memcg LRU:
pgsteal:
SUM: 38572704 AVERAGE: 301349.25 STDEV: 319518.5965
refault:
SUM: 1963229 AVERAGE: 15337.72656 STDEV: 29750.03391
pgsteal 655392 workingset_refault_anon 17131
pgsteal 657308 workingset_refault_anon 24841
pgsteal 103777 workingset_refault_anon 430
pgsteal 103134 workingset_refault_anon 884
pgsteal 964772 workingset_refault_anon 117159
pgsteal 103462 workingset_refault_anon 539
pgsteal 102878 workingset_refault_anon 25
pgsteal 707851 workingset_refault_anon 30634
pgsteal 103925 workingset_refault_anon 497
pgsteal 103913 workingset_refault_anon 953
pgsteal 103020 workingset_refault_anon 110
pgsteal 102871 workingset_refault_anon 607
pgsteal 697775 workingset_refault_anon 21529
pgsteal 102944 workingset_refault_anon 57
pgsteal 103090 workingset_refault_anon 819
pgsteal 102988 workingset_refault_anon 583
pgsteal 102987 workingset_refault_anon 108
pgsteal 103093 workingset_refault_anon 17
pgsteal 778016 workingset_refault_anon 79000
pgsteal 102920 workingset_refault_anon 14
pgsteal 655447 workingset_refault_anon 9069
pgsteal 102869 workingset_refault_anon 6
pgsteal 699920 workingset_refault_anon 34409
pgsteal 103127 workingset_refault_anon 223
pgsteal 102876 workingset_refault_anon 646
pgsteal 103642 workingset_refault_anon 439
pgsteal 102881 workingset_refault_anon 110
pgsteal 863202 workingset_refault_anon 77605
pgsteal 651786 workingset_refault_anon 8322
pgsteal 102981 workingset_refault_anon 51
pgsteal 103380 workingset_refault_anon 877
pgsteal 706377 workingset_refault_anon 27729
pgsteal 103436 workingset_refault_anon 682
pgsteal 103839 workingset_refault_anon 336
pgsteal 103012 workingset_refault_anon 23
pgsteal 103476 workingset_refault_anon 729
pgsteal 102867 workingset_refault_anon 12
pgsteal 102914 workingset_refault_anon 122
pgsteal 102886 workingset_refault_anon 627
pgsteal 103736 workingset_refault_anon 514
pgsteal 102879 workingset_refault_anon 618
pgsteal 102860 workingset_refault_anon 3
pgsteal 102877 workingset_refault_anon 27
pgsteal 103255 workingset_refault_anon 384
pgsteal 982183 workingset_refault_anon 85362
pgsteal 102947 workingset_refault_anon 158
pgsteal 102880 workingset_refault_anon 651
pgsteal 973764 workingset_refault_anon 81542
pgsteal 923711 workingset_refault_anon 94596
pgsteal 102938 workingset_refault_anon 660
pgsteal 888882 workingset_refault_anon 69549
pgsteal 102868 workingset_refault_anon 14
pgsteal 103130 workingset_refault_anon 166
pgsteal 103388 workingset_refault_anon 467
pgsteal 102965 workingset_refault_anon 197
pgsteal 964699 workingset_refault_anon 74903
pgsteal 103263 workingset_refault_anon 373
pgsteal 103614 workingset_refault_anon 781
pgsteal 962228 workingset_refault_anon 72108
pgsteal 672174 workingset_refault_anon 19739
pgsteal 102920 workingset_refault_anon 19
pgsteal 670248 workingset_refault_anon 18411
pgsteal 102877 workingset_refault_anon 581
pgsteal 103758 workingset_refault_anon 871
pgsteal 102874 workingset_refault_anon 609
pgsteal 103075 workingset_refault_anon 274
pgsteal 103550 workingset_refault_anon 102
pgsteal 755180 workingset_refault_anon 44303
pgsteal 951252 workingset_refault_anon 84566
pgsteal 929144 workingset_refault_anon 99081
pgsteal 103207 workingset_refault_anon 30
pgsteal 103292 workingset_refault_anon 427
pgsteal 103271 workingset_refault_anon 332
pgsteal 102865 workingset_refault_anon 4
pgsteal 923280 workingset_refault_anon 72715
pgsteal 104682 workingset_refault_anon 372
pgsteal 102870 workingset_refault_anon 7
pgsteal 102902 workingset_refault_anon 661
pgsteal 103053 workingset_refault_anon 40
pgsteal 103685 workingset_refault_anon 540
pgsteal 103857 workingset_refault_anon 970
pgsteal 109210 workingset_refault_anon 2806
pgsteal 103627 workingset_refault_anon 319
pgsteal 104029 workingset_refault_anon 42
pgsteal 918361 workingset_refault_anon 90387
pgsteal 103489 workingset_refault_anon 626
pgsteal 103188 workingset_refault_anon 801
pgsteal 102875 workingset_refault_anon 11
pgsteal 102994 workingset_refault_anon 79
pgsteal 102910 workingset_refault_anon 43
pgsteal 102922 workingset_refault_anon 687
pgsteal 103941 workingset_refault_anon 1219
pgsteal 903622 workingset_refault_anon 113751
pgsteal 664357 workingset_refault_anon 27959
pgsteal 104947 workingset_refault_anon 11
pgsteal 701084 workingset_refault_anon 30665
pgsteal 650719 workingset_refault_anon 20810
pgsteal 641924 workingset_refault_anon 17137
pgsteal 933870 workingset_refault_anon 98393
pgsteal 633231 workingset_refault_anon 15924
pgsteal 102936 workingset_refault_anon 34
pgsteal 104020 workingset_refault_anon 781
pgsteal 104274 workingset_refault_anon 1841
pgsteal 621672 workingset_refault_anon 5891
pgsteal 103307 workingset_refault_anon 474
pgsteal 103386 workingset_refault_anon 27
pgsteal 103266 workingset_refault_anon 243
pgsteal 102896 workingset_refault_anon 15
pgsteal 103905 workingset_refault_anon 988
pgsteal 103104 workingset_refault_anon 304
pgsteal 104277 workingset_refault_anon 285
pgsteal 696374 workingset_refault_anon 24971
pgsteal 103009 workingset_refault_anon 775
pgsteal 103849 workingset_refault_anon 747
pgsteal 102867 workingset_refault_anon 9
pgsteal 700211 workingset_refault_anon 35289
pgsteal 102923 workingset_refault_anon 88
pgsteal 104139 workingset_refault_anon 789
pgsteal 105152 workingset_refault_anon 1257
pgsteal 102945 workingset_refault_anon 76
pgsteal 103227 workingset_refault_anon 343
pgsteal 102880 workingset_refault_anon 95
pgsteal 102967 workingset_refault_anon 101
pgsteal 989176 workingset_refault_anon 89597
pgsteal 694181 workingset_refault_anon 22499
pgsteal 784354 workingset_refault_anon 68311
pgsteal 102882 workingset_refault_anon 24
pgsteal 103108 workingset_refault_anon 24
-------------------------------------------------------------------
The original data memcg iter:
pgsteal:
SUM: 39036863 AVERAGE: 304975.4922 STDEV: 284226.526
refault:
SUM: 2027567 AVERAGE: 15840.36719 STDEV: 21332.00262
pgsteal 103167 workingset_refault_anon 203
pgsteal 714044 workingset_refault_anon 42633
pgsteal 103209 workingset_refault_anon 581
pgsteal 103605 workingset_refault_anon 240
pgsteal 740909 workingset_refault_anon 53177
pgsteal 103089 workingset_refault_anon 141
pgsteal 726760 workingset_refault_anon 32624
pgsteal 104039 workingset_refault_anon 397
pgsteal 754667 workingset_refault_anon 56144
pgsteal 713916 workingset_refault_anon 41813
pgsteal 104104 workingset_refault_anon 307
pgsteal 109567 workingset_refault_anon 244
pgsteal 714194 workingset_refault_anon 47076
pgsteal 711693 workingset_refault_anon 35616
pgsteal 105026 workingset_refault_anon 2221
pgsteal 103442 workingset_refault_anon 269
pgsteal 112773 workingset_refault_anon 5086
pgsteal 715969 workingset_refault_anon 32457
pgsteal 127828 workingset_refault_anon 9579
pgsteal 102885 workingset_refault_anon 109
pgsteal 112156 workingset_refault_anon 2974
pgsteal 104242 workingset_refault_anon 948
pgsteal 701184 workingset_refault_anon 47940
pgsteal 104080 workingset_refault_anon 836
pgsteal 106606 workingset_refault_anon 2420
pgsteal 103666 workingset_refault_anon 129
pgsteal 103330 workingset_refault_anon 532
pgsteal 103639 workingset_refault_anon 275
pgsteal 108494 workingset_refault_anon 3814
pgsteal 103626 workingset_refault_anon 412
pgsteal 103697 workingset_refault_anon 577
pgsteal 103736 workingset_refault_anon 582
pgsteal 103360 workingset_refault_anon 281
pgsteal 116733 workingset_refault_anon 6674
pgsteal 102978 workingset_refault_anon 5
pgsteal 108945 workingset_refault_anon 3141
pgsteal 706630 workingset_refault_anon 33241
pgsteal 103426 workingset_refault_anon 134
pgsteal 715070 workingset_refault_anon 33575
pgsteal 102871 workingset_refault_anon 12
pgsteal 103617 workingset_refault_anon 776
pgsteal 767084 workingset_refault_anon 64710
pgsteal 104197 workingset_refault_anon 176
pgsteal 104488 workingset_refault_anon 1469
pgsteal 103253 workingset_refault_anon 228
pgsteal 702800 workingset_refault_anon 26424
pgsteal 107469 workingset_refault_anon 2838
pgsteal 104441 workingset_refault_anon 1562
pgsteal 123013 workingset_refault_anon 13117
pgsteal 737817 workingset_refault_anon 53330
pgsteal 103939 workingset_refault_anon 759
pgsteal 103568 workingset_refault_anon 783
pgsteal 122707 workingset_refault_anon 11944
pgsteal 103690 workingset_refault_anon 885
pgsteal 103456 workingset_refault_anon 145
pgsteal 104068 workingset_refault_anon 632
pgsteal 319368 workingset_refault_anon 12579
pgsteal 103912 workingset_refault_anon 304
pgsteal 119416 workingset_refault_anon 3350
pgsteal 717107 workingset_refault_anon 34764
pgsteal 107163 workingset_refault_anon 535
pgsteal 103299 workingset_refault_anon 142
pgsteal 103825 workingset_refault_anon 176
pgsteal 408564 workingset_refault_anon 14606
pgsteal 115785 workingset_refault_anon 4622
pgsteal 119234 workingset_refault_anon 9225
pgsteal 729060 workingset_refault_anon 54309
pgsteal 107149 workingset_refault_anon 536
pgsteal 708839 workingset_refault_anon 43133
pgsteal 695961 workingset_refault_anon 40182
pgsteal 723303 workingset_refault_anon 32298
pgsteal 103581 workingset_refault_anon 1305
pgsteal 699646 workingset_refault_anon 49924
pgsteal 717867 workingset_refault_anon 39229
pgsteal 104148 workingset_refault_anon 1318
pgsteal 104127 workingset_refault_anon 568
pgsteal 103168 workingset_refault_anon 322
pgsteal 103477 workingset_refault_anon 538
pgsteal 103022 workingset_refault_anon 60
pgsteal 103305 workingset_refault_anon 323
pgsteal 103812 workingset_refault_anon 1324
pgsteal 103139 workingset_refault_anon 126
pgsteal 723251 workingset_refault_anon 34206
pgsteal 103068 workingset_refault_anon 861
pgsteal 742515 workingset_refault_anon 54439
pgsteal 762161 workingset_refault_anon 52654
pgsteal 103934 workingset_refault_anon 889
pgsteal 104065 workingset_refault_anon 315
pgsteal 383893 workingset_refault_anon 25036
pgsteal 107929 workingset_refault_anon 2367
pgsteal 726127 workingset_refault_anon 45809
pgsteal 675291 workingset_refault_anon 66534
pgsteal 105585 workingset_refault_anon 2323
pgsteal 105098 workingset_refault_anon 1625
pgsteal 104264 workingset_refault_anon 718
pgsteal 741873 workingset_refault_anon 47045
pgsteal 103466 workingset_refault_anon 70
pgsteal 723870 workingset_refault_anon 58780
pgsteal 104740 workingset_refault_anon 521
pgsteal 740739 workingset_refault_anon 45099
pgsteal 752994 workingset_refault_anon 53713
pgsteal 110164 workingset_refault_anon 2572
pgsteal 711304 workingset_refault_anon 41135
pgsteal 746870 workingset_refault_anon 60298
pgsteal 729166 workingset_refault_anon 42594
pgsteal 110138 workingset_refault_anon 1511
pgsteal 103836 workingset_refault_anon 675
pgsteal 116821 workingset_refault_anon 3952
pgsteal 104967 workingset_refault_anon 2035
pgsteal 711362 workingset_refault_anon 31458
pgsteal 103835 workingset_refault_anon 507
pgsteal 113846 workingset_refault_anon 2997
pgsteal 104406 workingset_refault_anon 1724
pgsteal 103551 workingset_refault_anon 1293
pgsteal 705340 workingset_refault_anon 44234
pgsteal 728076 workingset_refault_anon 29849
pgsteal 103829 workingset_refault_anon 254
pgsteal 103700 workingset_refault_anon 712
pgsteal 103382 workingset_refault_anon 506
pgsteal 728881 workingset_refault_anon 60152
pgsteal 614645 workingset_refault_anon 43956
pgsteal 107672 workingset_refault_anon 2768
pgsteal 123550 workingset_refault_anon 11937
pgsteal 103747 workingset_refault_anon 899
pgsteal 747657 workingset_refault_anon 50264
pgsteal 110949 workingset_refault_anon 1422
pgsteal 103596 workingset_refault_anon 278
pgsteal 742471 workingset_refault_anon 69586
--
Best regards,
Ridong
* Re: [RFC PATCH -next 1/2] mm/mglru: use mem_cgroup_iter for global reclaim
2025-12-05 2:57 ` [RFC PATCH -next 1/2] mm/mglru: use mem_cgroup_iter for global reclaim zhongjinji
@ 2025-12-08 2:35 ` Chen Ridong
0 siblings, 0 replies; 10+ messages in thread
From: Chen Ridong @ 2025-12-08 2:35 UTC (permalink / raw)
To: zhongjinji, hannes
Cc: Liam.Howlett, akpm, axelrasmussen, cgroups, chenridong, corbet,
david, linux-doc, linux-kernel, linux-mm, lorenzo.stoakes,
lujialin4, mhocko, muchun.song, roman.gushchin, rppt,
shakeel.butt, surenb, vbabka, weixugc, yuanchu, yuzhao,
zhengqi.arch
On 2025/12/5 10:57, zhongjinji wrote:
>> From: Chen Ridong <chenridong@huawei.com>
>>
>> The memcg LRU was originally introduced for global reclaim to enhance
>> scalability. However, its implementation complexity has led to performance
>> regressions when dealing with a large number of memory cgroups [1].
>>
>> As suggested by Johannes [1], this patch adopts mem_cgroup_iter with
>> cookie-based iteration for global reclaim, aligning with the approach
>> already used in shrink_node_memcgs. This simplification removes the
>> dedicated memcg LRU tracking while maintaining the core functionality.
>>
>> A stress test was performed based on Yu Zhao's methodology [2] on a
>> 1 TB, 4-node NUMA system. The results are summarized below:
>>
>> memcg LRU memcg iter
>> stddev(pgsteal) / mean(pgsteal) 91.2% 75.7%
>> sum(pgsteal) / sum(requested) 216.4% 230.5%
>
> Are there more data available? For example, the load of kswapd or the refault values.
>
> I am concerned about these two data points because Yu Zhao's implementation controls
> the fairness of aging through memcg gen (get_memcg_gen). This helps reduce excessive
> aging for certain cgroups, which is beneficial for kswapd's power consumption.
>
> At the same time, pages that age earlier can be considered colder pages (in the entire system),
> so reclaiming them should also help with the refault values.
>
I re-ran the test and observed a 3.2% increase in refaults. Does this address the concern you
raised?
The complete data set is available in my earlier email:
https://lore.kernel.org/all/e657d5ac-6f92-4dbb-bf32-76084988d024@huaweicloud.com/
>> The new implementation demonstrates a significant improvement in
>> fairness, reducing the standard deviation relative to the mean by
>> 15.5 percentage points, while total pages reclaimed show a slight
>> increase in overscan (from 85,086,871 to 90,633,890, a 6.5% increase).
>>
>> The primary benefits of this change are:
>> 1. Simplified codebase by removing custom memcg LRU infrastructure
>> 2. Improved fairness in memory reclaim across multiple cgroups
>> 3. Better performance when creating many memory cgroups
>>
>> [1] https://lore.kernel.org/r/20251126171513.GC135004@cmpxchg.org
>> [2] https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com
>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
--
Best regards,
Ridong
* Re: [RFC PATCH -next 1/2] mm/mglru: use mem_cgroup_iter for global reclaim
2025-12-04 22:29 ` Shakeel Butt
2025-12-08 2:26 ` Chen Ridong
@ 2025-12-08 3:10 ` Chen Ridong
1 sibling, 0 replies; 10+ messages in thread
From: Chen Ridong @ 2025-12-08 3:10 UTC (permalink / raw)
To: Shakeel Butt
Cc: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet, hannes,
roman.gushchin, muchun.song, yuzhao, zhengqi.arch, linux-mm,
linux-doc, linux-kernel, cgroups, lujialin4, chenridong
On 2025/12/5 6:29, Shakeel Butt wrote:
> Hi Chen,
>
> On Thu, Dec 04, 2025 at 12:31:23PM +0000, Chen Ridong wrote:
>> From: Chen Ridong <chenridong@huawei.com>
>>
>> The memcg LRU was originally introduced for global reclaim to enhance
>> scalability. However, its implementation complexity has led to performance
>> regressions when dealing with a large number of memory cgroups [1].
>>
>> As suggested by Johannes [1], this patch adopts mem_cgroup_iter with
>> cookie-based iteration for global reclaim, aligning with the approach
>> already used in shrink_node_memcgs. This simplification removes the
>> dedicated memcg LRU tracking while maintaining the core functionality.
>>
>> A stress test was performed based on Yu Zhao's methodology [2] on a
>> 1 TB, 4-node NUMA system. The results are summarized below:
>>
>> memcg LRU memcg iter
>> stddev(pgsteal) / mean(pgsteal) 91.2% 75.7%
>> sum(pgsteal) / sum(requested) 216.4% 230.5%
>>
>> The new implementation demonstrates a significant improvement in
>> fairness, reducing the standard deviation relative to the mean by
>> 15.5 percentage points, while total pages reclaimed show a slight
>> increase in overscan (from 85,086,871 to 90,633,890, a 6.5% increase).
>>
>> The primary benefits of this change are:
>> 1. Simplified codebase by removing custom memcg LRU infrastructure
>> 2. Improved fairness in memory reclaim across multiple cgroups
>> 3. Better performance when creating many memory cgroups
>>
>> [1] https://lore.kernel.org/r/20251126171513.GC135004@cmpxchg.org
>> [2] https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com
>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
>
> Thanks a lot for this awesome work.
>
>> ---
>> mm/vmscan.c | 117 ++++++++++++++++------------------------------------
>> 1 file changed, 36 insertions(+), 81 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index fddd168a9737..70b0e7e5393c 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -4895,27 +4895,14 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>> return nr_to_scan < 0;
>> }
>>
>> -static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
>> +static void shrink_one(struct lruvec *lruvec, struct scan_control *sc)
>> {
>> - bool success;
>> unsigned long scanned = sc->nr_scanned;
>> unsigned long reclaimed = sc->nr_reclaimed;
>> - struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>> struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>> + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>
>> - /* lru_gen_age_node() called mem_cgroup_calculate_protection() */
>> - if (mem_cgroup_below_min(NULL, memcg))
>> - return MEMCG_LRU_YOUNG;
>> -
>> - if (mem_cgroup_below_low(NULL, memcg)) {
>> - /* see the comment on MEMCG_NR_GENS */
>> - if (READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_TAIL)
>> - return MEMCG_LRU_TAIL;
>> -
>> - memcg_memory_event(memcg, MEMCG_LOW);
>> - }
>> -
>> - success = try_to_shrink_lruvec(lruvec, sc);
>> + try_to_shrink_lruvec(lruvec, sc);
>>
>> shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
>>
>> @@ -4924,86 +4911,55 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
>> sc->nr_reclaimed - reclaimed);
>>
>> flush_reclaim_state(sc);
>
> Unrelated to your patch, but why is this flush_reclaim_state() at a
> different place from the non-MGLRU code path?
>
Thank you, Shakeel, for your reply.
IIUC, I think adding flush_reclaim_state here makes sense. Currently, shrink_one is only used for
root-level reclaim in gen-LRU, and flush_reclaim_state is only relevant during root reclaim.
Flushing after each lruvec is shrunk could help the reclaim loop terminate earlier, as
sc->nr_reclaimed += current->reclaim_state->reclaimed; may reach nr_to_reclaim sooner.
That said, I'm also wondering whether we should apply flush_reclaim_state for every iteration in
non-MGLRU reclaim as well. For non-root reclaim, it should be negligible since it effectively does
nothing. But for root-level reclaim under non-MGLRU, it might similarly help stop the iteration earlier.
>> -
>> - if (success && mem_cgroup_online(memcg))
>> - return MEMCG_LRU_YOUNG;
>> -
>> - if (!success && lruvec_is_sizable(lruvec, sc))
>> - return 0;
>> -
>> - /* one retry if offlined or too small */
>> - return READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_TAIL ?
>> - MEMCG_LRU_TAIL : MEMCG_LRU_YOUNG;
>> }
>>
>> static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
>
> This function has become very similar to shrink_node_memcgs(), other
> than shrink_one() vs shrink_lruvec(). Can you try to combine them and
> see if the result doesn't look ugly? Otherwise the code looks good to me.
>
Will try to.
--
Best regards,
Ridong