* [PATCH -next 1/5] mm/mglru: use mem_cgroup_iter for global reclaim
2025-12-09 1:25 [PATCH -next 0/5] mm/mglru: remove memcg lru Chen Ridong
@ 2025-12-09 1:25 ` Chen Ridong
2025-12-22 3:12 ` Shakeel Butt
2025-12-09 1:25 ` [PATCH -next 2/5] mm/mglru: remove memcg lru Chen Ridong
` (5 subsequent siblings)
6 siblings, 1 reply; 25+ messages in thread
From: Chen Ridong @ 2025-12-09 1:25 UTC (permalink / raw)
To: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet, hannes,
roman.gushchin, shakeel.butt, muchun.song, zhengqi.arch
Cc: linux-mm, linux-doc, linux-kernel, cgroups, lujialin4,
chenridong, zhongjinji
From: Chen Ridong <chenridong@huawei.com>
The memcg LRU was originally introduced for global reclaim to enhance
scalability. However, its implementation complexity has led to performance
regressions when dealing with a large number of memory cgroups [1].
As suggested by Johannes [1], this patch adopts mem_cgroup_iter with
cookie-based iteration for global reclaim, aligning with the approach
already used in shrink_node_memcgs. This simplification removes the
dedicated memcg LRU tracking while maintaining the core functionality.
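For reference, the cookie-based walk used by shrink_node_memcgs() has roughly
the following shape (a simplified sketch; protection checks and the slab
shrinker are omitted):

	struct mem_cgroup *target = sc->target_mem_cgroup;
	struct mem_cgroup_reclaim_cookie reclaim = { .pgdat = pgdat };
	struct mem_cgroup *memcg;

	memcg = mem_cgroup_iter(target, NULL, &reclaim);
	do {
		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);

		shrink_lruvec(lruvec, sc);

		/* partial walks bail out once the reclaim goal is met */
		if (sc->nr_reclaimed >= sc->nr_to_reclaim) {
			mem_cgroup_iter_break(target, memcg);
			break;
		}
	} while ((memcg = mem_cgroup_iter(target, memcg, &reclaim)));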
A stress test was performed based on Yu Zhao's methodology [2] on a
1 TB, 4-node NUMA system. The results are summarized below:
pgsteal:
memcg LRU memcg iter
stddev(pgsteal) / mean(pgsteal) 106.03% 93.20%
sum(pgsteal) / sum(requested) 98.10% 99.28%
workingset_refault_anon:
memcg LRU memcg iter
stddev(refault) / mean(refault) 193.97% 134.67%
sum(refault) 1963229 2027567
The new implementation shows a clear fairness improvement, reducing the
standard deviation relative to the mean by 12.8 percentage points. The
pgsteal ratio is also closer to 100%. Refault counts increased by 3.2%
(from 1,963,229 to 2,027,567).
The primary benefits of this change are:
1. Simplified codebase by removing custom memcg LRU infrastructure
2. Improved fairness in memory reclaim across multiple cgroups
3. Better performance when creating many memory cgroups
[1] https://lore.kernel.org/r/20251126171513.GC135004@cmpxchg.org
[2] https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/vmscan.c | 117 ++++++++++++++++------------------------------------
1 file changed, 36 insertions(+), 81 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fddd168a9737..70b0e7e5393c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4895,27 +4895,14 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
return nr_to_scan < 0;
}
-static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
+static void shrink_one(struct lruvec *lruvec, struct scan_control *sc)
{
- bool success;
unsigned long scanned = sc->nr_scanned;
unsigned long reclaimed = sc->nr_reclaimed;
- struct mem_cgroup *memcg = lruvec_memcg(lruvec);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
- /* lru_gen_age_node() called mem_cgroup_calculate_protection() */
- if (mem_cgroup_below_min(NULL, memcg))
- return MEMCG_LRU_YOUNG;
-
- if (mem_cgroup_below_low(NULL, memcg)) {
- /* see the comment on MEMCG_NR_GENS */
- if (READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_TAIL)
- return MEMCG_LRU_TAIL;
-
- memcg_memory_event(memcg, MEMCG_LOW);
- }
-
- success = try_to_shrink_lruvec(lruvec, sc);
+ try_to_shrink_lruvec(lruvec, sc);
shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
@@ -4924,86 +4911,55 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
sc->nr_reclaimed - reclaimed);
flush_reclaim_state(sc);
-
- if (success && mem_cgroup_online(memcg))
- return MEMCG_LRU_YOUNG;
-
- if (!success && lruvec_is_sizable(lruvec, sc))
- return 0;
-
- /* one retry if offlined or too small */
- return READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_TAIL ?
- MEMCG_LRU_TAIL : MEMCG_LRU_YOUNG;
}
static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
{
- int op;
- int gen;
- int bin;
- int first_bin;
- struct lruvec *lruvec;
- struct lru_gen_folio *lrugen;
+ struct mem_cgroup *target = sc->target_mem_cgroup;
+ struct mem_cgroup_reclaim_cookie reclaim = {
+ .pgdat = pgdat,
+ };
+ struct mem_cgroup_reclaim_cookie *cookie = &reclaim;
struct mem_cgroup *memcg;
- struct hlist_nulls_node *pos;
- gen = get_memcg_gen(READ_ONCE(pgdat->memcg_lru.seq));
- bin = first_bin = get_random_u32_below(MEMCG_NR_BINS);
-restart:
- op = 0;
- memcg = NULL;
-
- rcu_read_lock();
+ if (current_is_kswapd() || sc->memcg_full_walk)
+ cookie = NULL;
- hlist_nulls_for_each_entry_rcu(lrugen, pos, &pgdat->memcg_lru.fifo[gen][bin], list) {
- if (op) {
- lru_gen_rotate_memcg(lruvec, op);
- op = 0;
- }
+ memcg = mem_cgroup_iter(target, NULL, cookie);
+ while (memcg) {
+ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
- mem_cgroup_put(memcg);
- memcg = NULL;
+ cond_resched();
- if (gen != READ_ONCE(lrugen->gen))
- continue;
+ mem_cgroup_calculate_protection(target, memcg);
- lruvec = container_of(lrugen, struct lruvec, lrugen);
- memcg = lruvec_memcg(lruvec);
+ if (mem_cgroup_below_min(target, memcg))
+ goto next;
- if (!mem_cgroup_tryget(memcg)) {
- lru_gen_release_memcg(memcg);
- memcg = NULL;
- continue;
+ if (mem_cgroup_below_low(target, memcg)) {
+ if (!sc->memcg_low_reclaim) {
+ sc->memcg_low_skipped = 1;
+ goto next;
+ }
+ memcg_memory_event(memcg, MEMCG_LOW);
}
- rcu_read_unlock();
+ shrink_one(lruvec, sc);
- op = shrink_one(lruvec, sc);
-
- rcu_read_lock();
-
- if (should_abort_scan(lruvec, sc))
+ if (should_abort_scan(lruvec, sc)) {
+ if (cookie)
+ mem_cgroup_iter_break(target, memcg);
break;
- }
-
- rcu_read_unlock();
-
- if (op)
- lru_gen_rotate_memcg(lruvec, op);
-
- mem_cgroup_put(memcg);
-
- if (!is_a_nulls(pos))
- return;
+ }
- /* restart if raced with lru_gen_rotate_memcg() */
- if (gen != get_nulls_value(pos))
- goto restart;
+next:
+ if (cookie && sc->nr_reclaimed >= sc->nr_to_reclaim) {
+ mem_cgroup_iter_break(target, memcg);
+ break;
+ }
- /* try the rest of the bins of the current generation */
- bin = get_memcg_bin(bin + 1);
- if (bin != first_bin)
- goto restart;
+ memcg = mem_cgroup_iter(target, memcg, cookie);
+ }
}
static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
@@ -5019,8 +4975,7 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
set_mm_walk(NULL, sc->proactive);
- if (try_to_shrink_lruvec(lruvec, sc))
- lru_gen_rotate_memcg(lruvec, MEMCG_LRU_YOUNG);
+ try_to_shrink_lruvec(lruvec, sc);
clear_mm_walk();
--
2.34.1
* Re: [PATCH -next 1/5] mm/mglru: use mem_cgroup_iter for global reclaim
2025-12-09 1:25 ` [PATCH -next 1/5] mm/mglru: use mem_cgroup_iter for global reclaim Chen Ridong
@ 2025-12-22 3:12 ` Shakeel Butt
2025-12-22 7:27 ` Chen Ridong
0 siblings, 1 reply; 25+ messages in thread
From: Shakeel Butt @ 2025-12-22 3:12 UTC (permalink / raw)
To: Chen Ridong
Cc: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet, hannes,
roman.gushchin, muchun.song, zhengqi.arch, linux-mm, linux-doc,
linux-kernel, cgroups, lujialin4, zhongjinji
On Tue, Dec 09, 2025 at 01:25:53AM +0000, Chen Ridong wrote:
> From: Chen Ridong <chenridong@huawei.com>
>
> The memcg LRU was originally introduced for global reclaim to enhance
> scalability. However, its implementation complexity has led to performance
> regressions when dealing with a large number of memory cgroups [1].
>
> As suggested by Johannes [1], this patch adopts mem_cgroup_iter with
> cookie-based iteration for global reclaim, aligning with the approach
> already used in shrink_node_memcgs. This simplification removes the
> dedicated memcg LRU tracking while maintaining the core functionality.
>
> It performed a stress test based on Yu Zhao's methodology [2] on a
> 1 TB, 4-node NUMA system. The results are summarized below:
>
> pgsteal:
> memcg LRU memcg iter
> stddev(pgsteal) / mean(pgsteal) 106.03% 93.20%
> sum(pgsteal) / sum(requested) 98.10% 99.28%
>
> workingset_refault_anon:
> memcg LRU memcg iter
> stddev(refault) / mean(refault) 193.97% 134.67%
> sum(refault) 1963229 2027567
>
> The new implementation shows a clear fairness improvement, reducing the
> standard deviation relative to the mean by 12.8 percentage points. The
> pgsteal ratio is also closer to 100%. Refault counts increased by 3.2%
> (from 1,963,229 to 2,027,567).
>
> The primary benefits of this change are:
> 1. Simplified codebase by removing custom memcg LRU infrastructure
> 2. Improved fairness in memory reclaim across multiple cgroups
> 3. Better performance when creating many memory cgroups
>
> [1] https://lore.kernel.org/r/20251126171513.GC135004@cmpxchg.org
> [2] https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com
> Suggested-by: Johannes Weiner <hannes@cmxpchg.org>
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
> Acked-by: Johannes Weiner <hannes@cmxpchg.org>
> ---
> mm/vmscan.c | 117 ++++++++++++++++------------------------------------
> 1 file changed, 36 insertions(+), 81 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index fddd168a9737..70b0e7e5393c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4895,27 +4895,14 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> return nr_to_scan < 0;
> }
>
> -static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
> +static void shrink_one(struct lruvec *lruvec, struct scan_control *sc)
> {
> - bool success;
> unsigned long scanned = sc->nr_scanned;
> unsigned long reclaimed = sc->nr_reclaimed;
> - struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>
> - /* lru_gen_age_node() called mem_cgroup_calculate_protection() */
> - if (mem_cgroup_below_min(NULL, memcg))
> - return MEMCG_LRU_YOUNG;
> -
> - if (mem_cgroup_below_low(NULL, memcg)) {
> - /* see the comment on MEMCG_NR_GENS */
> - if (READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_TAIL)
> - return MEMCG_LRU_TAIL;
> -
> - memcg_memory_event(memcg, MEMCG_LOW);
> - }
> -
> - success = try_to_shrink_lruvec(lruvec, sc);
> + try_to_shrink_lruvec(lruvec, sc);
>
> shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
>
> @@ -4924,86 +4911,55 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
> sc->nr_reclaimed - reclaimed);
>
> flush_reclaim_state(sc);
> -
> - if (success && mem_cgroup_online(memcg))
> - return MEMCG_LRU_YOUNG;
> -
> - if (!success && lruvec_is_sizable(lruvec, sc))
> - return 0;
> -
> - /* one retry if offlined or too small */
> - return READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_TAIL ?
> - MEMCG_LRU_TAIL : MEMCG_LRU_YOUNG;
> }
>
> static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
> {
> - int op;
> - int gen;
> - int bin;
> - int first_bin;
> - struct lruvec *lruvec;
> - struct lru_gen_folio *lrugen;
> + struct mem_cgroup *target = sc->target_mem_cgroup;
> + struct mem_cgroup_reclaim_cookie reclaim = {
> + .pgdat = pgdat,
> + };
> + struct mem_cgroup_reclaim_cookie *cookie = &reclaim;
Please keep the naming the same as in shrink_node_memcgs, i.e. use 'partial'
here.
> struct mem_cgroup *memcg;
> - struct hlist_nulls_node *pos;
>
> - gen = get_memcg_gen(READ_ONCE(pgdat->memcg_lru.seq));
> - bin = first_bin = get_random_u32_below(MEMCG_NR_BINS);
> -restart:
> - op = 0;
> - memcg = NULL;
> -
> - rcu_read_lock();
> + if (current_is_kswapd() || sc->memcg_full_walk)
> + cookie = NULL;
>
> - hlist_nulls_for_each_entry_rcu(lrugen, pos, &pgdat->memcg_lru.fifo[gen][bin], list) {
> - if (op) {
> - lru_gen_rotate_memcg(lruvec, op);
> - op = 0;
> - }
> + memcg = mem_cgroup_iter(target, NULL, cookie);
> + while (memcg) {
Please use a do-while loop, the same as in shrink_node_memcgs, and change
the goto next below to continue, similar to shrink_node_memcgs.
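Something along these lines, mirroring shrink_node_memcgs() (just a rough
sketch; the protection, low-event and abort handling stay as in your patch):

	memcg = mem_cgroup_iter(target, NULL, partial);
	do {
		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);

		cond_resched();

		mem_cgroup_calculate_protection(target, memcg);

		/* continue re-evaluates the while condition, advancing the iterator */
		if (mem_cgroup_below_min(target, memcg))
			continue;

		/* ... low protection, shrink_one(), abort checks ... */
	} while ((memcg = mem_cgroup_iter(target, memcg, partial)));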
> + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
>
> - mem_cgroup_put(memcg);
> - memcg = NULL;
> + cond_resched();
>
> - if (gen != READ_ONCE(lrugen->gen))
> - continue;
> + mem_cgroup_calculate_protection(target, memcg);
>
> - lruvec = container_of(lrugen, struct lruvec, lrugen);
> - memcg = lruvec_memcg(lruvec);
> + if (mem_cgroup_below_min(target, memcg))
> + goto next;
>
> - if (!mem_cgroup_tryget(memcg)) {
> - lru_gen_release_memcg(memcg);
> - memcg = NULL;
> - continue;
> + if (mem_cgroup_below_low(target, memcg)) {
> + if (!sc->memcg_low_reclaim) {
> + sc->memcg_low_skipped = 1;
> + goto next;
> + }
> + memcg_memory_event(memcg, MEMCG_LOW);
> }
>
> - rcu_read_unlock();
> + shrink_one(lruvec, sc);
>
> - op = shrink_one(lruvec, sc);
> -
> - rcu_read_lock();
> -
> - if (should_abort_scan(lruvec, sc))
> + if (should_abort_scan(lruvec, sc)) {
> + if (cookie)
> + mem_cgroup_iter_break(target, memcg);
> break;
This seems buggy as we may break the loop without calling
mem_cgroup_iter_break(). I think for kswapd the cookie will be NULL and
if should_abort_scan() returns true, we will break the loop without
calling mem_cgroup_iter_break() and will leak a reference to memcg.
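In other words, the reference probably needs to be dropped on every early
exit, regardless of whether a cookie is used, e.g. (sketch):

		if (should_abort_scan(lruvec, sc)) {
			/* releases the css reference held by mem_cgroup_iter() */
			mem_cgroup_iter_break(target, memcg);
			break;
		}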
* Re: [PATCH -next 1/5] mm/mglru: use mem_cgroup_iter for global reclaim
2025-12-22 3:12 ` Shakeel Butt
@ 2025-12-22 7:27 ` Chen Ridong
2025-12-22 21:18 ` Shakeel Butt
0 siblings, 1 reply; 25+ messages in thread
From: Chen Ridong @ 2025-12-22 7:27 UTC (permalink / raw)
To: Shakeel Butt
Cc: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet, hannes,
roman.gushchin, muchun.song, zhengqi.arch, linux-mm, linux-doc,
linux-kernel, cgroups, lujialin4, zhongjinji
On 2025/12/22 11:12, Shakeel Butt wrote:
> On Tue, Dec 09, 2025 at 01:25:53AM +0000, Chen Ridong wrote:
>> From: Chen Ridong <chenridong@huawei.com>
>>
>> The memcg LRU was originally introduced for global reclaim to enhance
>> scalability. However, its implementation complexity has led to performance
>> regressions when dealing with a large number of memory cgroups [1].
>>
>> As suggested by Johannes [1], this patch adopts mem_cgroup_iter with
>> cookie-based iteration for global reclaim, aligning with the approach
>> already used in shrink_node_memcgs. This simplification removes the
>> dedicated memcg LRU tracking while maintaining the core functionality.
>>
>> It performed a stress test based on Yu Zhao's methodology [2] on a
>> 1 TB, 4-node NUMA system. The results are summarized below:
>>
>> pgsteal:
>> memcg LRU memcg iter
>> stddev(pgsteal) / mean(pgsteal) 106.03% 93.20%
>> sum(pgsteal) / sum(requested) 98.10% 99.28%
>>
>> workingset_refault_anon:
>> memcg LRU memcg iter
>> stddev(refault) / mean(refault) 193.97% 134.67%
>> sum(refault) 1963229 2027567
>>
>> The new implementation shows a clear fairness improvement, reducing the
>> standard deviation relative to the mean by 12.8 percentage points. The
>> pgsteal ratio is also closer to 100%. Refault counts increased by 3.2%
>> (from 1,963,229 to 2,027,567).
>>
>> The primary benefits of this change are:
>> 1. Simplified codebase by removing custom memcg LRU infrastructure
>> 2. Improved fairness in memory reclaim across multiple cgroups
>> 3. Better performance when creating many memory cgroups
>>
>> [1] https://lore.kernel.org/r/20251126171513.GC135004@cmpxchg.org
>> [2] https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com
>> Suggested-by: Johannes Weiner <hannes@cmxpchg.org>
>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
>> Acked-by: Johannes Weiner <hannes@cmxpchg.org>
>> ---
>> mm/vmscan.c | 117 ++++++++++++++++------------------------------------
>> 1 file changed, 36 insertions(+), 81 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index fddd168a9737..70b0e7e5393c 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -4895,27 +4895,14 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>> return nr_to_scan < 0;
>> }
>>
>> -static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
>> +static void shrink_one(struct lruvec *lruvec, struct scan_control *sc)
>> {
>> - bool success;
>> unsigned long scanned = sc->nr_scanned;
>> unsigned long reclaimed = sc->nr_reclaimed;
>> - struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>> struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>> + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>
>> - /* lru_gen_age_node() called mem_cgroup_calculate_protection() */
>> - if (mem_cgroup_below_min(NULL, memcg))
>> - return MEMCG_LRU_YOUNG;
>> -
>> - if (mem_cgroup_below_low(NULL, memcg)) {
>> - /* see the comment on MEMCG_NR_GENS */
>> - if (READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_TAIL)
>> - return MEMCG_LRU_TAIL;
>> -
>> - memcg_memory_event(memcg, MEMCG_LOW);
>> - }
>> -
>> - success = try_to_shrink_lruvec(lruvec, sc);
>> + try_to_shrink_lruvec(lruvec, sc);
>>
>> shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
>>
>> @@ -4924,86 +4911,55 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
>> sc->nr_reclaimed - reclaimed);
>>
>> flush_reclaim_state(sc);
>> -
>> - if (success && mem_cgroup_online(memcg))
>> - return MEMCG_LRU_YOUNG;
>> -
>> - if (!success && lruvec_is_sizable(lruvec, sc))
>> - return 0;
>> -
>> - /* one retry if offlined or too small */
>> - return READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_TAIL ?
>> - MEMCG_LRU_TAIL : MEMCG_LRU_YOUNG;
>> }
>>
>> static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
>> {
>> - int op;
>> - int gen;
>> - int bin;
>> - int first_bin;
>> - struct lruvec *lruvec;
>> - struct lru_gen_folio *lrugen;
>> + struct mem_cgroup *target = sc->target_mem_cgroup;
>> + struct mem_cgroup_reclaim_cookie reclaim = {
>> + .pgdat = pgdat,
>> + };
>> + struct mem_cgroup_reclaim_cookie *cookie = &reclaim;
>
> Please keep the naming same as shrink_node_memcgs i.e. use 'partial'
> here.
>
Thank you, will update.
>> struct mem_cgroup *memcg;
>> - struct hlist_nulls_node *pos;
>>
>> - gen = get_memcg_gen(READ_ONCE(pgdat->memcg_lru.seq));
>> - bin = first_bin = get_random_u32_below(MEMCG_NR_BINS);
>> -restart:
>> - op = 0;
>> - memcg = NULL;
>> -
>> - rcu_read_lock();
>> + if (current_is_kswapd() || sc->memcg_full_walk)
>> + cookie = NULL;
>>
>> - hlist_nulls_for_each_entry_rcu(lrugen, pos, &pgdat->memcg_lru.fifo[gen][bin], list) {
>> - if (op) {
>> - lru_gen_rotate_memcg(lruvec, op);
>> - op = 0;
>> - }
>> + memcg = mem_cgroup_iter(target, NULL, cookie);
>> + while (memcg) {
>
> Please use the do-while loop same as shrink_node_memcgs and then change
> the goto next below to continue similar to shrink_node_memcgs.
>
Will update.
>> + struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>
>> - mem_cgroup_put(memcg);
>> - memcg = NULL;
>> + cond_resched();
>>
>> - if (gen != READ_ONCE(lrugen->gen))
>> - continue;
>> + mem_cgroup_calculate_protection(target, memcg);
>>
>> - lruvec = container_of(lrugen, struct lruvec, lrugen);
>> - memcg = lruvec_memcg(lruvec);
>> + if (mem_cgroup_below_min(target, memcg))
>> + goto next;
>>
>> - if (!mem_cgroup_tryget(memcg)) {
>> - lru_gen_release_memcg(memcg);
>> - memcg = NULL;
>> - continue;
>> + if (mem_cgroup_below_low(target, memcg)) {
>> + if (!sc->memcg_low_reclaim) {
>> + sc->memcg_low_skipped = 1;
>> + goto next;
>> + }
>> + memcg_memory_event(memcg, MEMCG_LOW);
>> }
>>
>> - rcu_read_unlock();
>> + shrink_one(lruvec, sc);
>>
>> - op = shrink_one(lruvec, sc);
>> -
>> - rcu_read_lock();
>> -
>> - if (should_abort_scan(lruvec, sc))
>> + if (should_abort_scan(lruvec, sc)) {
>> + if (cookie)
>> + mem_cgroup_iter_break(target, memcg);
>> break;
>
> This seems buggy as we may break the loop without calling
> mem_cgroup_iter_break(). I think for kswapd the cookie will be NULL and
> if should_abort_scan() returns true, we will break the loop without
> calling mem_cgroup_iter_break() and will leak a reference to memcg.
>
Thank you for catching that—my mistake.
This also brings up another point: under kswapd, the traditional LRU iterates through all memcgs,
while the multi-gen LRU stops once should_abort_scan() is satisfied (i.e., enough pages have been
reclaimed or the watermarks are met). Shouldn't both behave consistently?
Perhaps we should add should_abort_scan(lruvec, sc) in shrink_node_memcgs for the traditional LRU as
well?
--
Best regards,
Ridong
* Re: [PATCH -next 1/5] mm/mglru: use mem_cgroup_iter for global reclaim
2025-12-22 7:27 ` Chen Ridong
@ 2025-12-22 21:18 ` Shakeel Butt
2025-12-23 0:45 ` Chen Ridong
0 siblings, 1 reply; 25+ messages in thread
From: Shakeel Butt @ 2025-12-22 21:18 UTC (permalink / raw)
To: Chen Ridong
Cc: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet, hannes,
roman.gushchin, muchun.song, zhengqi.arch, linux-mm, linux-doc,
linux-kernel, cgroups, lujialin4, zhongjinji
On Mon, Dec 22, 2025 at 03:27:26PM +0800, Chen Ridong wrote:
>
[...]
>
> >> - if (should_abort_scan(lruvec, sc))
> >> + if (should_abort_scan(lruvec, sc)) {
> >> + if (cookie)
> >> + mem_cgroup_iter_break(target, memcg);
> >> break;
> >
> > This seems buggy as we may break the loop without calling
> > mem_cgroup_iter_break(). I think for kswapd the cookie will be NULL and
> > if should_abort_scan() returns true, we will break the loop without
> > calling mem_cgroup_iter_break() and will leak a reference to memcg.
> >
>
> Thank you for catching that—my mistake.
>
> This also brings up another point: In kswapd, the traditional LRU iterates through all memcgs, but
> stops for the generational LRU (GENLRU) when should_abort_scan is met (i.e., enough pages are
> reclaimed or the watermark is satisfied). Shouldn't both behave consistently?
>
> Perhaps we should add should_abort_scan(lruvec, sc) in shrink_node_memcgs for the traditional LRU as
> well?
We should definitely discuss should_abort_scan() for traditional reclaim,
but to keep things simple, let's do that after this series. For
now, follow Johannes' suggestion of lru_gen_should_abort_scan().
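One possible shape for that helper, assuming it simply gates the existing
should_abort_scan() so that only MGLRU bails out of the per-memcg walk early
(a sketch, not the final form; a !CONFIG_LRU_GEN stub returning false would
also be needed):

	static bool lru_gen_should_abort_scan(struct lruvec *lruvec,
					      struct scan_control *sc)
	{
		if (!lru_gen_enabled())
			return false;

		return should_abort_scan(lruvec, sc);
	}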
* Re: [PATCH -next 1/5] mm/mglru: use mem_cgroup_iter for global reclaim
2025-12-22 21:18 ` Shakeel Butt
@ 2025-12-23 0:45 ` Chen Ridong
0 siblings, 0 replies; 25+ messages in thread
From: Chen Ridong @ 2025-12-23 0:45 UTC (permalink / raw)
To: Shakeel Butt
Cc: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet, hannes,
roman.gushchin, muchun.song, zhengqi.arch, linux-mm, linux-doc,
linux-kernel, cgroups, lujialin4, zhongjinji
On 2025/12/23 5:18, Shakeel Butt wrote:
> On Mon, Dec 22, 2025 at 03:27:26PM +0800, Chen Ridong wrote:
>>
> [...]
>>
>>>> - if (should_abort_scan(lruvec, sc))
>>>> + if (should_abort_scan(lruvec, sc)) {
>>>> + if (cookie)
>>>> + mem_cgroup_iter_break(target, memcg);
>>>> break;
>>>
>>> This seems buggy as we may break the loop without calling
>>> mem_cgroup_iter_break(). I think for kswapd the cookie will be NULL and
>>> if should_abort_scan() returns true, we will break the loop without
>>> calling mem_cgroup_iter_break() and will leak a reference to memcg.
>>>
>>
>> Thank you for catching that—my mistake.
>>
>> This also brings up another point: In kswapd, the traditional LRU iterates through all memcgs, but
>> stops for the generational LRU (GENLRU) when should_abort_scan is met (i.e., enough pages are
>> reclaimed or the watermark is satisfied). Shouldn't both behave consistently?
>>
>> Perhaps we should add should_abort_scan(lruvec, sc) in shrink_node_memcgs for the traditional LRU as
>> well?
>
> We definitely should discuss about should_abort_scan() for traditional
> reclaim but to keep things simple, let's do that after this series. For
> now, follow Johannes' suggestion of lru_gen_should_abort_scan().
>
Okay, understood.
--
Best regards,
Ridong
* [PATCH -next 2/5] mm/mglru: remove memcg lru
2025-12-09 1:25 [PATCH -next 0/5] mm/mglru: remove memcg lru Chen Ridong
2025-12-09 1:25 ` [PATCH -next 1/5] mm/mglru: use mem_cgroup_iter for global reclaim Chen Ridong
@ 2025-12-09 1:25 ` Chen Ridong
2025-12-22 3:24 ` Shakeel Butt
2025-12-09 1:25 ` [PATCH -next 3/5] mm/mglru: extend shrink_one for both lrugen and non-lrugen Chen Ridong
` (4 subsequent siblings)
6 siblings, 1 reply; 25+ messages in thread
From: Chen Ridong @ 2025-12-09 1:25 UTC (permalink / raw)
To: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet, hannes,
roman.gushchin, shakeel.butt, muchun.song, zhengqi.arch
Cc: linux-mm, linux-doc, linux-kernel, cgroups, lujialin4,
chenridong, zhongjinji
From: Chen Ridong <chenridong@huawei.com>
Now that the previous patch has switched global reclaim to use
mem_cgroup_iter, the specialized memcg LRU infrastructure is no longer
needed. This patch removes all of the related code.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
Documentation/mm/multigen_lru.rst | 30 ------
include/linux/mmzone.h | 89 -----------------
mm/memcontrol-v1.c | 6 --
mm/memcontrol.c | 4 -
mm/mm_init.c | 1 -
mm/vmscan.c | 153 +-----------------------------
6 files changed, 1 insertion(+), 282 deletions(-)
diff --git a/Documentation/mm/multigen_lru.rst b/Documentation/mm/multigen_lru.rst
index 52ed5092022f..bf8547e2f592 100644
--- a/Documentation/mm/multigen_lru.rst
+++ b/Documentation/mm/multigen_lru.rst
@@ -220,36 +220,6 @@ time domain because a CPU can scan pages at different rates under
varying memory pressure. It calculates a moving average for each new
generation to avoid being permanently locked in a suboptimal state.
-Memcg LRU
----------
-An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
-since each node and memcg combination has an LRU of folios (see
-``mem_cgroup_lruvec()``). Its goal is to improve the scalability of
-global reclaim, which is critical to system-wide memory overcommit in
-data centers. Note that memcg LRU only applies to global reclaim.
-
-The basic structure of an memcg LRU can be understood by an analogy to
-the active/inactive LRU (of folios):
-
-1. It has the young and the old (generations), i.e., the counterparts
- to the active and the inactive;
-2. The increment of ``max_seq`` triggers promotion, i.e., the
- counterpart to activation;
-3. Other events trigger similar operations, e.g., offlining an memcg
- triggers demotion, i.e., the counterpart to deactivation.
-
-In terms of global reclaim, it has two distinct features:
-
-1. Sharding, which allows each thread to start at a random memcg (in
- the old generation) and improves parallelism;
-2. Eventual fairness, which allows direct reclaim to bail out at will
- and reduces latency without affecting fairness over some time.
-
-In terms of traversing memcgs during global reclaim, it improves the
-best-case complexity from O(n) to O(1) and does not affect the
-worst-case complexity O(n). Therefore, on average, it has a sublinear
-complexity.
-
Summary
-------
The multi-gen LRU (of folios) can be disassembled into the following
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 75ef7c9f9307..49952301ff3b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -509,12 +509,6 @@ struct lru_gen_folio {
atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
/* whether the multi-gen LRU is enabled */
bool enabled;
- /* the memcg generation this lru_gen_folio belongs to */
- u8 gen;
- /* the list segment this lru_gen_folio belongs to */
- u8 seg;
- /* per-node lru_gen_folio list for global reclaim */
- struct hlist_nulls_node list;
};
enum {
@@ -558,79 +552,14 @@ struct lru_gen_mm_walk {
bool force_scan;
};
-/*
- * For each node, memcgs are divided into two generations: the old and the
- * young. For each generation, memcgs are randomly sharded into multiple bins
- * to improve scalability. For each bin, the hlist_nulls is virtually divided
- * into three segments: the head, the tail and the default.
- *
- * An onlining memcg is added to the tail of a random bin in the old generation.
- * The eviction starts at the head of a random bin in the old generation. The
- * per-node memcg generation counter, whose reminder (mod MEMCG_NR_GENS) indexes
- * the old generation, is incremented when all its bins become empty.
- *
- * There are four operations:
- * 1. MEMCG_LRU_HEAD, which moves a memcg to the head of a random bin in its
- * current generation (old or young) and updates its "seg" to "head";
- * 2. MEMCG_LRU_TAIL, which moves a memcg to the tail of a random bin in its
- * current generation (old or young) and updates its "seg" to "tail";
- * 3. MEMCG_LRU_OLD, which moves a memcg to the head of a random bin in the old
- * generation, updates its "gen" to "old" and resets its "seg" to "default";
- * 4. MEMCG_LRU_YOUNG, which moves a memcg to the tail of a random bin in the
- * young generation, updates its "gen" to "young" and resets its "seg" to
- * "default".
- *
- * The events that trigger the above operations are:
- * 1. Exceeding the soft limit, which triggers MEMCG_LRU_HEAD;
- * 2. The first attempt to reclaim a memcg below low, which triggers
- * MEMCG_LRU_TAIL;
- * 3. The first attempt to reclaim a memcg offlined or below reclaimable size
- * threshold, which triggers MEMCG_LRU_TAIL;
- * 4. The second attempt to reclaim a memcg offlined or below reclaimable size
- * threshold, which triggers MEMCG_LRU_YOUNG;
- * 5. Attempting to reclaim a memcg below min, which triggers MEMCG_LRU_YOUNG;
- * 6. Finishing the aging on the eviction path, which triggers MEMCG_LRU_YOUNG;
- * 7. Offlining a memcg, which triggers MEMCG_LRU_OLD.
- *
- * Notes:
- * 1. Memcg LRU only applies to global reclaim, and the round-robin incrementing
- * of their max_seq counters ensures the eventual fairness to all eligible
- * memcgs. For memcg reclaim, it still relies on mem_cgroup_iter().
- * 2. There are only two valid generations: old (seq) and young (seq+1).
- * MEMCG_NR_GENS is set to three so that when reading the generation counter
- * locklessly, a stale value (seq-1) does not wraparound to young.
- */
-#define MEMCG_NR_GENS 3
-#define MEMCG_NR_BINS 8
-
-struct lru_gen_memcg {
- /* the per-node memcg generation counter */
- unsigned long seq;
- /* each memcg has one lru_gen_folio per node */
- unsigned long nr_memcgs[MEMCG_NR_GENS];
- /* per-node lru_gen_folio list for global reclaim */
- struct hlist_nulls_head fifo[MEMCG_NR_GENS][MEMCG_NR_BINS];
- /* protects the above */
- spinlock_t lock;
-};
-
-void lru_gen_init_pgdat(struct pglist_data *pgdat);
void lru_gen_init_lruvec(struct lruvec *lruvec);
bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
void lru_gen_init_memcg(struct mem_cgroup *memcg);
void lru_gen_exit_memcg(struct mem_cgroup *memcg);
-void lru_gen_online_memcg(struct mem_cgroup *memcg);
-void lru_gen_offline_memcg(struct mem_cgroup *memcg);
-void lru_gen_release_memcg(struct mem_cgroup *memcg);
-void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid);
#else /* !CONFIG_LRU_GEN */
-static inline void lru_gen_init_pgdat(struct pglist_data *pgdat)
-{
-}
-
static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
{
}
@@ -648,22 +577,6 @@ static inline void lru_gen_exit_memcg(struct mem_cgroup *memcg)
{
}
-static inline void lru_gen_online_memcg(struct mem_cgroup *memcg)
-{
-}
-
-static inline void lru_gen_offline_memcg(struct mem_cgroup *memcg)
-{
-}
-
-static inline void lru_gen_release_memcg(struct mem_cgroup *memcg)
-{
-}
-
-static inline void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid)
-{
-}
-
#endif /* CONFIG_LRU_GEN */
struct lruvec {
@@ -1503,8 +1416,6 @@ typedef struct pglist_data {
#ifdef CONFIG_LRU_GEN
/* kswap mm walk data */
struct lru_gen_mm_walk mm_walk;
- /* lru_gen_folio list */
- struct lru_gen_memcg memcg_lru;
#endif
CACHELINE_PADDING(_pad2_);
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 6eed14bff742..8f41e72ae7f0 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -182,12 +182,6 @@ static void memcg1_update_tree(struct mem_cgroup *memcg, int nid)
struct mem_cgroup_per_node *mz;
struct mem_cgroup_tree_per_node *mctz;
- if (lru_gen_enabled()) {
- if (soft_limit_excess(memcg))
- lru_gen_soft_reclaim(memcg, nid);
- return;
- }
-
mctz = soft_limit_tree.rb_tree_per_node[nid];
if (!mctz)
return;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index be810c1fbfc3..ab3ebecb5ec7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3874,8 +3874,6 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
if (unlikely(mem_cgroup_is_root(memcg)) && !mem_cgroup_disabled())
queue_delayed_work(system_unbound_wq, &stats_flush_dwork,
FLUSH_TIME);
- lru_gen_online_memcg(memcg);
-
/* Online state pins memcg ID, memcg ID pins CSS */
refcount_set(&memcg->id.ref, 1);
css_get(css);
@@ -3915,7 +3913,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
reparent_deferred_split_queue(memcg);
reparent_shrinker_deferred(memcg);
wb_memcg_offline(memcg);
- lru_gen_offline_memcg(memcg);
drain_all_stock(memcg);
@@ -3927,7 +3924,6 @@ static void mem_cgroup_css_released(struct cgroup_subsys_state *css)
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
invalidate_reclaim_iterators(memcg);
- lru_gen_release_memcg(memcg);
}
static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index fc2a6f1e518f..6e5e1fe6ff31 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1745,7 +1745,6 @@ static void __init free_area_init_node(int nid)
pgdat_set_deferred_range(pgdat);
free_area_init_core(pgdat);
- lru_gen_init_pgdat(pgdat);
}
/* Any regular or high memory on that node ? */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 70b0e7e5393c..584f41eb4c14 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2698,9 +2698,6 @@ static bool should_clear_pmd_young(void)
#define for_each_evictable_type(type, swappiness) \
for ((type) = min_type(swappiness); (type) <= max_type(swappiness); (type)++)
-#define get_memcg_gen(seq) ((seq) % MEMCG_NR_GENS)
-#define get_memcg_bin(bin) ((bin) % MEMCG_NR_BINS)
-
static struct lruvec *get_lruvec(struct mem_cgroup *memcg, int nid)
{
struct pglist_data *pgdat = NODE_DATA(nid);
@@ -4287,140 +4284,6 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
return true;
}
-/******************************************************************************
- * memcg LRU
- ******************************************************************************/
-
-/* see the comment on MEMCG_NR_GENS */
-enum {
- MEMCG_LRU_NOP,
- MEMCG_LRU_HEAD,
- MEMCG_LRU_TAIL,
- MEMCG_LRU_OLD,
- MEMCG_LRU_YOUNG,
-};
-
-static void lru_gen_rotate_memcg(struct lruvec *lruvec, int op)
-{
- int seg;
- int old, new;
- unsigned long flags;
- int bin = get_random_u32_below(MEMCG_NR_BINS);
- struct pglist_data *pgdat = lruvec_pgdat(lruvec);
-
- spin_lock_irqsave(&pgdat->memcg_lru.lock, flags);
-
- VM_WARN_ON_ONCE(hlist_nulls_unhashed(&lruvec->lrugen.list));
-
- seg = 0;
- new = old = lruvec->lrugen.gen;
-
- /* see the comment on MEMCG_NR_GENS */
- if (op == MEMCG_LRU_HEAD)
- seg = MEMCG_LRU_HEAD;
- else if (op == MEMCG_LRU_TAIL)
- seg = MEMCG_LRU_TAIL;
- else if (op == MEMCG_LRU_OLD)
- new = get_memcg_gen(pgdat->memcg_lru.seq);
- else if (op == MEMCG_LRU_YOUNG)
- new = get_memcg_gen(pgdat->memcg_lru.seq + 1);
- else
- VM_WARN_ON_ONCE(true);
-
- WRITE_ONCE(lruvec->lrugen.seg, seg);
- WRITE_ONCE(lruvec->lrugen.gen, new);
-
- hlist_nulls_del_rcu(&lruvec->lrugen.list);
-
- if (op == MEMCG_LRU_HEAD || op == MEMCG_LRU_OLD)
- hlist_nulls_add_head_rcu(&lruvec->lrugen.list, &pgdat->memcg_lru.fifo[new][bin]);
- else
- hlist_nulls_add_tail_rcu(&lruvec->lrugen.list, &pgdat->memcg_lru.fifo[new][bin]);
-
- pgdat->memcg_lru.nr_memcgs[old]--;
- pgdat->memcg_lru.nr_memcgs[new]++;
-
- if (!pgdat->memcg_lru.nr_memcgs[old] && old == get_memcg_gen(pgdat->memcg_lru.seq))
- WRITE_ONCE(pgdat->memcg_lru.seq, pgdat->memcg_lru.seq + 1);
-
- spin_unlock_irqrestore(&pgdat->memcg_lru.lock, flags);
-}
-
-#ifdef CONFIG_MEMCG
-
-void lru_gen_online_memcg(struct mem_cgroup *memcg)
-{
- int gen;
- int nid;
- int bin = get_random_u32_below(MEMCG_NR_BINS);
-
- for_each_node(nid) {
- struct pglist_data *pgdat = NODE_DATA(nid);
- struct lruvec *lruvec = get_lruvec(memcg, nid);
-
- spin_lock_irq(&pgdat->memcg_lru.lock);
-
- VM_WARN_ON_ONCE(!hlist_nulls_unhashed(&lruvec->lrugen.list));
-
- gen = get_memcg_gen(pgdat->memcg_lru.seq);
-
- lruvec->lrugen.gen = gen;
-
- hlist_nulls_add_tail_rcu(&lruvec->lrugen.list, &pgdat->memcg_lru.fifo[gen][bin]);
- pgdat->memcg_lru.nr_memcgs[gen]++;
-
- spin_unlock_irq(&pgdat->memcg_lru.lock);
- }
-}
-
-void lru_gen_offline_memcg(struct mem_cgroup *memcg)
-{
- int nid;
-
- for_each_node(nid) {
- struct lruvec *lruvec = get_lruvec(memcg, nid);
-
- lru_gen_rotate_memcg(lruvec, MEMCG_LRU_OLD);
- }
-}
-
-void lru_gen_release_memcg(struct mem_cgroup *memcg)
-{
- int gen;
- int nid;
-
- for_each_node(nid) {
- struct pglist_data *pgdat = NODE_DATA(nid);
- struct lruvec *lruvec = get_lruvec(memcg, nid);
-
- spin_lock_irq(&pgdat->memcg_lru.lock);
-
- if (hlist_nulls_unhashed(&lruvec->lrugen.list))
- goto unlock;
-
- gen = lruvec->lrugen.gen;
-
- hlist_nulls_del_init_rcu(&lruvec->lrugen.list);
- pgdat->memcg_lru.nr_memcgs[gen]--;
-
- if (!pgdat->memcg_lru.nr_memcgs[gen] && gen == get_memcg_gen(pgdat->memcg_lru.seq))
- WRITE_ONCE(pgdat->memcg_lru.seq, pgdat->memcg_lru.seq + 1);
-unlock:
- spin_unlock_irq(&pgdat->memcg_lru.lock);
- }
-}
-
-void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid)
-{
- struct lruvec *lruvec = get_lruvec(memcg, nid);
-
- /* see the comment on MEMCG_NR_GENS */
- if (READ_ONCE(lruvec->lrugen.seg) != MEMCG_LRU_HEAD)
- lru_gen_rotate_memcg(lruvec, MEMCG_LRU_HEAD);
-}
-
-#endif /* CONFIG_MEMCG */
-
/******************************************************************************
* the eviction
******************************************************************************/
@@ -5613,18 +5476,6 @@ static const struct file_operations lru_gen_ro_fops = {
* initialization
******************************************************************************/
-void lru_gen_init_pgdat(struct pglist_data *pgdat)
-{
- int i, j;
-
- spin_lock_init(&pgdat->memcg_lru.lock);
-
- for (i = 0; i < MEMCG_NR_GENS; i++) {
- for (j = 0; j < MEMCG_NR_BINS; j++)
- INIT_HLIST_NULLS_HEAD(&pgdat->memcg_lru.fifo[i][j], i);
- }
-}
-
void lru_gen_init_lruvec(struct lruvec *lruvec)
{
int i;
@@ -5671,9 +5522,7 @@ void lru_gen_exit_memcg(struct mem_cgroup *memcg)
struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
- sizeof(lruvec->lrugen.nr_pages)));
-
- lruvec->lrugen.list.next = LIST_POISON1;
+ sizeof(lruvec->lrugen.nr_pages)));
if (!mm_state)
continue;
--
2.34.1
* Re: [PATCH -next 2/5] mm/mglru: remove memcg lru
2025-12-09 1:25 ` [PATCH -next 2/5] mm/mglru: remove memcg lru Chen Ridong
@ 2025-12-22 3:24 ` Shakeel Butt
0 siblings, 0 replies; 25+ messages in thread
From: Shakeel Butt @ 2025-12-22 3:24 UTC (permalink / raw)
To: Chen Ridong
Cc: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet, hannes,
roman.gushchin, muchun.song, zhengqi.arch, linux-mm, linux-doc,
linux-kernel, cgroups, lujialin4, zhongjinji
On Tue, Dec 09, 2025 at 01:25:54AM +0000, Chen Ridong wrote:
> From: Chen Ridong <chenridong@huawei.com>
>
> Now that the previous patch has switched global reclaim to use
> mem_cgroup_iter, the specialized memcg LRU infrastructure is no longer
> needed. This patch removes all related code:
>
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
* [PATCH -next 3/5] mm/mglru: extend shrink_one for both lrugen and non-lrugen
2025-12-09 1:25 [PATCH -next 0/5] mm/mglru: remove memcg lru Chen Ridong
2025-12-09 1:25 ` [PATCH -next 1/5] mm/mglru: use mem_cgroup_iter for global reclaim Chen Ridong
2025-12-09 1:25 ` [PATCH -next 2/5] mm/mglru: remove memcg lru Chen Ridong
@ 2025-12-09 1:25 ` Chen Ridong
2025-12-12 2:55 ` kernel test robot
` (2 more replies)
2025-12-09 1:25 ` [PATCH -next 4/5] mm/mglru: combine shrink_many into shrink_node_memcgs Chen Ridong
` (3 subsequent siblings)
6 siblings, 3 replies; 25+ messages in thread
From: Chen Ridong @ 2025-12-09 1:25 UTC (permalink / raw)
To: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet, hannes,
roman.gushchin, shakeel.butt, muchun.song, zhengqi.arch
Cc: linux-mm, linux-doc, linux-kernel, cgroups, lujialin4,
chenridong, zhongjinji
From: Chen Ridong <chenridong@huawei.com>
Currently, flush_reclaim_state is placed differently between
shrink_node_memcgs and shrink_many. shrink_many (only used for gen-LRU)
calls it after each lruvec is shrunk, while shrink_node_memcgs calls it
only after all lruvecs have been shrunk.
This patch moves flush_reclaim_state into shrink_node_memcgs and calls it
after each lruvec. This unifies the behavior and is reasonable because:
1. flush_reclaim_state adds current->reclaim_state->reclaimed to
sc->nr_reclaimed.
2. For non-MGLRU root reclaim, this can help stop the iteration earlier
when nr_to_reclaim is reached.
3. For non-root reclaim, the effect is negligible since flush_reclaim_state
does nothing in that case.
After moving flush_reclaim_state into shrink_node_memcgs, shrink_one can be
extended to support both lrugen and non-lrugen paths. It will call
try_to_shrink_lruvec for lrugen root reclaim and shrink_lruvec otherwise.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
mm/vmscan.c | 57 +++++++++++++++++++++--------------------------------
1 file changed, 23 insertions(+), 34 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 584f41eb4c14..795f5ebd9341 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4758,23 +4758,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
return nr_to_scan < 0;
}
-static void shrink_one(struct lruvec *lruvec, struct scan_control *sc)
-{
- unsigned long scanned = sc->nr_scanned;
- unsigned long reclaimed = sc->nr_reclaimed;
- struct pglist_data *pgdat = lruvec_pgdat(lruvec);
- struct mem_cgroup *memcg = lruvec_memcg(lruvec);
-
- try_to_shrink_lruvec(lruvec, sc);
-
- shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
-
- if (!sc->proactive)
- vmpressure(sc->gfp_mask, memcg, false, sc->nr_scanned - scanned,
- sc->nr_reclaimed - reclaimed);
-
- flush_reclaim_state(sc);
-}
+static void shrink_one(struct lruvec *lruvec, struct scan_control *sc);
static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
{
@@ -5760,6 +5744,27 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
return inactive_lru_pages > pages_for_compaction;
}
+static void shrink_one(struct lruvec *lruvec, struct scan_control *sc)
+{
+ unsigned long scanned = sc->nr_scanned;
+ unsigned long reclaimed = sc->nr_reclaimed;
+ struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+
+ if (lru_gen_enabled() && root_reclaim(sc))
+ try_to_shrink_lruvec(lruvec, sc);
+ else
+ shrink_lruvec(lruvec, sc);
+
+ shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
+
+ if (!sc->proactive)
+ vmpressure(sc->gfp_mask, memcg, false, sc->nr_scanned - scanned,
+ sc->nr_reclaimed - reclaimed);
+
+ flush_reclaim_state(sc);
+}
+
static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
{
struct mem_cgroup *target_memcg = sc->target_mem_cgroup;
@@ -5784,8 +5789,6 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
memcg = mem_cgroup_iter(target_memcg, NULL, partial);
do {
struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
- unsigned long reclaimed;
- unsigned long scanned;
/*
* This loop can become CPU-bound when target memcgs
@@ -5817,19 +5820,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
memcg_memory_event(memcg, MEMCG_LOW);
}
- reclaimed = sc->nr_reclaimed;
- scanned = sc->nr_scanned;
-
- shrink_lruvec(lruvec, sc);
-
- shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
- sc->priority);
-
- /* Record the group's reclaim efficiency */
- if (!sc->proactive)
- vmpressure(sc->gfp_mask, memcg, false,
- sc->nr_scanned - scanned,
- sc->nr_reclaimed - reclaimed);
+ shrink_one(lruvec, sc);
/* If partial walks are allowed, bail once goal is reached */
if (partial && sc->nr_reclaimed >= sc->nr_to_reclaim) {
@@ -5863,8 +5854,6 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
shrink_node_memcgs(pgdat, sc);
- flush_reclaim_state(sc);
-
nr_node_reclaimed = sc->nr_reclaimed - nr_reclaimed;
/* Record the subtree's reclaim efficiency */
--
2.34.1
* Re: [PATCH -next 3/5] mm/mglru: extend shrink_one for both lrugen and non-lrugen
2025-12-09 1:25 ` [PATCH -next 3/5] mm/mglru: extend shrink_one for both lrugen and non-lrugen Chen Ridong
@ 2025-12-12 2:55 ` kernel test robot
2025-12-12 9:53 ` Chen Ridong
2025-12-15 21:13 ` Johannes Weiner
2025-12-22 3:49 ` Shakeel Butt
2 siblings, 1 reply; 25+ messages in thread
From: kernel test robot @ 2025-12-12 2:55 UTC (permalink / raw)
To: Chen Ridong, akpm, axelrasmussen, yuanchu, weixugc, david,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
corbet, hannes, roman.gushchin, shakeel.butt, muchun.song,
zhengqi.arch
Cc: llvm, oe-kbuild-all, linux-mm, linux-doc, linux-kernel, cgroups,
lujialin4, chenridong, zhongjinji
Hi Chen,
kernel test robot noticed the following build warnings:
[auto build test WARNING on akpm-mm/mm-everything]
url: https://github.com/intel-lab-lkp/linux/commits/Chen-Ridong/mm-mglru-use-mem_cgroup_iter-for-global-reclaim/20251209-094913
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20251209012557.1949239-4-chenridong%40huaweicloud.com
patch subject: [PATCH -next 3/5] mm/mglru: extend shrink_one for both lrugen and non-lrugen
config: x86_64-randconfig-004-20251212 (https://download.01.org/0day-ci/archive/20251212/202512121027.03z9qd08-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251212/202512121027.03z9qd08-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512121027.03z9qd08-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> mm/vmscan.o: warning: objtool: shrink_one+0xeb2: sibling call from callable instruction with modified stack frame
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH -next 3/5] mm/mglru: extend shrink_one for both lrugen and non-lrugen
2025-12-12 2:55 ` kernel test robot
@ 2025-12-12 9:53 ` Chen Ridong
0 siblings, 0 replies; 25+ messages in thread
From: Chen Ridong @ 2025-12-12 9:53 UTC (permalink / raw)
To: kernel test robot, akpm, axelrasmussen, yuanchu, weixugc, david,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
corbet, hannes, roman.gushchin, shakeel.butt, muchun.song,
zhengqi.arch
Cc: llvm, oe-kbuild-all, linux-mm, linux-doc, linux-kernel, cgroups,
lujialin4, zhongjinji
On 2025/12/12 10:55, kernel test robot wrote:
> Hi Chen,
>
> kernel test robot noticed the following build warnings:
>
> [auto build test WARNING on akpm-mm/mm-everything]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Chen-Ridong/mm-mglru-use-mem_cgroup_iter-for-global-reclaim/20251209-094913
> base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link: https://lore.kernel.org/r/20251209012557.1949239-4-chenridong%40huaweicloud.com
> patch subject: [PATCH -next 3/5] mm/mglru: extend shrink_one for both lrugen and non-lrugen
> config: x86_64-randconfig-004-20251212 (https://download.01.org/0day-ci/archive/20251212/202512121027.03z9qd08-lkp@intel.com/config)
> compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251212/202512121027.03z9qd08-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202512121027.03z9qd08-lkp@intel.com/
>
> All warnings (new ones prefixed by >>):
>
>>> mm/vmscan.o: warning: objtool: shrink_one+0xeb2: sibling call from callable instruction with modified stack frame
>
This is the first time I've encountered this warning. While adding
`STACK_FRAME_NON_STANDARD(shrink_one)` resolves it, I noticed this approach isn't widely used in the
codebase. Is this the standard solution, or are there better alternatives?
I've tested that the warning persists even when `shrink_one` is simplified to only call `shrink_lruvec`:
```
static void shrink_one(struct lruvec *lruvec, struct scan_control *sc)
{
shrink_lruvec(lruvec, sc);
}
```
How can we properly avoid this warning without using STACK_FRAME_NON_STANDARD?
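For reference, the annotation mentioned above would look roughly like this
(whether it is the right tool here is exactly the open question);
STACK_FRAME_NON_STANDARD() comes from <linux/objtool.h> and tells objtool to
skip stack-frame validation for the function:

```
#include <linux/objtool.h>

static void shrink_one(struct lruvec *lruvec, struct scan_control *sc)
{
	shrink_lruvec(lruvec, sc);
}
/* suppress objtool's stack-frame validation for shrink_one() */
STACK_FRAME_NON_STANDARD(shrink_one);
```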
--
Best regards,
Ridong
* Re: [PATCH -next 3/5] mm/mglru: extend shrink_one for both lrugen and non-lrugen
2025-12-09 1:25 ` [PATCH -next 3/5] mm/mglru: extend shrink_one for both lrugen and non-lrugen Chen Ridong
2025-12-12 2:55 ` kernel test robot
@ 2025-12-15 21:13 ` Johannes Weiner
2025-12-16 1:14 ` Chen Ridong
2025-12-22 3:49 ` Shakeel Butt
2 siblings, 1 reply; 25+ messages in thread
From: Johannes Weiner @ 2025-12-15 21:13 UTC (permalink / raw)
To: Chen Ridong
Cc: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet,
roman.gushchin, shakeel.butt, muchun.song, zhengqi.arch,
linux-mm, linux-doc, linux-kernel, cgroups, lujialin4,
zhongjinji
On Tue, Dec 09, 2025 at 01:25:55AM +0000, Chen Ridong wrote:
> From: Chen Ridong <chenridong@huawei.com>
>
> Currently, flush_reclaim_state is placed differently between
> shrink_node_memcgs and shrink_many. shrink_many (only used for gen-LRU)
> calls it after each lruvec is shrunk, while shrink_node_memcgs calls it
> only after all lruvecs have been shrunk.
>
> This patch moves flush_reclaim_state into shrink_node_memcgs and calls it
> after each lruvec. This unifies the behavior and is reasonable because:
>
> 1. flush_reclaim_state adds current->reclaim_state->reclaimed to
> sc->nr_reclaimed.
> 2. For non-MGLRU root reclaim, this can help stop the iteration earlier
> when nr_to_reclaim is reached.
> 3. For non-root reclaim, the effect is negligible since flush_reclaim_state
> does nothing in that case.
>
> After moving flush_reclaim_state into shrink_node_memcgs, shrink_one can be
> extended to support both lrugen and non-lrugen paths. It will call
> try_to_shrink_lruvec for lrugen root reclaim and shrink_lruvec otherwise.
>
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
> ---
> mm/vmscan.c | 57 +++++++++++++++++++++--------------------------------
> 1 file changed, 23 insertions(+), 34 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 584f41eb4c14..795f5ebd9341 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4758,23 +4758,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> return nr_to_scan < 0;
> }
>
> -static void shrink_one(struct lruvec *lruvec, struct scan_control *sc)
> -{
> - unsigned long scanned = sc->nr_scanned;
> - unsigned long reclaimed = sc->nr_reclaimed;
> - struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> - struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> -
> - try_to_shrink_lruvec(lruvec, sc);
> -
> - shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
> -
> - if (!sc->proactive)
> - vmpressure(sc->gfp_mask, memcg, false, sc->nr_scanned - scanned,
> - sc->nr_reclaimed - reclaimed);
> -
> - flush_reclaim_state(sc);
> -}
> +static void shrink_one(struct lruvec *lruvec, struct scan_control *sc);
>
> static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
> {
> @@ -5760,6 +5744,27 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
> return inactive_lru_pages > pages_for_compaction;
> }
>
> +static void shrink_one(struct lruvec *lruvec, struct scan_control *sc)
> +{
> + unsigned long scanned = sc->nr_scanned;
> + unsigned long reclaimed = sc->nr_reclaimed;
> + struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> +
> + if (lru_gen_enabled() && root_reclaim(sc))
> + try_to_shrink_lruvec(lruvec, sc);
> + else
> + shrink_lruvec(lruvec, sc);
Yikes. So we end up with:
shrink_node_memcgs()
shrink_one()
if lru_gen_enabled && root_reclaim(sc)
try_to_shrink_lruvec(lruvec, sc)
else
shrink_lruvec()
if lru_gen_enabled && !root_reclaim(sc)
lru_gen_shrink_lruvec(lruvec, sc)
try_to_shrink_lruvec()
I think it's doing too much at once. Can you get it into the following
shape:
shrink_node_memcgs()
for each memcg:
if lru_gen_enabled:
lru_gen_shrink_lruvec()
else
shrink_lruvec()
and handle the differences in those two functions? Then look for
overlap one level down, and so forth.
* Re: [PATCH -next 3/5] mm/mglru: extend shrink_one for both lrugen and non-lrugen
2025-12-15 21:13 ` Johannes Weiner
@ 2025-12-16 1:14 ` Chen Ridong
2025-12-22 21:36 ` Shakeel Butt
0 siblings, 1 reply; 25+ messages in thread
From: Chen Ridong @ 2025-12-16 1:14 UTC (permalink / raw)
To: Johannes Weiner
Cc: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet,
roman.gushchin, shakeel.butt, muchun.song, zhengqi.arch,
linux-mm, linux-doc, linux-kernel, cgroups, lujialin4,
zhongjinji
On 2025/12/16 5:13, Johannes Weiner wrote:
> On Tue, Dec 09, 2025 at 01:25:55AM +0000, Chen Ridong wrote:
>> From: Chen Ridong <chenridong@huawei.com>
>>
>> Currently, flush_reclaim_state is placed differently between
>> shrink_node_memcgs and shrink_many. shrink_many (only used for gen-LRU)
>> calls it after each lruvec is shrunk, while shrink_node_memcgs calls it
>> only after all lruvecs have been shrunk.
>>
>> This patch moves flush_reclaim_state into shrink_node_memcgs and calls it
>> after each lruvec. This unifies the behavior and is reasonable because:
>>
>> 1. flush_reclaim_state adds current->reclaim_state->reclaimed to
>> sc->nr_reclaimed.
>> 2. For non-MGLRU root reclaim, this can help stop the iteration earlier
>> when nr_to_reclaim is reached.
>> 3. For non-root reclaim, the effect is negligible since flush_reclaim_state
>> does nothing in that case.
>>
>> After moving flush_reclaim_state into shrink_node_memcgs, shrink_one can be
>> extended to support both lrugen and non-lrugen paths. It will call
>> try_to_shrink_lruvec for lrugen root reclaim and shrink_lruvec otherwise.
>>
>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
>> ---
>> mm/vmscan.c | 57 +++++++++++++++++++++--------------------------------
>> 1 file changed, 23 insertions(+), 34 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 584f41eb4c14..795f5ebd9341 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -4758,23 +4758,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>> return nr_to_scan < 0;
>> }
>>
>> -static void shrink_one(struct lruvec *lruvec, struct scan_control *sc)
>> -{
>> - unsigned long scanned = sc->nr_scanned;
>> - unsigned long reclaimed = sc->nr_reclaimed;
>> - struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>> - struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>> -
>> - try_to_shrink_lruvec(lruvec, sc);
>> -
>> - shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
>> -
>> - if (!sc->proactive)
>> - vmpressure(sc->gfp_mask, memcg, false, sc->nr_scanned - scanned,
>> - sc->nr_reclaimed - reclaimed);
>> -
>> - flush_reclaim_state(sc);
>> -}
>> +static void shrink_one(struct lruvec *lruvec, struct scan_control *sc);
>>
>> static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
>> {
>> @@ -5760,6 +5744,27 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
>> return inactive_lru_pages > pages_for_compaction;
>> }
>>
>> +static void shrink_one(struct lruvec *lruvec, struct scan_control *sc)
>> +{
>> + unsigned long scanned = sc->nr_scanned;
>> + unsigned long reclaimed = sc->nr_reclaimed;
>> + struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>> + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>> +
>> + if (lru_gen_enabled() && root_reclaim(sc))
>> + try_to_shrink_lruvec(lruvec, sc);
>> + else
>> + shrink_lruvec(lruvec, sc);
>
Hi Johannes, thank you for your reply.
> Yikes. So we end up with:
>
> shrink_node_memcgs()
> shrink_one()
> if lru_gen_enabled && root_reclaim(sc)
> try_to_shrink_lruvec(lruvec, sc)
> else
> shrink_lruvec()
> if lru_gen_enabled && !root_reclaim(sc)
> lru_gen_shrink_lruvec(lruvec, sc)
> try_to_shrink_lruvec()
>
> I think it's doing too much at once. Can you get it into the following
> shape:
>
You're absolutely right. This refactoring is indeed what patch 5/5 implements.
With patch 5/5 applied, the flow becomes:
shrink_node_memcgs()
shrink_one()
if lru_gen_enabled
lru_gen_shrink_lruvec --> symmetric with else shrink_lruvec()
if (root_reclaim(sc)) --> handle root reclaim.
try_to_shrink_lruvec()
else
...
try_to_shrink_lruvec()
else
shrink_lruvec()
This matches the structure you described.
One note: shrink_one() is also called from lru_gen_shrink_node() when memcg is disabled, so I
believe it makes sense to keep this helper.
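For reference, shrink_one() would then look roughly like this (just a sketch to
show the shape, not the exact diff; the shrink_slab/vmpressure/flush_reclaim_state
tail is kept as in patch 3/5):

	static void shrink_one(struct lruvec *lruvec, struct scan_control *sc)
	{
		unsigned long scanned = sc->nr_scanned;
		unsigned long reclaimed = sc->nr_reclaimed;
		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
		struct mem_cgroup *memcg = lruvec_memcg(lruvec);

		/* lrugen vs. classic LRU is decided here; root vs. non-root
		 * reclaim is handled inside lru_gen_shrink_lruvec() */
		if (lru_gen_enabled())
			lru_gen_shrink_lruvec(lruvec, sc);
		else
			shrink_lruvec(lruvec, sc);

		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);

		if (!sc->proactive)
			vmpressure(sc->gfp_mask, memcg, false, sc->nr_scanned - scanned,
				   sc->nr_reclaimed - reclaimed);

		flush_reclaim_state(sc);
	}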
> shrink_node_memcgs()
> for each memcg:
> if lru_gen_enabled:
> lru_gen_shrink_lruvec()
> else
> shrink_lruvec()
>
Regarding the patch split, I have kept patches 3/5 and 5/5 separate to make the changes clearer
in each step. Would you prefer that I merge patch 3/5 with patch 5/5, so the full refactoring
appears in one patch?
Looking forward to your guidance.
> and handle the differences in those two functions? Then look for
> overlap one level down, and so forth.
--
Best regards,
Ridong
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH -next 3/5] mm/mglru: extend shrink_one for both lrugen and non-lrugen
2025-12-16 1:14 ` Chen Ridong
@ 2025-12-22 21:36 ` Shakeel Butt
2025-12-23 1:00 ` Chen Ridong
0 siblings, 1 reply; 25+ messages in thread
From: Shakeel Butt @ 2025-12-22 21:36 UTC (permalink / raw)
To: Chen Ridong
Cc: Johannes Weiner, akpm, axelrasmussen, yuanchu, weixugc, david,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
corbet, roman.gushchin, muchun.song, zhengqi.arch, linux-mm,
linux-doc, linux-kernel, cgroups, lujialin4, zhongjinji
On Tue, Dec 16, 2025 at 09:14:45AM +0800, Chen Ridong wrote:
>
>
> On 2025/12/16 5:13, Johannes Weiner wrote:
> > On Tue, Dec 09, 2025 at 01:25:55AM +0000, Chen Ridong wrote:
> >> From: Chen Ridong <chenridong@huawei.com>
> >>
> >> Currently, flush_reclaim_state is placed differently between
> >> shrink_node_memcgs and shrink_many. shrink_many (only used for gen-LRU)
> >> calls it after each lruvec is shrunk, while shrink_node_memcgs calls it
> >> only after all lruvecs have been shrunk.
> >>
> >> This patch moves flush_reclaim_state into shrink_node_memcgs and calls it
> >> after each lruvec. This unifies the behavior and is reasonable because:
> >>
> >> 1. flush_reclaim_state adds current->reclaim_state->reclaimed to
> >> sc->nr_reclaimed.
> >> 2. For non-MGLRU root reclaim, this can help stop the iteration earlier
> >> when nr_to_reclaim is reached.
> >> 3. For non-root reclaim, the effect is negligible since flush_reclaim_state
> >> does nothing in that case.
> >>
> >> After moving flush_reclaim_state into shrink_node_memcgs, shrink_one can be
> >> extended to support both lrugen and non-lrugen paths. It will call
> >> try_to_shrink_lruvec for lrugen root reclaim and shrink_lruvec otherwise.
> >>
> >> Signed-off-by: Chen Ridong <chenridong@huawei.com>
> >> ---
> >> mm/vmscan.c | 57 +++++++++++++++++++++--------------------------------
> >> 1 file changed, 23 insertions(+), 34 deletions(-)
> >>
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index 584f41eb4c14..795f5ebd9341 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -4758,23 +4758,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> >> return nr_to_scan < 0;
> >> }
> >>
> >> -static void shrink_one(struct lruvec *lruvec, struct scan_control *sc)
> >> -{
> >> - unsigned long scanned = sc->nr_scanned;
> >> - unsigned long reclaimed = sc->nr_reclaimed;
> >> - struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> >> - struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> >> -
> >> - try_to_shrink_lruvec(lruvec, sc);
> >> -
> >> - shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
> >> -
> >> - if (!sc->proactive)
> >> - vmpressure(sc->gfp_mask, memcg, false, sc->nr_scanned - scanned,
> >> - sc->nr_reclaimed - reclaimed);
> >> -
> >> - flush_reclaim_state(sc);
> >> -}
> >> +static void shrink_one(struct lruvec *lruvec, struct scan_control *sc);
> >>
> >> static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
> >> {
> >> @@ -5760,6 +5744,27 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
> >> return inactive_lru_pages > pages_for_compaction;
> >> }
> >>
> >> +static void shrink_one(struct lruvec *lruvec, struct scan_control *sc)
> >> +{
> >> + unsigned long scanned = sc->nr_scanned;
> >> + unsigned long reclaimed = sc->nr_reclaimed;
> >> + struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> >> + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> >> +
> >> + if (lru_gen_enabled() && root_reclaim(sc))
> >> + try_to_shrink_lruvec(lruvec, sc);
> >> + else
> >> + shrink_lruvec(lruvec, sc);
> >
>
> Hi Johannes, thank you for your reply.
>
> > Yikes. So we end up with:
> >
> > shrink_node_memcgs()
> > shrink_one()
> > if lru_gen_enabled && root_reclaim(sc)
> > try_to_shrink_lruvec(lruvec, sc)
> > else
> > shrink_lruvec()
> > if lru_gen_enabled && !root_reclaim(sc)
> > lru_gen_shrink_lruvec(lruvec, sc)
> > try_to_shrink_lruvec()
> >
> > I think it's doing too much at once. Can you get it into the following
> > shape:
> >
>
> You're absolutely right. This refactoring is indeed what patch 5/5 implements.
>
> With patch 5/5 applied, the flow becomes:
>
> shrink_node_memcgs()
> shrink_one()
> if lru_gen_enabled
> lru_gen_shrink_lruvec --> symmetric with else shrink_lruvec()
> if (root_reclaim(sc)) --> handle root reclaim.
> try_to_shrink_lruvec()
> else
> ...
> try_to_shrink_lruvec()
> else
> shrink_lruvec()
>
> This matches the structure you described.
>
> One note: shrink_one() is also called from lru_gen_shrink_node() when memcg is disabled, so I
> believe it makes sense to keep this helper.
I don't think we need shrink_one, as it can be inlined into its callers. Also,
shrink_node_memcgs() already handles the mem_cgroup_disabled() case, so
lru_gen_shrink_node() should not need shrink_one for that case.
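I.e. the special case in lru_gen_shrink_node() could simply go away, roughly
(untested sketch):

	-	if (mem_cgroup_disabled())
	-		shrink_one(&pgdat->__lruvec, sc);
	-	else
	-		shrink_node_memcgs(pgdat, sc);
	+	shrink_node_memcgs(pgdat, sc);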
>
> > shrink_node_memcgs()
> > for each memcg:
> > if lru_gen_enabled:
> > lru_gen_shrink_lruvec()
> > else
> > shrink_lruvec()
> >
I actually like what Johannes has requested above but if that is not
possible without changing some behavior then let's aim to do as much as
possible in this series while keeping the same behavior. In a followup
we can try to combine the behavior part.
>
> Regarding the patch split, I have kept patches 3/5 and 5/5 separate to make the changes clearer
> in each step. Would you prefer that I merge patch 3/5 with patch 5/5, so the full refactoring
> appears in one patch?
>
> Looking forward to your guidance.
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH -next 3/5] mm/mglru: extend shrink_one for both lrugen and non-lrugen
2025-12-22 21:36 ` Shakeel Butt
@ 2025-12-23 1:00 ` Chen Ridong
0 siblings, 0 replies; 25+ messages in thread
From: Chen Ridong @ 2025-12-23 1:00 UTC (permalink / raw)
To: Shakeel Butt
Cc: Johannes Weiner, akpm, axelrasmussen, yuanchu, weixugc, david,
lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
corbet, roman.gushchin, muchun.song, zhengqi.arch, linux-mm,
linux-doc, linux-kernel, cgroups, lujialin4, zhongjinji
On 2025/12/23 5:36, Shakeel Butt wrote:
> On Tue, Dec 16, 2025 at 09:14:45AM +0800, Chen Ridong wrote:
>>
>>
>> On 2025/12/16 5:13, Johannes Weiner wrote:
>>> On Tue, Dec 09, 2025 at 01:25:55AM +0000, Chen Ridong wrote:
>>>> From: Chen Ridong <chenridong@huawei.com>
>>>>
>>>> Currently, flush_reclaim_state is placed differently between
>>>> shrink_node_memcgs and shrink_many. shrink_many (only used for gen-LRU)
>>>> calls it after each lruvec is shrunk, while shrink_node_memcgs calls it
>>>> only after all lruvecs have been shrunk.
>>>>
>>>> This patch moves flush_reclaim_state into shrink_node_memcgs and calls it
>>>> after each lruvec. This unifies the behavior and is reasonable because:
>>>>
>>>> 1. flush_reclaim_state adds current->reclaim_state->reclaimed to
>>>> sc->nr_reclaimed.
>>>> 2. For non-MGLRU root reclaim, this can help stop the iteration earlier
>>>> when nr_to_reclaim is reached.
>>>> 3. For non-root reclaim, the effect is negligible since flush_reclaim_state
>>>> does nothing in that case.
>>>>
>>>> After moving flush_reclaim_state into shrink_node_memcgs, shrink_one can be
>>>> extended to support both lrugen and non-lrugen paths. It will call
>>>> try_to_shrink_lruvec for lrugen root reclaim and shrink_lruvec otherwise.
>>>>
>>>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
>>>> ---
>>>> mm/vmscan.c | 57 +++++++++++++++++++++--------------------------------
>>>> 1 file changed, 23 insertions(+), 34 deletions(-)
>>>>
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 584f41eb4c14..795f5ebd9341 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -4758,23 +4758,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>>>> return nr_to_scan < 0;
>>>> }
>>>>
>>>> -static void shrink_one(struct lruvec *lruvec, struct scan_control *sc)
>>>> -{
>>>> - unsigned long scanned = sc->nr_scanned;
>>>> - unsigned long reclaimed = sc->nr_reclaimed;
>>>> - struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>>> - struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>> -
>>>> - try_to_shrink_lruvec(lruvec, sc);
>>>> -
>>>> - shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
>>>> -
>>>> - if (!sc->proactive)
>>>> - vmpressure(sc->gfp_mask, memcg, false, sc->nr_scanned - scanned,
>>>> - sc->nr_reclaimed - reclaimed);
>>>> -
>>>> - flush_reclaim_state(sc);
>>>> -}
>>>> +static void shrink_one(struct lruvec *lruvec, struct scan_control *sc);
>>>>
>>>> static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
>>>> {
>>>> @@ -5760,6 +5744,27 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
>>>> return inactive_lru_pages > pages_for_compaction;
>>>> }
>>>>
>>>> +static void shrink_one(struct lruvec *lruvec, struct scan_control *sc)
>>>> +{
>>>> + unsigned long scanned = sc->nr_scanned;
>>>> + unsigned long reclaimed = sc->nr_reclaimed;
>>>> + struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>>> + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>> +
>>>> + if (lru_gen_enabled() && root_reclaim(sc))
>>>> + try_to_shrink_lruvec(lruvec, sc);
>>>> + else
>>>> + shrink_lruvec(lruvec, sc);
>>>
>>
>> Hi Johannes, thank you for your reply.
>>
>>> Yikes. So we end up with:
>>>
>>> shrink_node_memcgs()
>>> shrink_one()
>>> if lru_gen_enabled && root_reclaim(sc)
>>> try_to_shrink_lruvec(lruvec, sc)
>>> else
>>> shrink_lruvec()
>>> if lru_gen_enabled && !root_reclaim(sc)
>>> lru_gen_shrink_lruvec(lruvec, sc)
>>> try_to_shrink_lruvec()
>>>
>>> I think it's doing too much at once. Can you get it into the following
>>> shape:
>>>
>>
>> You're absolutely right. This refactoring is indeed what patch 5/5 implements.
>>
>> With patch 5/5 applied, the flow becomes:
>>
>> shrink_node_memcgs()
>> shrink_one()
>> if lru_gen_enabled
>> lru_gen_shrink_lruvec --> symmetric with else shrink_lruvec()
>> if (root_reclaim(sc)) --> handle root reclaim.
>> try_to_shrink_lruvec()
>> else
>> ...
>> try_to_shrink_lruvec()
>> else
>> shrink_lruvec()
>>
>> This matches the structure you described.
>>
>> One note: shrink_one() is also called from lru_gen_shrink_node() when memcg is disabled, so I
>> believe it makes sense to keep this helper.
>
> I don't think we need shrink_one, as it can be inlined into its callers. Also,
> shrink_node_memcgs() already handles the mem_cgroup_disabled() case, so
> lru_gen_shrink_node() should not need shrink_one for that case.
>
I think you mean:
shrink_node
lru_gen_shrink_node
// We do not need to handle memcg-disabled case here,
// because shrink_node_memcgs can already handle it.
shrink_node_memcgs
for each memcg:
if lru_gen_enabled:
lru_gen_shrink_lruvec()
else
shrink_lruvec()
shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
if (!sc->proactive)
vmpressure(...)
flush_reclaim_state(sc);
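In C, that loop could look roughly like this (untested sketch; the reclaim
cookie handling and the min/low protection checks stay as they are today and
are elided here):

	memcg = mem_cgroup_iter(target_memcg, NULL, NULL);
	do {
		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);

		cond_resched();

		mem_cgroup_calculate_protection(target_memcg, memcg);
		/* min/low protection checks as in shrink_node_memcgs() today ... */

		if (lru_gen_enabled())
			lru_gen_shrink_lruvec(lruvec, sc);
		else
			shrink_lruvec(lruvec, sc);

		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
		/* vmpressure (if !sc->proactive) as today ... */
		flush_reclaim_state(sc);
	} while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));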
With this structure, both shrink_many and shrink_one are no longer needed. That looks much cleaner.
I will update it accordingly.
Thank you very much.
>>
>>> shrink_node_memcgs()
>>> for each memcg:
>>> if lru_gen_enabled:
>>> lru_gen_shrink_lruvec()
>>> else
>>> shrink_lruvec()
>>>
>
> I actually like what Johannes has requested above but if that is not
> possible without changing some behavior then let's aim to do as much as
> possible in this series while keeping the same behavior. In a followup
> we can try to combine the behavior part.
>
>>
>> Regarding the patch split, I have kept patches 3/5 and 5/5 separate to make the changes clearer
>> in each step. Would you prefer that I merge patch 3/5 with patch 5/5, so the full refactoring
>> appears in one patch?
>>
>> Looking forward to your guidance.
--
Best regards,
Ridong
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH -next 3/5] mm/mglru: extend shrink_one for both lrugen and non-lrugen
2025-12-09 1:25 ` [PATCH -next 3/5] mm/mglru: extend shrink_one for both lrugen and non-lrugen Chen Ridong
2025-12-12 2:55 ` kernel test robot
2025-12-15 21:13 ` Johannes Weiner
@ 2025-12-22 3:49 ` Shakeel Butt
2025-12-22 7:44 ` Chen Ridong
2 siblings, 1 reply; 25+ messages in thread
From: Shakeel Butt @ 2025-12-22 3:49 UTC (permalink / raw)
To: Chen Ridong
Cc: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet, hannes,
roman.gushchin, muchun.song, zhengqi.arch, linux-mm, linux-doc,
linux-kernel, cgroups, lujialin4, zhongjinji
On Tue, Dec 09, 2025 at 01:25:55AM +0000, Chen Ridong wrote:
> From: Chen Ridong <chenridong@huawei.com>
>
> Currently, flush_reclaim_state is placed differently between
> shrink_node_memcgs and shrink_many. shrink_many (only used for gen-LRU)
> calls it after each lruvec is shrunk, while shrink_node_memcgs calls it
> only after all lruvecs have been shrunk.
>
> This patch moves flush_reclaim_state into shrink_node_memcgs and calls it
> after each lruvec. This unifies the behavior and is reasonable because:
>
> 1. flush_reclaim_state adds current->reclaim_state->reclaimed to
> sc->nr_reclaimed.
> 2. For non-MGLRU root reclaim, this can help stop the iteration earlier
> when nr_to_reclaim is reached.
> 3. For non-root reclaim, the effect is negligible since flush_reclaim_state
> does nothing in that case.
Please decouple the flush_reclaim_state() changes into a separate patch, i.e.
make the calls to flush_reclaim_state() similar for MGLRU and non-MGLRU.
For the remainder of the patch, I will respond on the other email chain.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH -next 3/5] mm/mglru: extend shrink_one for both lrugen and non-lrugen
2025-12-22 3:49 ` Shakeel Butt
@ 2025-12-22 7:44 ` Chen Ridong
0 siblings, 0 replies; 25+ messages in thread
From: Chen Ridong @ 2025-12-22 7:44 UTC (permalink / raw)
To: Shakeel Butt
Cc: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet, hannes,
roman.gushchin, muchun.song, zhengqi.arch, linux-mm, linux-doc,
linux-kernel, cgroups, lujialin4, zhongjinji
On 2025/12/22 11:49, Shakeel Butt wrote:
> On Tue, Dec 09, 2025 at 01:25:55AM +0000, Chen Ridong wrote:
>> From: Chen Ridong <chenridong@huawei.com>
>>
>> Currently, flush_reclaim_state is placed differently between
>> shrink_node_memcgs and shrink_many. shrink_many (only used for gen-LRU)
>> calls it after each lruvec is shrunk, while shrink_node_memcgs calls it
>> only after all lruvecs have been shrunk.
>>
>> This patch moves flush_reclaim_state into shrink_node_memcgs and calls it
>> after each lruvec. This unifies the behavior and is reasonable because:
>>
>> 1. flush_reclaim_state adds current->reclaim_state->reclaimed to
>> sc->nr_reclaimed.
>> 2. For non-MGLRU root reclaim, this can help stop the iteration earlier
>> when nr_to_reclaim is reached.
>> 3. For non-root reclaim, the effect is negligible since flush_reclaim_state
>> does nothing in that case.
>
> Please decouple the flush_reclaim_state() changes into a separate patch, i.e.
> make the calls to flush_reclaim_state() similar for MGLRU and non-MGLRU.
>
> For the remainder of the patch, I will respond on the other email chain.
Thank you for the suggestion.
This change essentially moves only one line of code. I will add a separate patch to handle it
accordingly.
--
Best regards,
Ridong
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH -next 4/5] mm/mglru: combine shrink_many into shrink_node_memcgs
2025-12-09 1:25 [PATCH -next 0/5] mm/mglru: remove memcg lru Chen Ridong
` (2 preceding siblings ...)
2025-12-09 1:25 ` [PATCH -next 3/5] mm/mglru: extend shrink_one for both lrugen and non-lrugen Chen Ridong
@ 2025-12-09 1:25 ` Chen Ridong
2025-12-15 21:17 ` Johannes Weiner
2025-12-09 1:25 ` [PATCH -next 5/5] mm/mglru: factor lrugen state out of shrink_lruvec Chen Ridong
` (2 subsequent siblings)
6 siblings, 1 reply; 25+ messages in thread
From: Chen Ridong @ 2025-12-09 1:25 UTC (permalink / raw)
To: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet, hannes,
roman.gushchin, shakeel.butt, muchun.song, zhengqi.arch
Cc: linux-mm, linux-doc, linux-kernel, cgroups, lujialin4,
chenridong, zhongjinji
From: Chen Ridong <chenridong@huawei.com>
The previous patch extended shrink_one to support both lrugen and
non-lrugen reclaim. Now shrink_many and shrink_node_memcgs are almost
identical, except that shrink_many also calls should_abort_scan for lrugen
root reclaim.
This patch adds the should_abort_scan check to shrink_node_memcgs (which is
only meaningful for gen-LRU root reclaim). After this change,
shrink_node_memcgs can be used directly instead of shrink_many, allowing
shrink_many to be safely removed.
Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
mm/vmscan.c | 67 ++++++++++++-----------------------------------------
1 file changed, 15 insertions(+), 52 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 795f5ebd9341..dbf2cfbe3243 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4758,57 +4758,6 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
return nr_to_scan < 0;
}
-static void shrink_one(struct lruvec *lruvec, struct scan_control *sc);
-
-static void shrink_many(struct pglist_data *pgdat, struct scan_control *sc)
-{
- struct mem_cgroup *target = sc->target_mem_cgroup;
- struct mem_cgroup_reclaim_cookie reclaim = {
- .pgdat = pgdat,
- };
- struct mem_cgroup_reclaim_cookie *cookie = &reclaim;
- struct mem_cgroup *memcg;
-
- if (current_is_kswapd() || sc->memcg_full_walk)
- cookie = NULL;
-
- memcg = mem_cgroup_iter(target, NULL, cookie);
- while (memcg) {
- struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
-
- cond_resched();
-
- mem_cgroup_calculate_protection(target, memcg);
-
- if (mem_cgroup_below_min(target, memcg))
- goto next;
-
- if (mem_cgroup_below_low(target, memcg)) {
- if (!sc->memcg_low_reclaim) {
- sc->memcg_low_skipped = 1;
- goto next;
- }
- memcg_memory_event(memcg, MEMCG_LOW);
- }
-
- shrink_one(lruvec, sc);
-
- if (should_abort_scan(lruvec, sc)) {
- if (cookie)
- mem_cgroup_iter_break(target, memcg);
- break;
- }
-
-next:
- if (cookie && sc->nr_reclaimed >= sc->nr_to_reclaim) {
- mem_cgroup_iter_break(target, memcg);
- break;
- }
-
- memcg = mem_cgroup_iter(target, memcg, cookie);
- }
-}
-
static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
struct blk_plug plug;
@@ -4829,6 +4778,9 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
blk_finish_plug(&plug);
}
+static void shrink_one(struct lruvec *lruvec, struct scan_control *sc);
+static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc);
+
static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *sc)
{
struct blk_plug plug;
@@ -4858,7 +4810,7 @@ static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *
if (mem_cgroup_disabled())
shrink_one(&pgdat->__lruvec, sc);
else
- shrink_many(pgdat, sc);
+ shrink_node_memcgs(pgdat, sc);
if (current_is_kswapd())
sc->nr_reclaimed += reclaimed;
@@ -5554,6 +5506,11 @@ static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *
BUILD_BUG();
}
+static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
+{
+ return false;
+}
+
#endif /* CONFIG_LRU_GEN */
static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
@@ -5822,6 +5779,12 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
shrink_one(lruvec, sc);
+ if (should_abort_scan(lruvec, sc)) {
+ if (partial)
+ mem_cgroup_iter_break(target_memcg, memcg);
+ break;
+ }
+
/* If partial walks are allowed, bail once goal is reached */
if (partial && sc->nr_reclaimed >= sc->nr_to_reclaim) {
mem_cgroup_iter_break(target_memcg, memcg);
--
2.34.1
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH -next 4/5] mm/mglru: combine shrink_many into shrink_node_memcgs
2025-12-09 1:25 ` [PATCH -next 4/5] mm/mglru: combine shrink_many into shrink_node_memcgs Chen Ridong
@ 2025-12-15 21:17 ` Johannes Weiner
2025-12-16 1:23 ` Chen Ridong
2025-12-22 7:40 ` Chen Ridong
0 siblings, 2 replies; 25+ messages in thread
From: Johannes Weiner @ 2025-12-15 21:17 UTC (permalink / raw)
To: Chen Ridong
Cc: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet,
roman.gushchin, shakeel.butt, muchun.song, zhengqi.arch,
linux-mm, linux-doc, linux-kernel, cgroups, lujialin4,
zhongjinji
On Tue, Dec 09, 2025 at 01:25:56AM +0000, Chen Ridong wrote:
> @@ -5822,6 +5779,12 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>
> shrink_one(lruvec, sc);
>
> + if (should_abort_scan(lruvec, sc)) {
Can you please rename this and add the jump label check?
if (lru_gen_enabled() && lru_gen_should_abort_scan())
The majority of the checks in there already happen inside
shrink_node_memcgs() itself. Factoring those out is probably better in
another patch, but no need to burden classic LRU in the meantime.
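Something like (sketch):

	if (lru_gen_enabled() && lru_gen_should_abort_scan(lruvec, sc)) {
		if (partial)
			mem_cgroup_iter_break(target_memcg, memcg);
		break;
	}

with the !CONFIG_LRU_GEN stub from this patch simply renamed:

	static bool lru_gen_should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
	{
		return false;
	}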
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH -next 4/5] mm/mglru: combine shrink_many into shrink_node_memcgs
2025-12-15 21:17 ` Johannes Weiner
@ 2025-12-16 1:23 ` Chen Ridong
2025-12-22 7:40 ` Chen Ridong
1 sibling, 0 replies; 25+ messages in thread
From: Chen Ridong @ 2025-12-16 1:23 UTC (permalink / raw)
To: Johannes Weiner
Cc: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet,
roman.gushchin, shakeel.butt, muchun.song, zhengqi.arch,
linux-mm, linux-doc, linux-kernel, cgroups, lujialin4,
zhongjinji
On 2025/12/16 5:17, Johannes Weiner wrote:
> On Tue, Dec 09, 2025 at 01:25:56AM +0000, Chen Ridong wrote:
>> @@ -5822,6 +5779,12 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>>
>> shrink_one(lruvec, sc);
>>
>> + if (should_abort_scan(lruvec, sc)) {
>
> Can you please rename this and add the jump label check?
>
> if (lru_gen_enabled() && lru_gen_should_abort_scan())
>
> The majority of the checks in there already happen inside
> shrink_node_memcgs() itself. Factoring those out is probably better in
> another patch, but no need to burden classic LRU in the meantime.
Thank you for the suggestion. lru_gen_should_abort_scan() is indeed a better name, and including the
lru_gen_enabled() check in the condition is necessary.
I'll update the patch accordingly.
--
Best regards,
Ridong
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH -next 4/5] mm/mglru: combine shrink_many into shrink_node_memcgs
2025-12-15 21:17 ` Johannes Weiner
2025-12-16 1:23 ` Chen Ridong
@ 2025-12-22 7:40 ` Chen Ridong
1 sibling, 0 replies; 25+ messages in thread
From: Chen Ridong @ 2025-12-22 7:40 UTC (permalink / raw)
To: Johannes Weiner, Shakeel Butt
Cc: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet,
roman.gushchin, shakeel.butt, muchun.song, zhengqi.arch,
linux-mm, linux-doc, linux-kernel, cgroups, lujialin4,
zhongjinji
On 2025/12/16 5:17, Johannes Weiner wrote:
> On Tue, Dec 09, 2025 at 01:25:56AM +0000, Chen Ridong wrote:
>> @@ -5822,6 +5779,12 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>>
>> shrink_one(lruvec, sc);
>>
>> + if (should_abort_scan(lruvec, sc)) {
>
> Can you please rename this and add the jump label check?
>
> if (lru_gen_enabled() && lru_gen_should_abort_scan())
>
> The majority of the checks in there already happen inside
> shrink_node_memcgs() itself. Factoring those out is probably better in
> another patch, but no need to burden classic LRU in the meantime.
Adding should_abort_scan for the classic LRU seems reasonable, as it would allow the scan to stop
earlier when sufficient pages have been reclaimed or the watermarks are satisfied for global reclaim.
Refer to the discussion here:
https://lore.kernel.org/lkml/20251209012557.1949239-1-chenridong@huaweicloud.com/T/#m4eea017f5a222ba676d9222f59ad8c898ac2aefe
--
Best regards,
Ridong
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH -next 5/5] mm/mglru: factor lrugen state out of shrink_lruvec
2025-12-09 1:25 [PATCH -next 0/5] mm/mglru: remove memcg lru Chen Ridong
` (3 preceding siblings ...)
2025-12-09 1:25 ` [PATCH -next 4/5] mm/mglru: combine shrink_many into shrink_node_memcgs Chen Ridong
@ 2025-12-09 1:25 ` Chen Ridong
2025-12-12 10:15 ` [PATCH -next 0/5] mm/mglru: remove memcg lru Chen Ridong
2025-12-15 16:18 ` Michal Koutný
6 siblings, 0 replies; 25+ messages in thread
From: Chen Ridong @ 2025-12-09 1:25 UTC (permalink / raw)
To: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet, hannes,
roman.gushchin, shakeel.butt, muchun.song, zhengqi.arch
Cc: linux-mm, linux-doc, linux-kernel, cgroups, lujialin4,
chenridong, zhongjinji
From: Chen Ridong <chenridong@huawei.com>
A previous patch updated shrink_node_memcgs to handle lrugen root reclaim
and extended shrink_one to support both lrugen and non-lrugen. However,
in shrink_one, lrugen non-root reclaim still invokes shrink_lruvec, which
should only be used for non-lrugen reclaim.
To clarify the semantics, this patch moves the lrugen-specific logic out of
shrink_lruvec, leaving shrink_lruvec exclusively for non-lrugen reclaim.
Now for lrugen, shrink_one invokes lru_gen_shrink_lruvec, which calls
try_to_shrink_lruvec directly, without extra handling for root reclaim, as
that processing is already done in lru_gen_shrink_node. Non-root reclaim
behavior remains unchanged.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
mm/vmscan.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index dbf2cfbe3243..c5f517ec52a7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4762,7 +4762,12 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
{
struct blk_plug plug;
- VM_WARN_ON_ONCE(root_reclaim(sc));
+ /* Root reclaim has finished other extra work outside, just shrink. */
+ if (root_reclaim(sc)) {
+ try_to_shrink_lruvec(lruvec, sc);
+ return;
+ }
+
VM_WARN_ON_ONCE(!sc->may_writepage || !sc->may_unmap);
lru_add_drain();
@@ -5524,11 +5529,6 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
bool proportional_reclaim;
struct blk_plug plug;
- if (lru_gen_enabled() && !root_reclaim(sc)) {
- lru_gen_shrink_lruvec(lruvec, sc);
- return;
- }
-
get_scan_count(lruvec, sc, nr);
/* Record the original scan target for proportional adjustments later */
@@ -5708,8 +5708,8 @@ static void shrink_one(struct lruvec *lruvec, struct scan_control *sc)
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
- if (lru_gen_enabled() && root_reclaim(sc))
- try_to_shrink_lruvec(lruvec, sc);
+ if (lru_gen_enabled())
+ lru_gen_shrink_lruvec(lruvec, sc);
else
shrink_lruvec(lruvec, sc);
--
2.34.1
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH -next 0/5] mm/mglru: remove memcg lru
2025-12-09 1:25 [PATCH -next 0/5] mm/mglru: remove memcg lru Chen Ridong
` (4 preceding siblings ...)
2025-12-09 1:25 ` [PATCH -next 5/5] mm/mglru: factor lrugen state out of shrink_lruvec Chen Ridong
@ 2025-12-12 10:15 ` Chen Ridong
2025-12-15 16:18 ` Michal Koutný
6 siblings, 0 replies; 25+ messages in thread
From: Chen Ridong @ 2025-12-12 10:15 UTC (permalink / raw)
To: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet, hannes,
roman.gushchin, shakeel.butt, muchun.song, zhengqi.arch
Cc: linux-mm, linux-doc, linux-kernel, cgroups, lujialin4, zhongjinji
On 2025/12/9 9:25, Chen Ridong wrote:
> From: Chen Ridong <chenridong@huawei.com>
>
> The memcg LRU was introduced to improve scalability in global reclaim,
> but its implementation has grown complex and can cause performance
> regressions when creating many memory cgroups [1].
>
> This series implements mem_cgroup_iter with a reclaim cookie in
> shrink_many() for global reclaim, following the pattern already used in
> shrink_node_memcgs(), an approach suggested by Johannes [1]. The new
> design maintains good fairness across cgroups by preserving iteration
> state between reclaim passes.
>
> Testing was performed using the original stress test from Yu Zhao [2] on a
> 1 TB, 4-node NUMA system. The results show:
>
> pgsteal:
>                                    memcg LRU    memcg iter
> stddev(pgsteal) / mean(pgsteal)      106.03%        93.20%
> sum(pgsteal) / sum(requested)         98.10%        99.28%
>
> workingset_refault_anon:
>                                    memcg LRU    memcg iter
> stddev(refault) / mean(refault)      193.97%       134.67%
> sum(refault)                       1,963,229     2,027,567
>
> The new implementation shows clear fairness improvements, reducing the
> standard deviation relative to the mean by 12.8 percentage points for
> pgsteal and bringing the pgsteal ratio closer to 100%. Refault counts
> increased by 3.2% (from 1,963,229 to 2,027,567).
>
> To simplify review:
> 1. Patch 1 uses mem_cgroup_iter with reclaim cookie in shrink_many()
> 2. Patch 2 removes the now-unused memcg LRU code
> 3. Patches 3–5 combine shrink_many and shrink_node_memcgs
> (This reorganization is clearer after switching to mem_cgroup_iter)
>
> ---
>
> Changes from RFC series:
> 1. Updated the test result data.
> 2. Added patches 3–5 to combine shrink_many and shrink_node_memcgs.
>
> RFC: https://lore.kernel.org/all/20251204123124.1822965-1-chenridong@huaweicloud.com/
>
> Chen Ridong (5):
> mm/mglru: use mem_cgroup_iter for global reclaim
> mm/mglru: remove memcg lru
> mm/mglru: extend shrink_one for both lrugen and non-lrugen
> mm/mglru: combine shrink_many into shrink_node_memcgs
> mm/mglru: factor lrugen state out of shrink_lruvec
>
> Documentation/mm/multigen_lru.rst | 30 ---
> include/linux/mmzone.h | 89 --------
> mm/memcontrol-v1.c | 6 -
> mm/memcontrol.c | 4 -
> mm/mm_init.c | 1 -
> mm/vmscan.c | 332 ++++--------------------------
> 6 files changed, 44 insertions(+), 418 deletions(-)
>
Hello all,
There's a warning from the kernel test robot, and I would like to update the series to fix it along
with any feedback from your reviews.
I'd appreciate it if you could take a look at this patch series when convenient.
Hi Shakeel, I would be very grateful if you could review patches 3-5. They combine shrink_many and
shrink_node_memcgs as you suggested — does that look good to you?
--
Best regards,
Ridong
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH -next 0/5] mm/mglru: remove memcg lru
2025-12-09 1:25 [PATCH -next 0/5] mm/mglru: remove memcg lru Chen Ridong
` (5 preceding siblings ...)
2025-12-12 10:15 ` [PATCH -next 0/5] mm/mglru: remove memcg lru Chen Ridong
@ 2025-12-15 16:18 ` Michal Koutný
2025-12-16 0:45 ` Chen Ridong
6 siblings, 1 reply; 25+ messages in thread
From: Michal Koutný @ 2025-12-15 16:18 UTC (permalink / raw)
To: Chen Ridong
Cc: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet, hannes,
roman.gushchin, shakeel.butt, muchun.song, zhengqi.arch,
linux-mm, linux-doc, linux-kernel, cgroups, lujialin4,
zhongjinji
Hi.
On Tue, Dec 09, 2025 at 01:25:52AM +0000, Chen Ridong <chenridong@huaweicloud.com> wrote:
> From: Chen Ridong <chenridong@huawei.com>
>
> The memcg LRU was introduced to improve scalability in global reclaim,
> but its implementation has grown complex and can cause performance
> regressions when creating many memory cgroups [1].
>
> This series implements mem_cgroup_iter with a reclaim cookie in
> shrink_many() for global reclaim, following the pattern already used in
> shrink_node_memcgs(), an approach suggested by Johannes [1]. The new
> design maintains good fairness across cgroups by preserving iteration
> state between reclaim passes.
>
> Testing was performed using the original stress test from Yu Zhao [2] on a
> 1 TB, 4-node NUMA system. The results show:
(I think the cover letter somehow lost the targets of [1],[2]. I assume
I could retrieve those from patch 1/5.)
>
> pgsteal:
>                                    memcg LRU    memcg iter
> stddev(pgsteal) / mean(pgsteal)      106.03%        93.20%
> sum(pgsteal) / sum(requested)         98.10%        99.28%
>
> workingset_refault_anon:
>                                    memcg LRU    memcg iter
> stddev(refault) / mean(refault)      193.97%       134.67%
> sum(refault)                       1,963,229     2,027,567
>
> The new implementation shows clear fairness improvements, reducing the
> standard deviation relative to the mean by 12.8 percentage points for
> pgsteal and bringing the pgsteal ratio closer to 100%. Refault counts
> increased by 3.2% (from 1,963,229 to 2,027,567).
Just as a quick clarification -- this isn't supposed to affect regular
(CONFIG_LRU_GEN_ENABLED=n) reclaim, correct?
Thanks,
Michal
^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH -next 0/5] mm/mglru: remove memcg lru
2025-12-15 16:18 ` Michal Koutný
@ 2025-12-16 0:45 ` Chen Ridong
0 siblings, 0 replies; 25+ messages in thread
From: Chen Ridong @ 2025-12-16 0:45 UTC (permalink / raw)
To: Michal Koutný
Cc: akpm, axelrasmussen, yuanchu, weixugc, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, corbet, hannes,
roman.gushchin, shakeel.butt, muchun.song, zhengqi.arch,
linux-mm, linux-doc, linux-kernel, cgroups, lujialin4,
zhongjinji
On 2025/12/16 0:18, Michal Koutný wrote:
> Hi.
>
> On Tue, Dec 09, 2025 at 01:25:52AM +0000, Chen Ridong <chenridong@huaweicloud.com> wrote:
>> From: Chen Ridong <chenridong@huawei.com>
>>
>> The memcg LRU was introduced to improve scalability in global reclaim,
>> but its implementation has grown complex and can cause performance
>> regressions when creating many memory cgroups [1].
>>
>> This series implements mem_cgroup_iter with a reclaim cookie in
>> shrink_many() for global reclaim, following the pattern already used in
>> shrink_node_memcgs(), an approach suggested by Johannes [1]. The new
>> design maintains good fairness across cgroups by preserving iteration
>> state between reclaim passes.
>>
>> Testing was performed using the original stress test from Yu Zhao [2] on a
>> 1 TB, 4-node NUMA system. The results show:
>
> (I think the cover letter somehow lost the targets of [1],[2]. I assume
> I could retrieve those from patch 1/5.)
>
Hi Michal,
Thanks for pointing that out. Apologies for missing the links in the cover letter; you can find them in patch 1/5.
>
>>
>> pgsteal:
>>                                    memcg LRU    memcg iter
>> stddev(pgsteal) / mean(pgsteal)      106.03%        93.20%
>> sum(pgsteal) / sum(requested)         98.10%        99.28%
>>
>> workingset_refault_anon:
>>                                    memcg LRU    memcg iter
>> stddev(refault) / mean(refault)      193.97%       134.67%
>> sum(refault)                       1,963,229     2,027,567
>>
>> The new implementation shows clear fairness improvements, reducing the
>> standard deviation relative to the mean by 12.8 percentage points for
>> pgsteal and bringing the pgsteal ratio closer to 100%. Refault counts
>> increased by 3.2% (from 1,963,229 to 2,027,567).
>
> Just as a quick clarification -- this isn't supposed to affect regular
> (CONFIG_LRU_GEN_ENABLED=n) reclaim, correct?
>
> Thanks,
> Michal
That's correct. To be precise, it only affects root reclaim when lru_gen_enabled() returns true.
Note that the generation LRU can still be enabled via /sys/kernel/mm/lru_gen/enabled even when
CONFIG_LRU_GEN_ENABLED=n.
--
Best regards,
Ridong
^ permalink raw reply [flat|nested] 25+ messages in thread