* [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
@ 2026-02-28 16:10 Leno Hou
2026-02-28 18:58 ` Andrew Morton
` (4 more replies)
0 siblings, 5 replies; 22+ messages in thread
From: Leno Hou @ 2026-02-28 16:10 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Leno Hou, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Barry Song, Jialing Wang, Yafang Shao, Yu Zhao
When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
condition exists between the state switching and the memory reclaim
path. This can lead to unexpected cgroup OOM kills, even when plenty of
reclaimable memory is available.
*** Problem Description ***
The issue arises from a "reclaim vacuum" during the transition:
1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
false before the pages are drained from MGLRU lists back to
traditional LRU lists.
2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
and skip the MGLRU path.
3. However, these pages might not have reached the traditional LRU lists
yet, or the changes are not yet visible to all CPUs due to a lack of
synchronization.
4. get_scan_count() subsequently finds traditional LRU lists empty,
concludes there is no reclaimable memory, and triggers an OOM kill.
A similar race can occur during enablement, where the reclaimer sees
the new state but the MGLRU lists haven't been populated via
fill_evictable() yet.
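The vacuum in steps 1-4 comes down to the pre-patch dispatch logic in
shrink_lruvec(). The following stand-alone sketch (illustrative user-space
C, not kernel code; the function name and parameters are hypothetical
simplifications) shows that a folio still sitting on the MGLRU lists
becomes invisible the moment 'enabled' is cleared:

```c
#include <stdbool.h>

/*
 * Toy user-space model of the pre-patch shrink_lruvec() dispatch.
 * Purely illustrative: the names below are hypothetical, not kernel code.
 *
 * enabled:  the value of lrugen->enabled the reclaimer observes
 * on_mglru: whether the folio still sits on the MGLRU lists
 */
static bool prepatch_reclaimer_sees_folio(bool enabled, bool on_mglru)
{
	if (enabled)
		return on_mglru;	/* only lru_gen_shrink_lruvec() runs */
	return !on_mglru;		/* only get_scan_count() runs */
}
```

With 'enabled' already cleared but the folio not yet drained
(enabled=false, on_mglru=true), neither path sees it, matching the
spurious "no reclaimable memory" conclusion in step 4.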
*** Solution ***
Introduce a 'draining' state to bridge the gap during transitions:
- Use smp_store_release() and smp_load_acquire() so that updates to the
'enabled' and 'draining' flags are ordered against the list manipulation
and observed consistently by reclaimers on other CPUs.
- Modify shrink_lruvec() to allow a "joint reclaim" period. If an lruvec
is in the 'draining' state, the reclaimer will attempt to scan MGLRU
lists first, and then fall through to traditional LRU lists instead
of returning early. This ensures that folios are visible to at least
one reclaim path at any given time.
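The joint-reclaim rule can be sketched the same way (illustrative
user-space C; names are hypothetical, and the condition is written with
explicit grouping):

```c
#include <stdbool.h>

/*
 * Toy user-space model of the patched dispatch: while 'draining' is set,
 * a reclaimer scans the MGLRU lists and then falls through to the
 * traditional lists. Illustrative only, not kernel code.
 */
static bool joint_reclaimer_sees_folio(bool enabled, bool draining,
				       bool on_mglru)
{
	if (enabled || draining) {
		if (on_mglru)
			return true;	/* found by lru_gen_shrink_lruvec() */
		if (!draining)
			return false;	/* steady MGLRU state: no fall-through */
	}
	return !on_mglru;		/* found by the traditional LRU path */
}
```

In both transition states (draining=true) the folio is visible no matter
which list it currently sits on, which is exactly the window the patch
closes.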
*** Reproduction ***
The issue was consistently reproduced on v6.1.157 and v6.18.3 using
a high-pressure memory cgroup (v1) environment.
Reproduction steps:
1. Create a 16GB memcg and populate it with 10GB file cache (5GB active)
and 8GB active anonymous memory.
2. Toggle MGLRU state while performing new memory allocations to force
direct reclaim.
Reproduction script:
---
#!/bin/bash
# Fixed reproduction for memcg OOM during MGLRU toggle
set -euo pipefail
MGLRU_FILE="/sys/kernel/mm/lru_gen/enabled"
CGROUP_PATH="/sys/fs/cgroup/memory/memcg_oom_test"
# Switch MGLRU dynamically in the background
switch_mglru() {
	local orig_val
	orig_val=$(cat "$MGLRU_FILE")
	if [[ "$orig_val" != "0x0000" ]]; then
		echo n > "$MGLRU_FILE" &
	else
		echo y > "$MGLRU_FILE" &
	fi
}
# Setup 16G memcg
mkdir -p "$CGROUP_PATH"
echo $((16 * 1024 * 1024 * 1024)) > "$CGROUP_PATH/memory.limit_in_bytes"
echo $$ > "$CGROUP_PATH/cgroup.procs"
# 1. Build memory pressure (File + Anon)
dd if=/dev/urandom of=/tmp/test_file bs=1M count=10240
dd if=/tmp/test_file of=/dev/null bs=1M # Warm up cache
stress-ng --vm 1 --vm-bytes 8G --vm-keep -t 600 &
sleep 5
# 2. Trigger switch and concurrent allocation
switch_mglru
stress-ng --vm 1 --vm-bytes 2G --vm-populate --timeout 5s || echo "OOM Triggered"
# Check OOM counter
grep oom_kill "$CGROUP_PATH/memory.oom_control"
---
Signed-off-by: Leno Hou <lenohou@gmail.com>
---
To: linux-mm@kvack.org
To: linux-kernel@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Barry Song <21cnbao@gmail.com>
Cc: Jialing Wang <wjl.linux@gmail.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
---
include/linux/mmzone.h | 2 ++
mm/vmscan.c | 14 +++++++++++---
2 files changed, 13 insertions(+), 3 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7fb7331c5725..0648ce91dbc6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -509,6 +509,8 @@ struct lru_gen_folio {
 	atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
 	/* whether the multi-gen LRU is enabled */
 	bool enabled;
+	/* whether the multi-gen LRU is draining to LRU */
+	bool draining;
 	/* the memcg generation this lru_gen_folio belongs to */
 	u8 gen;
 	/* the list segment this lru_gen_folio belongs to */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 06071995dacc..629a00681163 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5222,7 +5222,8 @@ static void lru_gen_change_state(bool enabled)
 		VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
 		VM_WARN_ON_ONCE(!state_is_valid(lruvec));
 
-		lruvec->lrugen.enabled = enabled;
+		smp_store_release(&lruvec->lrugen.enabled, enabled);
+		smp_store_release(&lruvec->lrugen.draining, true);
 
 		while (!(enabled ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
 			spin_unlock_irq(&lruvec->lru_lock);
@@ -5230,6 +5231,8 @@ static void lru_gen_change_state(bool enabled)
 			spin_lock_irq(&lruvec->lru_lock);
 		}
 
+		smp_store_release(&lruvec->lrugen.draining, false);
+
 		spin_unlock_irq(&lruvec->lru_lock);
 	}
 
@@ -5813,10 +5816,15 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
 	bool proportional_reclaim;
 	struct blk_plug plug;
+	bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
+	bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
 
-	if (lru_gen_enabled() && !root_reclaim(sc)) {
+	if (lrugen_enabled || lru_draining && !root_reclaim(sc)) {
 		lru_gen_shrink_lruvec(lruvec, sc);
-		return;
+
+		if (!lru_draining)
+			return;
+
 	}
 
 	get_scan_count(lruvec, sc, nr);
--
2.52.0
* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
From: Andrew Morton @ 2026-02-28 18:58 UTC (permalink / raw)
To: Leno Hou
Cc: linux-mm, linux-kernel, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Barry Song, Jialing Wang, Yafang Shao, Yu Zhao
On Sun, 1 Mar 2026 00:10:08 +0800 Leno Hou <lenohou@gmail.com> wrote:
> When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
> condition exists between the state switching and the memory reclaim
> path. This can lead to unexpected cgroup OOM kills, even when plenty of
> reclaimable memory is available.
>
> ...
>
Nice description, thanks. I'll queue this for testing while we await
comments.
>
> Reproduction script:
> ---
Please avoid using the ^---$ separator in changelogs - it means "end of
changelog text"!
> Signed-off-by: Leno Hou <lenohou@gmail.com>
>
> ---
Ditto.
* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
From: kernel test robot @ 2026-02-28 19:12 UTC (permalink / raw)
To: Leno Hou, linux-mm, linux-kernel
Cc: oe-kbuild-all, Leno Hou, Andrew Morton,
Linux Memory Management List, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Barry Song, Jialing Wang, Yafang Shao, Yu Zhao
Hi Leno,
kernel test robot noticed the following build errors:
[auto build test ERROR on v7.0-rc1]
[also build test ERROR on linus/master next-20260227]
[cannot apply to akpm-mm/mm-everything]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Leno-Hou/mm-mglru-fix-cgroup-OOM-during-MGLRU-state-switching/20260301-001148
base: v7.0-rc1
patch link: https://lore.kernel.org/r/20260228161008.707-1-lenohou%40gmail.com
patch subject: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
config: x86_64-randconfig-001-20260301 (https://download.01.org/0day-ci/archive/20260301/202603010315.rTOWjv41-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260301/202603010315.rTOWjv41-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603010315.rTOWjv41-lkp@intel.com/
All errors/warnings (new ones prefixed by >>):
In file included from include/asm-generic/bitops/generic-non-atomic.h:7,
from include/linux/bitops.h:28,
from include/linux/thread_info.h:27,
from include/linux/spinlock.h:60,
from include/linux/mmzone.h:8,
from include/linux/gfp.h:7,
from include/linux/mm.h:8,
from mm/vmscan.c:15:
mm/vmscan.c: In function 'shrink_lruvec':
>> mm/vmscan.c:5785:55: error: 'struct lruvec' has no member named 'lrugen'
5785 | bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
| ^~
arch/x86/include/asm/barrier.h:68:17: note: in definition of macro '__smp_load_acquire'
68 | typeof(*p) ___p1 = READ_ONCE(*p); \
| ^
mm/vmscan.c:5785:31: note: in expansion of macro 'smp_load_acquire'
5785 | bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
| ^~~~~~~~~~~~~~~~
In file included from <command-line>:
>> mm/vmscan.c:5785:55: error: 'struct lruvec' has no member named 'lrugen'
5785 | bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
| ^~
include/linux/compiler_types.h:686:23: note: in definition of macro '__compiletime_assert'
686 | if (!(condition)) \
| ^~~~~~~~~
include/linux/compiler_types.h:706:9: note: in expansion of macro '_compiletime_assert'
706 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^~~~~~~~~~~~~~~~~~~
include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
36 | compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
| ^~~~~~~~~~~~~~~~~~
include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
36 | compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
| ^~~~~~~~~~~~~
include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
49 | compiletime_assert_rwonce_type(x); \
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
arch/x86/include/asm/barrier.h:68:28: note: in expansion of macro 'READ_ONCE'
68 | typeof(*p) ___p1 = READ_ONCE(*p); \
| ^~~~~~~~~
include/asm-generic/barrier.h:176:29: note: in expansion of macro '__smp_load_acquire'
176 | #define smp_load_acquire(p) __smp_load_acquire(p)
| ^~~~~~~~~~~~~~~~~~
mm/vmscan.c:5785:31: note: in expansion of macro 'smp_load_acquire'
5785 | bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
| ^~~~~~~~~~~~~~~~
>> mm/vmscan.c:5785:55: error: 'struct lruvec' has no member named 'lrugen'
5785 | bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
| ^~
include/linux/compiler_types.h:686:23: note: in definition of macro '__compiletime_assert'
686 | if (!(condition)) \
| ^~~~~~~~~
include/linux/compiler_types.h:706:9: note: in expansion of macro '_compiletime_assert'
706 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^~~~~~~~~~~~~~~~~~~
include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
36 | compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
| ^~~~~~~~~~~~~~~~~~
include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
36 | compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
| ^~~~~~~~~~~~~
include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
49 | compiletime_assert_rwonce_type(x); \
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
arch/x86/include/asm/barrier.h:68:28: note: in expansion of macro 'READ_ONCE'
68 | typeof(*p) ___p1 = READ_ONCE(*p); \
| ^~~~~~~~~
include/asm-generic/barrier.h:176:29: note: in expansion of macro '__smp_load_acquire'
176 | #define smp_load_acquire(p) __smp_load_acquire(p)
| ^~~~~~~~~~~~~~~~~~
mm/vmscan.c:5785:31: note: in expansion of macro 'smp_load_acquire'
5785 | bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
| ^~~~~~~~~~~~~~~~
>> mm/vmscan.c:5785:55: error: 'struct lruvec' has no member named 'lrugen'
5785 | bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
| ^~
include/linux/compiler_types.h:642:53: note: in definition of macro '__unqual_scalar_typeof'
642 | #define __unqual_scalar_typeof(x) __typeof_unqual__(x)
| ^
include/asm-generic/rwonce.h:50:9: note: in expansion of macro '__READ_ONCE'
50 | __READ_ONCE(x); \
| ^~~~~~~~~~~
arch/x86/include/asm/barrier.h:68:28: note: in expansion of macro 'READ_ONCE'
68 | typeof(*p) ___p1 = READ_ONCE(*p); \
| ^~~~~~~~~
include/asm-generic/barrier.h:176:29: note: in expansion of macro '__smp_load_acquire'
176 | #define smp_load_acquire(p) __smp_load_acquire(p)
| ^~~~~~~~~~~~~~~~~~
mm/vmscan.c:5785:31: note: in expansion of macro 'smp_load_acquire'
5785 | bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
| ^~~~~~~~~~~~~~~~
In file included from ./arch/x86/include/generated/asm/rwonce.h:1,
from include/linux/compiler.h:372,
from include/linux/static_call_types.h:7,
from arch/x86/include/asm/bug.h:141,
from include/linux/bug.h:5,
from include/linux/mmdebug.h:5,
from include/linux/mm.h:7:
>> mm/vmscan.c:5785:55: error: 'struct lruvec' has no member named 'lrugen'
5785 | bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
| ^~
include/asm-generic/rwonce.h:44:73: note: in definition of macro '__READ_ONCE'
44 | #define __READ_ONCE(x) (*(const volatile __unqual_scalar_typeof(x) *)&(x))
| ^
arch/x86/include/asm/barrier.h:68:28: note: in expansion of macro 'READ_ONCE'
68 | typeof(*p) ___p1 = READ_ONCE(*p); \
| ^~~~~~~~~
include/asm-generic/barrier.h:176:29: note: in expansion of macro '__smp_load_acquire'
176 | #define smp_load_acquire(p) __smp_load_acquire(p)
| ^~~~~~~~~~~~~~~~~~
mm/vmscan.c:5785:31: note: in expansion of macro 'smp_load_acquire'
5785 | bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
| ^~~~~~~~~~~~~~~~
>> mm/vmscan.c:5785:55: error: 'struct lruvec' has no member named 'lrugen'
5785 | bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
| ^~
include/linux/compiler_types.h:686:23: note: in definition of macro '__compiletime_assert'
686 | if (!(condition)) \
| ^~~~~~~~~
include/linux/compiler_types.h:706:9: note: in expansion of macro '_compiletime_assert'
706 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^~~~~~~~~~~~~~~~~~~
include/linux/compiler_types.h:709:9: note: in expansion of macro 'compiletime_assert'
709 | compiletime_assert(__native_word(t), \
| ^~~~~~~~~~~~~~~~~~
include/linux/compiler_types.h:709:28: note: in expansion of macro '__native_word'
709 | compiletime_assert(__native_word(t), \
| ^~~~~~~~~~~~~
arch/x86/include/asm/barrier.h:69:9: note: in expansion of macro 'compiletime_assert_atomic_type'
69 | compiletime_assert_atomic_type(*p); \
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/asm-generic/barrier.h:176:29: note: in expansion of macro '__smp_load_acquire'
176 | #define smp_load_acquire(p) __smp_load_acquire(p)
| ^~~~~~~~~~~~~~~~~~
mm/vmscan.c:5785:31: note: in expansion of macro 'smp_load_acquire'
5785 | bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
| ^~~~~~~~~~~~~~~~
mm/vmscan.c:5786:53: error: 'struct lruvec' has no member named 'lrugen'
5786 | bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
| ^~
arch/x86/include/asm/barrier.h:68:17: note: in definition of macro '__smp_load_acquire'
68 | typeof(*p) ___p1 = READ_ONCE(*p); \
| ^
mm/vmscan.c:5786:29: note: in expansion of macro 'smp_load_acquire'
5786 | bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
| ^~~~~~~~~~~~~~~~
mm/vmscan.c:5786:53: error: 'struct lruvec' has no member named 'lrugen'
5786 | bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
| ^~
include/linux/compiler_types.h:686:23: note: in definition of macro '__compiletime_assert'
686 | if (!(condition)) \
| ^~~~~~~~~
include/linux/compiler_types.h:706:9: note: in expansion of macro '_compiletime_assert'
706 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^~~~~~~~~~~~~~~~~~~
include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
36 | compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
| ^~~~~~~~~~~~~~~~~~
include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
36 | compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
| ^~~~~~~~~~~~~
include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
49 | compiletime_assert_rwonce_type(x); \
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
arch/x86/include/asm/barrier.h:68:28: note: in expansion of macro 'READ_ONCE'
68 | typeof(*p) ___p1 = READ_ONCE(*p); \
| ^~~~~~~~~
include/asm-generic/barrier.h:176:29: note: in expansion of macro '__smp_load_acquire'
176 | #define smp_load_acquire(p) __smp_load_acquire(p)
| ^~~~~~~~~~~~~~~~~~
mm/vmscan.c:5786:29: note: in expansion of macro 'smp_load_acquire'
5786 | bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
| ^~~~~~~~~~~~~~~~
..
vim +5785 mm/vmscan.c
5774
5775 static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
5776 {
5777 unsigned long nr[NR_LRU_LISTS];
5778 unsigned long targets[NR_LRU_LISTS];
5779 unsigned long nr_to_scan;
5780 enum lru_list lru;
5781 unsigned long nr_reclaimed = 0;
5782 unsigned long nr_to_reclaim = sc->nr_to_reclaim;
5783 bool proportional_reclaim;
5784 struct blk_plug plug;
> 5785 bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
5786 bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
5787
> 5788 if (lrugen_enabled || lru_draining && !root_reclaim(sc)) {
5789 lru_gen_shrink_lruvec(lruvec, sc);
5790
5791 if (!lru_draining)
5792 return;
5793
5794 }
5795
5796 get_scan_count(lruvec, sc, nr);
5797
5798 /* Record the original scan target for proportional adjustments later */
5799 memcpy(targets, nr, sizeof(nr));
5800
5801 /*
5802 * Global reclaiming within direct reclaim at DEF_PRIORITY is a normal
5803 * event that can occur when there is little memory pressure e.g.
5804 * multiple streaming readers/writers. Hence, we do not abort scanning
5805 * when the requested number of pages are reclaimed when scanning at
5806 * DEF_PRIORITY on the assumption that the fact we are direct
5807 * reclaiming implies that kswapd is not keeping up and it is best to
5808 * do a batch of work at once. For memcg reclaim one check is made to
5809 * abort proportional reclaim if either the file or anon lru has already
5810 * dropped to zero at the first pass.
5811 */
5812 proportional_reclaim = (!cgroup_reclaim(sc) && !current_is_kswapd() &&
5813 sc->priority == DEF_PRIORITY);
5814
5815 blk_start_plug(&plug);
5816 while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
5817 nr[LRU_INACTIVE_FILE]) {
5818 unsigned long nr_anon, nr_file, percentage;
5819 unsigned long nr_scanned;
5820
5821 for_each_evictable_lru(lru) {
5822 if (nr[lru]) {
5823 nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
5824 nr[lru] -= nr_to_scan;
5825
5826 nr_reclaimed += shrink_list(lru, nr_to_scan,
5827 lruvec, sc);
5828 }
5829 }
5830
5831 cond_resched();
5832
5833 if (nr_reclaimed < nr_to_reclaim || proportional_reclaim)
5834 continue;
5835
5836 /*
5837 * For kswapd and memcg, reclaim at least the number of pages
5838 * requested. Ensure that the anon and file LRUs are scanned
5839 * proportionally what was requested by get_scan_count(). We
5840 * stop reclaiming one LRU and reduce the amount scanning
5841 * proportional to the original scan target.
5842 */
5843 nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE];
5844 nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON];
5845
5846 /*
5847 * It's just vindictive to attack the larger once the smaller
5848 * has gone to zero. And given the way we stop scanning the
5849 * smaller below, this makes sure that we only make one nudge
5850 * towards proportionality once we've got nr_to_reclaim.
5851 */
5852 if (!nr_file || !nr_anon)
5853 break;
5854
5855 if (nr_file > nr_anon) {
5856 unsigned long scan_target = targets[LRU_INACTIVE_ANON] +
5857 targets[LRU_ACTIVE_ANON] + 1;
5858 lru = LRU_BASE;
5859 percentage = nr_anon * 100 / scan_target;
5860 } else {
5861 unsigned long scan_target = targets[LRU_INACTIVE_FILE] +
5862 targets[LRU_ACTIVE_FILE] + 1;
5863 lru = LRU_FILE;
5864 percentage = nr_file * 100 / scan_target;
5865 }
5866
5867 /* Stop scanning the smaller of the LRU */
5868 nr[lru] = 0;
5869 nr[lru + LRU_ACTIVE] = 0;
5870
5871 /*
5872 * Recalculate the other LRU scan count based on its original
5873 * scan target and the percentage scanning already complete
5874 */
5875 lru = (lru == LRU_FILE) ? LRU_BASE : LRU_FILE;
5876 nr_scanned = targets[lru] - nr[lru];
5877 nr[lru] = targets[lru] * (100 - percentage) / 100;
5878 nr[lru] -= min(nr[lru], nr_scanned);
5879
5880 lru += LRU_ACTIVE;
5881 nr_scanned = targets[lru] - nr[lru];
5882 nr[lru] = targets[lru] * (100 - percentage) / 100;
5883 nr[lru] -= min(nr[lru], nr_scanned);
5884 }
5885 blk_finish_plug(&plug);
5886 sc->nr_reclaimed += nr_reclaimed;
5887
5888 /*
5889 * Even if we did not try to evict anon pages at all, we want to
5890 * rebalance the anon lru active/inactive ratio.
5891 */
5892 if (can_age_anon_pages(lruvec, sc) &&
5893 inactive_is_low(lruvec, LRU_INACTIVE_ANON))
5894 shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
5895 sc, LRU_ACTIVE_ANON);
5896 }
5897
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
2026-02-28 16:10 [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching Leno Hou
2026-02-28 18:58 ` Andrew Morton
2026-02-28 19:12 ` kernel test robot
@ 2026-02-28 19:23 ` kernel test robot
2026-02-28 20:15 ` kernel test robot
2026-02-28 21:28 ` Barry Song
4 siblings, 0 replies; 22+ messages in thread
From: kernel test robot @ 2026-02-28 19:23 UTC (permalink / raw)
To: Leno Hou, linux-mm, linux-kernel
Cc: llvm, oe-kbuild-all, Leno Hou, Andrew Morton,
Linux Memory Management List, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Barry Song, Jialing Wang, Yafang Shao, Yu Zhao
Hi Leno,
kernel test robot noticed the following build warnings:
[auto build test WARNING on v7.0-rc1]
[also build test WARNING on linus/master next-20260227]
[cannot apply to akpm-mm/mm-everything]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Leno-Hou/mm-mglru-fix-cgroup-OOM-during-MGLRU-state-switching/20260301-001148
base: v7.0-rc1
patch link: https://lore.kernel.org/r/20260228161008.707-1-lenohou%40gmail.com
patch subject: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
config: um-defconfig (https://download.01.org/0day-ci/archive/20260301/202603010300.t6GYRWjK-lkp@intel.com/config)
compiler: clang version 23.0.0git (https://github.com/llvm/llvm-project 9a109fbb6e184ec9bcce10615949f598f4c974a9)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260301/202603010300.t6GYRWjK-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603010300.t6GYRWjK-lkp@intel.com/
All warnings (new ones prefixed by >>):
mm/vmscan.c:5785:50: error: no member named 'lrugen' in 'struct lruvec'
5785 | bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
| ~~~~~~ ^
mm/vmscan.c:5786:48: error: no member named 'lrugen' in 'struct lruvec'
5786 | bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
| ~~~~~~ ^
>> mm/vmscan.c:5788:37: warning: '&&' within '||' [-Wlogical-op-parentheses]
5788 | if (lrugen_enabled || lru_draining && !root_reclaim(sc)) {
| ~~ ~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~
mm/vmscan.c:5788:37: note: place parentheses around the '&&' expression to silence this warning
5788 | if (lrugen_enabled || lru_draining && !root_reclaim(sc)) {
| ^
| ( )
1 warning and 18 errors generated.
vim +5788 mm/vmscan.c
5774
5775 static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
5776 {
5777 unsigned long nr[NR_LRU_LISTS];
5778 unsigned long targets[NR_LRU_LISTS];
5779 unsigned long nr_to_scan;
5780 enum lru_list lru;
5781 unsigned long nr_reclaimed = 0;
5782 unsigned long nr_to_reclaim = sc->nr_to_reclaim;
5783 bool proportional_reclaim;
5784 struct blk_plug plug;
5785 bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
5786 bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
5787
> 5788 if (lrugen_enabled || lru_draining && !root_reclaim(sc)) {
5789 lru_gen_shrink_lruvec(lruvec, sc);
5790
5791 if (!lru_draining)
5792 return;
5793
5794 }
5795
5796 get_scan_count(lruvec, sc, nr);
5797
5798 /* Record the original scan target for proportional adjustments later */
5799 memcpy(targets, nr, sizeof(nr));
5800
5801 /*
5802 * Global reclaiming within direct reclaim at DEF_PRIORITY is a normal
5803 * event that can occur when there is little memory pressure e.g.
5804 * multiple streaming readers/writers. Hence, we do not abort scanning
5805 * when the requested number of pages are reclaimed when scanning at
5806 * DEF_PRIORITY on the assumption that the fact we are direct
5807 * reclaiming implies that kswapd is not keeping up and it is best to
5808 * do a batch of work at once. For memcg reclaim one check is made to
5809 * abort proportional reclaim if either the file or anon lru has already
5810 * dropped to zero at the first pass.
5811 */
5812 proportional_reclaim = (!cgroup_reclaim(sc) && !current_is_kswapd() &&
5813 sc->priority == DEF_PRIORITY);
5814
5815 blk_start_plug(&plug);
5816 while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
5817 nr[LRU_INACTIVE_FILE]) {
5818 unsigned long nr_anon, nr_file, percentage;
5819 unsigned long nr_scanned;
5820
5821 for_each_evictable_lru(lru) {
5822 if (nr[lru]) {
5823 nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
5824 nr[lru] -= nr_to_scan;
5825
5826 nr_reclaimed += shrink_list(lru, nr_to_scan,
5827 lruvec, sc);
5828 }
5829 }
5830
5831 cond_resched();
5832
5833 if (nr_reclaimed < nr_to_reclaim || proportional_reclaim)
5834 continue;
5835
5836 /*
5837 * For kswapd and memcg, reclaim at least the number of pages
5838 * requested. Ensure that the anon and file LRUs are scanned
5839 * proportionally what was requested by get_scan_count(). We
5840 * stop reclaiming one LRU and reduce the amount scanning
5841 * proportional to the original scan target.
5842 */
5843 nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE];
5844 nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON];
5845
5846 /*
5847 * It's just vindictive to attack the larger once the smaller
5848 * has gone to zero. And given the way we stop scanning the
5849 * smaller below, this makes sure that we only make one nudge
5850 * towards proportionality once we've got nr_to_reclaim.
5851 */
5852 if (!nr_file || !nr_anon)
5853 break;
5854
5855 if (nr_file > nr_anon) {
5856 unsigned long scan_target = targets[LRU_INACTIVE_ANON] +
5857 targets[LRU_ACTIVE_ANON] + 1;
5858 lru = LRU_BASE;
5859 percentage = nr_anon * 100 / scan_target;
5860 } else {
5861 unsigned long scan_target = targets[LRU_INACTIVE_FILE] +
5862 targets[LRU_ACTIVE_FILE] + 1;
5863 lru = LRU_FILE;
5864 percentage = nr_file * 100 / scan_target;
5865 }
5866
5867 /* Stop scanning the smaller of the LRU */
5868 nr[lru] = 0;
5869 nr[lru + LRU_ACTIVE] = 0;
5870
5871 /*
5872 * Recalculate the other LRU scan count based on its original
5873 * scan target and the percentage scanning already complete
5874 */
5875 lru = (lru == LRU_FILE) ? LRU_BASE : LRU_FILE;
5876 nr_scanned = targets[lru] - nr[lru];
5877 nr[lru] = targets[lru] * (100 - percentage) / 100;
5878 nr[lru] -= min(nr[lru], nr_scanned);
5879
5880 lru += LRU_ACTIVE;
5881 nr_scanned = targets[lru] - nr[lru];
5882 nr[lru] = targets[lru] * (100 - percentage) / 100;
5883 nr[lru] -= min(nr[lru], nr_scanned);
5884 }
5885 blk_finish_plug(&plug);
5886 sc->nr_reclaimed += nr_reclaimed;
5887
5888 /*
5889 * Even if we did not try to evict anon pages at all, we want to
5890 * rebalance the anon lru active/inactive ratio.
5891 */
5892 if (can_age_anon_pages(lruvec, sc) &&
5893 inactive_is_low(lruvec, LRU_INACTIVE_ANON))
5894 shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
5895 sc, LRU_ACTIVE_ANON);
5896 }
5897
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
2026-02-28 16:10 [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching Leno Hou
` (2 preceding siblings ...)
2026-02-28 19:23 ` kernel test robot
@ 2026-02-28 20:15 ` kernel test robot
2026-02-28 21:28 ` Barry Song
4 siblings, 0 replies; 22+ messages in thread
From: kernel test robot @ 2026-02-28 20:15 UTC (permalink / raw)
To: Leno Hou, linux-mm, linux-kernel
Cc: llvm, oe-kbuild-all, Leno Hou, Andrew Morton,
Linux Memory Management List, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Barry Song, Jialing Wang, Yafang Shao, Yu Zhao
Hi Leno,
kernel test robot noticed the following build errors:
[auto build test ERROR on v7.0-rc1]
[also build test ERROR on linus/master next-20260227]
[cannot apply to akpm-mm/mm-everything]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Leno-Hou/mm-mglru-fix-cgroup-OOM-during-MGLRU-state-switching/20260301-001148
base: v7.0-rc1
patch link: https://lore.kernel.org/r/20260228161008.707-1-lenohou%40gmail.com
patch subject: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
config: um-defconfig (https://download.01.org/0day-ci/archive/20260301/202603010435.MBtvBCTp-lkp@intel.com/config)
compiler: clang version 23.0.0git (https://github.com/llvm/llvm-project 9a109fbb6e184ec9bcce10615949f598f4c974a9)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260301/202603010435.MBtvBCTp-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603010435.MBtvBCTp-lkp@intel.com/
All errors (new ones prefixed by >>):
>> mm/vmscan.c:5785:50: error: no member named 'lrugen' in 'struct lruvec'
5785 | bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
| ~~~~~~ ^
mm/vmscan.c:5786:48: error: no member named 'lrugen' in 'struct lruvec'
5786 | bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
| ~~~~~~ ^
mm/vmscan.c:5788:37: warning: '&&' within '||' [-Wlogical-op-parentheses]
5788 | if (lrugen_enabled || lru_draining && !root_reclaim(sc)) {
| ~~ ~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~
mm/vmscan.c:5788:37: note: place parentheses around the '&&' expression to silence this warning
5788 | if (lrugen_enabled || lru_draining && !root_reclaim(sc)) {
| ^
| ( )
1 warning and 18 errors generated.
vim +5785 mm/vmscan.c
5774
5775 static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
5776 {
5777 unsigned long nr[NR_LRU_LISTS];
5778 unsigned long targets[NR_LRU_LISTS];
5779 unsigned long nr_to_scan;
5780 enum lru_list lru;
5781 unsigned long nr_reclaimed = 0;
5782 unsigned long nr_to_reclaim = sc->nr_to_reclaim;
5783 bool proportional_reclaim;
5784 struct blk_plug plug;
> 5785 bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
5786 bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
5787
5788 if (lrugen_enabled || lru_draining && !root_reclaim(sc)) {
5789 lru_gen_shrink_lruvec(lruvec, sc);
5790
5791 if (!lru_draining)
5792 return;
5793
5794 }
5795
5796 get_scan_count(lruvec, sc, nr);
5797
5798 /* Record the original scan target for proportional adjustments later */
5799 memcpy(targets, nr, sizeof(nr));
5800
5801 /*
5802 * Global reclaiming within direct reclaim at DEF_PRIORITY is a normal
5803 * event that can occur when there is little memory pressure e.g.
5804 * multiple streaming readers/writers. Hence, we do not abort scanning
5805 * when the requested number of pages are reclaimed when scanning at
5806 * DEF_PRIORITY on the assumption that the fact we are direct
5807 * reclaiming implies that kswapd is not keeping up and it is best to
5808 * do a batch of work at once. For memcg reclaim one check is made to
5809 * abort proportional reclaim if either the file or anon lru has already
5810 * dropped to zero at the first pass.
5811 */
5812 proportional_reclaim = (!cgroup_reclaim(sc) && !current_is_kswapd() &&
5813 sc->priority == DEF_PRIORITY);
5814
5815 blk_start_plug(&plug);
5816 while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
5817 nr[LRU_INACTIVE_FILE]) {
5818 unsigned long nr_anon, nr_file, percentage;
5819 unsigned long nr_scanned;
5820
5821 for_each_evictable_lru(lru) {
5822 if (nr[lru]) {
5823 nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
5824 nr[lru] -= nr_to_scan;
5825
5826 nr_reclaimed += shrink_list(lru, nr_to_scan,
5827 lruvec, sc);
5828 }
5829 }
5830
5831 cond_resched();
5832
5833 if (nr_reclaimed < nr_to_reclaim || proportional_reclaim)
5834 continue;
5835
5836 /*
5837 * For kswapd and memcg, reclaim at least the number of pages
5838 * requested. Ensure that the anon and file LRUs are scanned
5839 * proportionally what was requested by get_scan_count(). We
5840 * stop reclaiming one LRU and reduce the amount scanning
5841 * proportional to the original scan target.
5842 */
5843 nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE];
5844 nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON];
5845
5846 /*
5847 * It's just vindictive to attack the larger once the smaller
5848 * has gone to zero. And given the way we stop scanning the
5849 * smaller below, this makes sure that we only make one nudge
5850 * towards proportionality once we've got nr_to_reclaim.
5851 */
5852 if (!nr_file || !nr_anon)
5853 break;
5854
5855 if (nr_file > nr_anon) {
5856 unsigned long scan_target = targets[LRU_INACTIVE_ANON] +
5857 targets[LRU_ACTIVE_ANON] + 1;
5858 lru = LRU_BASE;
5859 percentage = nr_anon * 100 / scan_target;
5860 } else {
5861 unsigned long scan_target = targets[LRU_INACTIVE_FILE] +
5862 targets[LRU_ACTIVE_FILE] + 1;
5863 lru = LRU_FILE;
5864 percentage = nr_file * 100 / scan_target;
5865 }
5866
5867 /* Stop scanning the smaller of the LRU */
5868 nr[lru] = 0;
5869 nr[lru + LRU_ACTIVE] = 0;
5870
5871 /*
5872 * Recalculate the other LRU scan count based on its original
5873 * scan target and the percentage scanning already complete
5874 */
5875 lru = (lru == LRU_FILE) ? LRU_BASE : LRU_FILE;
5876 nr_scanned = targets[lru] - nr[lru];
5877 nr[lru] = targets[lru] * (100 - percentage) / 100;
5878 nr[lru] -= min(nr[lru], nr_scanned);
5879
5880 lru += LRU_ACTIVE;
5881 nr_scanned = targets[lru] - nr[lru];
5882 nr[lru] = targets[lru] * (100 - percentage) / 100;
5883 nr[lru] -= min(nr[lru], nr_scanned);
5884 }
5885 blk_finish_plug(&plug);
5886 sc->nr_reclaimed += nr_reclaimed;
5887
5888 /*
5889 * Even if we did not try to evict anon pages at all, we want to
5890 * rebalance the anon lru active/inactive ratio.
5891 */
5892 if (can_age_anon_pages(lruvec, sc) &&
5893 inactive_is_low(lruvec, LRU_INACTIVE_ANON))
5894 shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
5895 sc, LRU_ACTIVE_ANON);
5896 }
5897
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
2026-02-28 16:10 [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching Leno Hou
` (3 preceding siblings ...)
2026-02-28 20:15 ` kernel test robot
@ 2026-02-28 21:28 ` Barry Song
2026-02-28 22:41 ` Barry Song
2026-03-02 5:50 ` Yafang Shao
4 siblings, 2 replies; 22+ messages in thread
From: Barry Song @ 2026-02-28 21:28 UTC (permalink / raw)
To: lenohou
Cc: 21cnbao, akpm, axelrasmussen, laoar.shao, linux-kernel, linux-mm,
weixugc, wjl.linux, yuanchu, yuzhao
On Sun, Mar 1, 2026 at 12:10 AM Leno Hou <lenohou@gmail.com> wrote:
>
> When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
> condition exists between the state switching and the memory reclaim
> path. This can lead to unexpected cgroup OOM kills, even when plenty of
> reclaimable memory is available.
>
> *** Problem Description ***
>
> The issue arises from a "reclaim vacuum" during the transition:
>
> 1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
> false before the pages are drained from MGLRU lists back to
> traditional LRU lists.
> 2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
> and skip the MGLRU path.
> 3. However, these pages might not have reached the traditional LRU lists
> yet, or the changes are not yet visible to all CPUs due to a lack of
> synchronization.
> 4. get_scan_count() subsequently finds traditional LRU lists empty,
> concludes there is no reclaimable memory, and triggers an OOM kill.
>
> A similar race can occur during enablement, where the reclaimer sees
> the new state but the MGLRU lists haven't been populated via
> fill_evictable() yet.
>
> *** Solution ***
>
> Introduce a 'draining' state to bridge the gap during transitions:
>
> - Use smp_store_release() and smp_load_acquire() to ensure the visibility
> of 'enabled' and 'draining' flags across CPUs.
> - Modify shrink_lruvec() to allow a "joint reclaim" period. If an lruvec
> is in the 'draining' state, the reclaimer will attempt to scan MGLRU
> lists first, and then fall through to traditional LRU lists instead
> of returning early. This ensures that folios are visible to at least
> one reclaim path at any given time.
>
> *** Reproduction ***
>
> The issue was consistently reproduced on v6.1.157 and v6.18.3 using
> a high-pressure memory cgroup (v1) environment.
>
> Reproduction steps:
> 1. Create a 16GB memcg and populate it with 10GB file cache (5GB active)
> and 8GB active anonymous memory.
> 2. Toggle MGLRU state while performing new memory allocations to force
> direct reclaim.
>
> Reproduction script:
> ---
> #!/bin/bash
> # Fixed reproduction for memcg OOM during MGLRU toggle
> set -euo pipefail
>
> MGLRU_FILE="/sys/kernel/mm/lru_gen/enabled"
> CGROUP_PATH="/sys/fs/cgroup/memory/memcg_oom_test"
>
> # Switch MGLRU dynamically in the background
> switch_mglru() {
> local orig_val=$(cat "$MGLRU_FILE")
> if [[ "$orig_val" != "0x0000" ]]; then
> echo n > "$MGLRU_FILE" &
> else
> echo y > "$MGLRU_FILE" &
> fi
> }
>
> # Setup 16G memcg
> mkdir -p "$CGROUP_PATH"
> echo $((16 * 1024 * 1024 * 1024)) > "$CGROUP_PATH/memory.limit_in_bytes"
> echo $$ > "$CGROUP_PATH/cgroup.procs"
>
> # 1. Build memory pressure (File + Anon)
> dd if=/dev/urandom of=/tmp/test_file bs=1M count=10240
> dd if=/tmp/test_file of=/dev/null bs=1M # Warm up cache
>
> stress-ng --vm 1 --vm-bytes 8G --vm-keep -t 600 &
> sleep 5
>
> # 2. Trigger switch and concurrent allocation
> switch_mglru
> stress-ng --vm 1 --vm-bytes 2G --vm-populate --timeout 5s || echo "OOM Triggered"
>
> # Check OOM counter
> grep oom_kill "$CGROUP_PATH/memory.oom_control"
> ---
>
> Signed-off-by: Leno Hou <lenohou@gmail.com>
>
> ---
> To: linux-mm@kvack.org
> To: linux-kernel@vger.kernel.org
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Yuanchu Xie <yuanchu@google.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: Barry Song <21cnbao@gmail.com>
> Cc: Jialing Wang <wjl.linux@gmail.com>
> Cc: Yafang Shao <laoar.shao@gmail.com>
> Cc: Yu Zhao <yuzhao@google.com>
> ---
> include/linux/mmzone.h | 2 ++
> mm/vmscan.c | 14 +++++++++++---
> 2 files changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 7fb7331c5725..0648ce91dbc6 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -509,6 +509,8 @@ struct lru_gen_folio {
> atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
> /* whether the multi-gen LRU is enabled */
> bool enabled;
> + /* whether the multi-gen LRU is draining to LRU */
> + bool draining;
> /* the memcg generation this lru_gen_folio belongs to */
> u8 gen;
> /* the list segment this lru_gen_folio belongs to */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 06071995dacc..629a00681163 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -5222,7 +5222,8 @@ static void lru_gen_change_state(bool enabled)
> VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
> VM_WARN_ON_ONCE(!state_is_valid(lruvec));
>
> - lruvec->lrugen.enabled = enabled;
> + smp_store_release(&lruvec->lrugen.enabled, enabled);
> + smp_store_release(&lruvec->lrugen.draining, true);
>
> while (!(enabled ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
> spin_unlock_irq(&lruvec->lru_lock);
> @@ -5230,6 +5231,8 @@ static void lru_gen_change_state(bool enabled)
> spin_lock_irq(&lruvec->lru_lock);
> }
>
> + smp_store_release(&lruvec->lrugen.draining, false);
> +
> spin_unlock_irq(&lruvec->lru_lock);
> }
>
> @@ -5813,10 +5816,15 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> bool proportional_reclaim;
> struct blk_plug plug;
> + bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> + bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
>
> - if (lru_gen_enabled() && !root_reclaim(sc)) {
> + if (lrugen_enabled || lru_draining && !root_reclaim(sc)) {
> lru_gen_shrink_lruvec(lruvec, sc);
> - return;
Is it possible to simply wait for draining to finish instead of performing
an lru_gen/lru shrink while lru_gen is being disabled or enabled?
Performing a shrink in an intermediate state may still involve a lot of
uncertainty, depending on how far the transition has progressed and how
much remains on each side's LRU.
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3e51190a55e4..ba306e986050 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -509,6 +509,8 @@ struct lru_gen_folio {
atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
/* whether the multi-gen LRU is enabled */
bool enabled;
+ /* whether the multi-gen LRU is switching from/to active/inactive LRU */
+ bool switching;
/* the memcg generation this lru_gen_folio belongs to */
u8 gen;
/* the list segment this lru_gen_folio belongs to */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0fc9373e8251..60fc611067c7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5196,6 +5196,7 @@ static void lru_gen_change_state(bool enabled)
VM_WARN_ON_ONCE(!state_is_valid(lruvec));
lruvec->lrugen.enabled = enabled;
+ smp_store_release(&lruvec->lrugen.switching, true);
while (!(enabled ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
spin_unlock_irq(&lruvec->lru_lock);
@@ -5203,6 +5204,8 @@ static void lru_gen_change_state(bool enabled)
spin_lock_irq(&lruvec->lru_lock);
}
+ smp_store_release(&lruvec->lrugen.switching, false);
+
spin_unlock_irq(&lruvec->lru_lock);
}
@@ -5780,6 +5783,10 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
bool proportional_reclaim;
struct blk_plug plug;
+#ifdef CONFIG_LRU_GEN
+ while (smp_load_acquire(&lruvec->lrugen.switching))
+ schedule_timeout_uninterruptible(HZ/100);
+#endif
if (lru_gen_enabled() && !root_reclaim(sc)) {
lru_gen_shrink_lruvec(lruvec, sc);
return;
--
> +
> + if (!lru_draining)
> + return;
> +
> }
>
> get_scan_count(lruvec, sc, nr);
> --
> 2.52.0
>
Thanks
Barry
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
2026-02-28 21:28 ` Barry Song
@ 2026-02-28 22:41 ` Barry Song
2026-03-01 4:10 ` Barry Song
2026-03-02 5:50 ` Yafang Shao
1 sibling, 1 reply; 22+ messages in thread
From: Barry Song @ 2026-02-28 22:41 UTC (permalink / raw)
To: lenohou
Cc: akpm, axelrasmussen, laoar.shao, linux-kernel, linux-mm, weixugc,
wjl.linux, yuanchu, yuzhao
On Sun, Mar 1, 2026 at 5:28 AM Barry Song <21cnbao@gmail.com> wrote:
[...]
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 3e51190a55e4..ba306e986050 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -509,6 +509,8 @@ struct lru_gen_folio {
> atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
> /* whether the multi-gen LRU is enabled */
> bool enabled;
> + /* whether the multi-gen LRU is switching from/to active/inactive LRU */
> + bool switching;
> /* the memcg generation this lru_gen_folio belongs to */
> u8 gen;
> /* the list segment this lru_gen_folio belongs to */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0fc9373e8251..60fc611067c7 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -5196,6 +5196,7 @@ static void lru_gen_change_state(bool enabled)
> VM_WARN_ON_ONCE(!state_is_valid(lruvec));
>
> lruvec->lrugen.enabled = enabled;
> + smp_store_release(&lruvec->lrugen.switching, true);
Sorry, I actually meant:
+ smp_store_release(&lruvec->lrugen.switching, true);
lruvec->lrugen.enabled = enabled;
But I guess we could still hit a race condition in extreme cases—switching
MGLRU on or off as frequently as possible. The only reliable way is to check
enabled during shrinking while holding the lruvec’s lock.
Thanks
Barry
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
2026-02-28 22:41 ` Barry Song
@ 2026-03-01 4:10 ` Barry Song
0 siblings, 0 replies; 22+ messages in thread
From: Barry Song @ 2026-03-01 4:10 UTC (permalink / raw)
To: lenohou
Cc: 21cnbao, akpm, axelrasmussen, laoar.shao, linux-kernel, linux-mm,
weixugc, wjl.linux, yuanchu, yuzhao
On Sun, Mar 1, 2026 at 6:41 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sun, Mar 1, 2026 at 5:28 AM Barry Song <21cnbao@gmail.com> wrote:
> [...]
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 3e51190a55e4..ba306e986050 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -509,6 +509,8 @@ struct lru_gen_folio {
> > atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
> > /* whether the multi-gen LRU is enabled */
> > bool enabled;
> > + /* whether the multi-gen LRU is switching from/to active/inactive LRU */
> > + bool switching;
> > /* the memcg generation this lru_gen_folio belongs to */
> > u8 gen;
> > /* the list segment this lru_gen_folio belongs to */
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 0fc9373e8251..60fc611067c7 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -5196,6 +5196,7 @@ static void lru_gen_change_state(bool enabled)
> > VM_WARN_ON_ONCE(!state_is_valid(lruvec));
> >
> > lruvec->lrugen.enabled = enabled;
> > + smp_store_release(&lruvec->lrugen.switching, true);
>
> Sorry, I actually meant:
>
> + smp_store_release(&lruvec->lrugen.switching, true);
> lruvec->lrugen.enabled = enabled;
>
> But I guess we could still hit a race condition in extreme cases—switching
> MGLRU on or off as frequently as possible. The only reliable way is to check
> enabled during shrinking while holding the lruvec’s lock.
Sorry, I was talking to myself.... Since the switching and the 'enabled'
state are not inherently serialized with shrink_lruvec(), their values
can change at any time, leading to race conditions.
Therefore, I believe the only safe approach is:
1. Do not allow enabling or disabling MGLRU on an lruvec while
shrink_lruvec() is running.
2. Do not allow shrink_lruvec() to run while MGLRU is being enabled
or disabled on that lruvec.
Something like the following:
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3e51190a55e4..c4b07159577e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -509,6 +509,7 @@ struct lru_gen_folio {
atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
/* whether the multi-gen LRU is enabled */
bool enabled;
+ struct rw_semaphore switch_lock;
/* the memcg generation this lru_gen_folio belongs to */
u8 gen;
/* the list segment this lru_gen_folio belongs to */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0fc9373e8251..aadf1e7c31cf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5190,6 +5190,7 @@ static void lru_gen_change_state(bool enabled)
for_each_node(nid) {
struct lruvec *lruvec = get_lruvec(memcg, nid);
+ down_write(&lruvec->lrugen.switch_lock);
spin_lock_irq(&lruvec->lru_lock);
VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
@@ -5204,6 +5205,7 @@ static void lru_gen_change_state(bool enabled)
}
spin_unlock_irq(&lruvec->lru_lock);
+ up_write(&lruvec->lrugen.switch_lock);
}
cond_resched();
@@ -5680,6 +5682,7 @@ void lru_gen_init_lruvec(struct lruvec *lruvec)
lrugen->max_seq = MIN_NR_GENS + 1;
lrugen->enabled = lru_gen_enabled();
+ init_rwsem(&lrugen->switch_lock);
for (i = 0; i <= MIN_NR_GENS + 1; i++)
lrugen->timestamps[i] = jiffies;
@@ -5780,10 +5783,14 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
bool proportional_reclaim;
struct blk_plug plug;
- if (lru_gen_enabled() && !root_reclaim(sc)) {
+#ifdef CONFIG_LRU_GEN
+ down_read(&lruvec->lrugen.switch_lock);
+ if (lruvec->lrugen.enabled && !root_reclaim(sc)) {
lru_gen_shrink_lruvec(lruvec, sc);
+ up_read(&lruvec->lrugen.switch_lock);
return;
}
+#endif
get_scan_count(lruvec, sc, nr);
@@ -5885,6 +5892,9 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
inactive_is_low(lruvec, LRU_INACTIVE_ANON))
shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
sc, LRU_ACTIVE_ANON);
+#ifdef CONFIG_LRU_GEN
+ up_read(&lruvec->lrugen.switch_lock);
+#endif
}
/* Use reclaim/compaction for costly allocs or under memory pressure */
--
Thanks
Barry
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
2026-02-28 21:28 ` Barry Song
2026-02-28 22:41 ` Barry Song
@ 2026-03-02 5:50 ` Yafang Shao
2026-03-02 6:58 ` Barry Song
1 sibling, 1 reply; 22+ messages in thread
From: Yafang Shao @ 2026-03-02 5:50 UTC (permalink / raw)
To: Barry Song
Cc: lenohou, akpm, axelrasmussen, linux-kernel, linux-mm, weixugc,
wjl.linux, yuanchu, yuzhao
On Sun, Mar 1, 2026 at 5:28 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sun, Mar 1, 2026 at 12:10 AM Leno Hou <lenohou@gmail.com> wrote:
> >
> > When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
> > condition exists between the state switching and the memory reclaim
> > path. This can lead to unexpected cgroup OOM kills, even when plenty of
> > reclaimable memory is available.
> >
> > *** Problem Description ***
> >
> > The issue arises from a "reclaim vacuum" during the transition:
> >
> > 1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
> > false before the pages are drained from MGLRU lists back to
> > traditional LRU lists.
> > 2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
> > and skip the MGLRU path.
> > 3. However, these pages might not have reached the traditional LRU lists
> > yet, or the changes are not yet visible to all CPUs due to a lack of
> > synchronization.
> > 4. get_scan_count() subsequently finds traditional LRU lists empty,
> > concludes there is no reclaimable memory, and triggers an OOM kill.
> >
> > A similar race can occur during enablement, where the reclaimer sees
> > the new state but the MGLRU lists haven't been populated via
> > fill_evictable() yet.
> >
> > *** Solution ***
> >
> > Introduce a 'draining' state to bridge the gap during transitions:
> >
> > - Use smp_store_release() and smp_load_acquire() to ensure the visibility
> > of 'enabled' and 'draining' flags across CPUs.
> > - Modify shrink_lruvec() to allow a "joint reclaim" period. If an lruvec
> > is in the 'draining' state, the reclaimer will attempt to scan MGLRU
> > lists first, and then fall through to traditional LRU lists instead
> > of returning early. This ensures that folios are visible to at least
> > one reclaim path at any given time.
> >
> > *** Reproduction ***
> >
> > The issue was consistently reproduced on v6.1.157 and v6.18.3 using
> > a high-pressure memory cgroup (v1) environment.
> >
> > Reproduction steps:
> > 1. Create a 16GB memcg and populate it with 10GB file cache (5GB active)
> > and 8GB active anonymous memory.
> > 2. Toggle MGLRU state while performing new memory allocations to force
> > direct reclaim.
> >
> > Reproduction script:
> > ---
> > #!/bin/bash
> > # Fixed reproduction for memcg OOM during MGLRU toggle
> > set -euo pipefail
> >
> > MGLRU_FILE="/sys/kernel/mm/lru_gen/enabled"
> > CGROUP_PATH="/sys/fs/cgroup/memory/memcg_oom_test"
> >
> > # Switch MGLRU dynamically in the background
> > switch_mglru() {
> > local orig_val=$(cat "$MGLRU_FILE")
> > if [[ "$orig_val" != "0x0000" ]]; then
> > echo n > "$MGLRU_FILE" &
> > else
> > echo y > "$MGLRU_FILE" &
> > fi
> > }
> >
> > # Setup 16G memcg
> > mkdir -p "$CGROUP_PATH"
> > echo $((16 * 1024 * 1024 * 1024)) > "$CGROUP_PATH/memory.limit_in_bytes"
> > echo $$ > "$CGROUP_PATH/cgroup.procs"
> >
> > # 1. Build memory pressure (File + Anon)
> > dd if=/dev/urandom of=/tmp/test_file bs=1M count=10240
> > dd if=/tmp/test_file of=/dev/null bs=1M # Warm up cache
> >
> > stress-ng --vm 1 --vm-bytes 8G --vm-keep -t 600 &
> > sleep 5
> >
> > # 2. Trigger switch and concurrent allocation
> > switch_mglru
> > stress-ng --vm 1 --vm-bytes 2G --vm-populate --timeout 5s || echo "OOM Triggered"
> >
> > # Check OOM counter
> > grep oom_kill "$CGROUP_PATH/memory.oom_control"
> > ---
> >
> > Signed-off-by: Leno Hou <lenohou@gmail.com>
> >
> > ---
> > To: linux-mm@kvack.org
> > To: linux-kernel@vger.kernel.org
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Axel Rasmussen <axelrasmussen@google.com>
> > Cc: Yuanchu Xie <yuanchu@google.com>
> > Cc: Wei Xu <weixugc@google.com>
> > Cc: Barry Song <21cnbao@gmail.com>
> > Cc: Jialing Wang <wjl.linux@gmail.com>
> > Cc: Yafang Shao <laoar.shao@gmail.com>
> > Cc: Yu Zhao <yuzhao@google.com>
> > ---
> > include/linux/mmzone.h | 2 ++
> > mm/vmscan.c | 14 +++++++++++---
> > 2 files changed, 13 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 7fb7331c5725..0648ce91dbc6 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -509,6 +509,8 @@ struct lru_gen_folio {
> > atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
> > /* whether the multi-gen LRU is enabled */
> > bool enabled;
> > + /* whether the multi-gen LRU is draining to LRU */
> > + bool draining;
> > /* the memcg generation this lru_gen_folio belongs to */
> > u8 gen;
> > /* the list segment this lru_gen_folio belongs to */
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 06071995dacc..629a00681163 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -5222,7 +5222,8 @@ static void lru_gen_change_state(bool enabled)
> > VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
> > VM_WARN_ON_ONCE(!state_is_valid(lruvec));
> >
> > - lruvec->lrugen.enabled = enabled;
> > + smp_store_release(&lruvec->lrugen.enabled, enabled);
> > + smp_store_release(&lruvec->lrugen.draining, true);
> >
> > while (!(enabled ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
> > spin_unlock_irq(&lruvec->lru_lock);
> > @@ -5230,6 +5231,8 @@ static void lru_gen_change_state(bool enabled)
> > spin_lock_irq(&lruvec->lru_lock);
> > }
> >
> > + smp_store_release(&lruvec->lrugen.draining, false);
> > +
> > spin_unlock_irq(&lruvec->lru_lock);
> > }
> >
> > @@ -5813,10 +5816,15 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> > unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> > bool proportional_reclaim;
> > struct blk_plug plug;
> > + bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> > + bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
> >
> > - if (lru_gen_enabled() && !root_reclaim(sc)) {
> > + if (lrugen_enabled || lru_draining && !root_reclaim(sc)) {
> > lru_gen_shrink_lruvec(lruvec, sc);
> > - return;
>
Hello Barry,
> Is it possible to simply wait for draining to finish instead of performing
> an lru_gen/lru shrink while lru_gen is being disabled or enabled?
This might introduce unexpected latency spikes during the waiting period.
>
> Performing a shrink in an intermediate state may still involve a lot of
> uncertainty, depending on how far the shrink has progressed and how much
> remains in each side’s LRU?
The workingset might not be reliable in this intermediate state.
However, since switching MGLRU should not be a frequent operation in a
production environment, I believe the workingset in this intermediate
state should not be a concern. The only reason we would enable or
disable MGLRU is if we find that certain workloads benefit from
it—enabling it when it helps, and disabling it when it causes
degradation. There should be no other scenario in which we would need
to toggle MGLRU on or off.
To identify which workloads can benefit from MGLRU, we must first
ensure that switching it on or off is safe—which is precisely why we
are proposing this patch. Once MGLRU is enabled in production, we can
continue to improve it. Perhaps in the future, we can even implement a
per-workload reclaim mechanism.
--
Regards
Yafang
* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
2026-03-02 5:50 ` Yafang Shao
@ 2026-03-02 6:58 ` Barry Song
2026-03-02 7:43 ` Yafang Shao
0 siblings, 1 reply; 22+ messages in thread
From: Barry Song @ 2026-03-02 6:58 UTC (permalink / raw)
To: Yafang Shao
Cc: lenohou, akpm, axelrasmussen, linux-kernel, linux-mm, weixugc,
wjl.linux, yuanchu, yuzhao
On Mon, Mar 2, 2026 at 1:50 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Sun, Mar 1, 2026 at 5:28 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sun, Mar 1, 2026 at 12:10 AM Leno Hou <lenohou@gmail.com> wrote:
> > >
> > > [...]
> >
>
> Hello Barry,
>
> > Is it possible to simply wait for draining to finish instead of performing
> > an lru_gen/lru shrink while lru_gen is being disabled or enabled?
>
> This might introduce unexpected latency spikes during the waiting period.
I assume latency is not a concern for a very rare
MGLRU on/off case. Do you require the switch to happen
with zero latency?
My main concern is the correctness of the code.
Now the proposed patch is:
+ bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
+ bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
Then choose MGLRU or active/inactive LRU based on
those values.
However, nothing prevents those values from changing
after they are read. Even within the shrink path,
they can still change.
So I think we need an rwsem or something similar here —
a read lock for shrink and a write lock for on/off. The
write lock should happen very rarely.
>
> >
> > Performing a shrink in an intermediate state may still involve a lot of
> > uncertainty, depending on how far the shrink has progressed and how much
> > remains in each side’s LRU?
>
> The workingset might not be reliable in this intermediate state.
> However, since switching MGLRU should not be a frequent operation in a
> production environment, I believe the workingset in this intermediate
> state should not be a concern. The only reason we would enable or
> disable MGLRU is if we find that certain workloads benefit from
> it—enabling it when it helps, and disabling it when it causes
> degradation. There should be no other scenario in which we would need
> to toggle MGLRU on or off.
>
> To identify which workloads can benefit from MGLRU, we must first
> ensure that switching it on or off is safe—which is precisely why we
> are proposing this patch. Once MGLRU is enabled in production, we can
> continue to improve it. Perhaps in the future, we can even implement a
> per-workload reclaim mechanism.
To be honest, the on/off toggle is quite odd. If possible,
I’d prefer not to switch MGLRU or active/inactive
dynamically. Once it’s set up during system boot, it
should remain unchanged.
If we want a per-workload LRU, this could be a good
place for eBPF to hook into folio enqueue, dequeue,
and scanning. There is a project related to this [1][2].
// Policy function hooks
struct cache_ext_ops {
s32 (*policy_init)(struct mem_cgroup *memcg);
// Propose folios to evict
void (*evict_folios)(struct eviction_ctx *ctx,
struct mem_cgroup *memcg);
void (*folio_added)(struct folio *folio);
void (*folio_accessed)(struct folio *folio);
// Folio was removed: clean up metadata
void (*folio_removed)(struct folio *folio);
char name[CACHE_EXT_OPS_NAME_LEN];
};
However, we would need a very strong and convincing
user case to justify it.
[1] https://dl.acm.org/doi/pdf/10.1145/3731569.3764820
[2] https://github.com/cache-ext/cache_ext
Thanks
Barry
* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
2026-03-02 6:58 ` Barry Song
@ 2026-03-02 7:43 ` Yafang Shao
2026-03-02 8:00 ` Kairui Song
2026-03-02 8:03 ` Barry Song
0 siblings, 2 replies; 22+ messages in thread
From: Yafang Shao @ 2026-03-02 7:43 UTC (permalink / raw)
To: Barry Song
Cc: lenohou, akpm, axelrasmussen, linux-kernel, linux-mm, weixugc,
wjl.linux, yuanchu, yuzhao
On Mon, Mar 2, 2026 at 2:58 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Mon, Mar 2, 2026 at 1:50 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Sun, Mar 1, 2026 at 5:28 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Sun, Mar 1, 2026 at 12:10 AM Leno Hou <lenohou@gmail.com> wrote:
> > > >
> > > > [...]
> > >
> >
> > Hello Barry,
> >
> > > Is it possible to simply wait for draining to finish instead of performing
> > > an lru_gen/lru shrink while lru_gen is being disabled or enabled?
> >
> > This might introduce unexpected latency spikes during the waiting period.
>
> I assume latency is not a concern for a very rare
> MGLRU on/off case. Do you require the switch to happen
> with zero latency?
> My main concern is the correctness of the code.
>
> Now the proposed patch is:
>
> + bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> + bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
>
> Then choose MGLRU or active/inactive LRU based on
> those values.
>
> However, nothing prevents those values from changing
> after they are read. Even within the shrink path,
> they can still change.
If these values are changed during reclaim, the currently running
reclaimer will continue to operate with the old settings, while any
new reclaimer processes will adopt the new values. This approach
should prevent any immediate issues, but the primary risk of this
lockless method is the potential for a user to rapidly toggle the
MGLRU feature, particularly during an intermediate state.
>
> So I think we need an rwsem or something similar here —
> a read lock for shrink and a write lock for on/off. The
> write lock should happen very rarely.
We can introduce a lock-based mechanism in v2.
>
> >
> > >
> > > Performing a shrink in an intermediate state may still involve a lot of
> > > uncertainty, depending on how far the shrink has progressed and how much
> > > remains in each side’s LRU?
> >
> > The workingset might not be reliable in this intermediate state.
> > However, since switching MGLRU should not be a frequent operation in a
> > production environment, I believe the workingset in this intermediate
> > state should not be a concern. The only reason we would enable or
> > disable MGLRU is if we find that certain workloads benefit from
> > it—enabling it when it helps, and disabling it when it causes
> > degradation. There should be no other scenario in which we would need
> > to toggle MGLRU on or off.
> >
> > To identify which workloads can benefit from MGLRU, we must first
> > ensure that switching it on or off is safe—which is precisely why we
> > are proposing this patch. Once MGLRU is enabled in production, we can
> > continue to improve it. Perhaps in the future, we can even implement a
> > per-workload reclaim mechanism.
>
> To be honest, the on/off toggle is quite odd. If possible,
> I’d prefer not to switch MGLRU or active/inactive
> dynamically. Once it’s set up during system boot, it
> should remain unchanged.
While a boot-time setting works well for Android environments, it is
not viable for Kubernetes production servers, where rebooting is
highly disruptive. This limitation is precisely why we need to
introduce dynamic toggles.
>
> If we want a per-workload LRU, this could be a good
> place for eBPF to hook into folio enqueue, dequeue,
> and scanning. There is a project related to this [1][2].
>
> // Policy function hooks
> struct cache_ext_ops {
> s32 (*policy_init)(struct mem_cgroup *memcg);
> // Propose folios to evict
> void (*evict_folios)(struct eviction_ctx *ctx,
> struct mem_cgroup *memcg);
> void (*folio_added)(struct folio *folio);
> void (*folio_accessed)(struct folio *folio);
> // Folio was removed: clean up metadata
> void (*folio_removed)(struct folio *folio);
> char name[CACHE_EXT_OPS_NAME_LEN];
> };
>
> However, we would need a very strong and convincing
> user case to justify it.
Thanks for the info.
We're actually already running a BPF-based reclaimer in production,
but we don't have immediate plans to upstream or propose it just yet.
>
> [1] https://dl.acm.org/doi/pdf/10.1145/3731569.3764820
> [2] https://github.com/cache-ext/cache_ext
>
> Thanks
> Barry
--
Regards
Yafang
* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
2026-03-02 7:43 ` Yafang Shao
@ 2026-03-02 8:00 ` Kairui Song
2026-03-02 8:15 ` Barry Song
` (2 more replies)
2026-03-02 8:03 ` Barry Song
1 sibling, 3 replies; 22+ messages in thread
From: Kairui Song @ 2026-03-02 8:00 UTC (permalink / raw)
To: Yafang Shao
Cc: Barry Song, lenohou, akpm, axelrasmussen, linux-kernel, linux-mm,
weixugc, wjl.linux, yuanchu, yuzhao
On Mon, Mar 2, 2026 at 3:43 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Mon, Mar 2, 2026 at 2:58 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > I assume latency is not a concern for a very rare
> > MGLRU on/off case. Do you require the switch to happen
> > with zero latency?
> > My main concern is the correctness of the code.
> >
> > Now the proposed patch is:
> >
> > + bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> > + bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
> >
> > Then choose MGLRU or active/inactive LRU based on
> > those values.
> >
> > However, nothing prevents those values from changing
> > after they are read. Even within the shrink path,
> > they can still change.
Hi all,
> If these values are changed during reclaim, the currently running
> reclaimer will continue to operate with the old settings, while any
> new reclaimer processes will adopt the new values. This approach
> should prevent any immediate issues, but the primary risk of this
> lockless method is the potential for a user to rapidly toggle the
> MGLRU feature, particularly during an intermediate state.
>
> >
> > So I think we need an rwsem or something similar here —
> > a read lock for shrink and a write lock for on/off. The
> > write lock should happen very rarely.
>
> We can introduce a lock-based mechanism in v2.
I hope we don't need a lock here. Currently there is only a static
key; this patch is already adding more branches, and a lock would make
things even more complex. The shrinking path is quite performance
sensitive.
> >
> > To be honest, the on/off toggle is quite odd. If possible,
> > I’d prefer not to switch MGLRU or active/inactive
> > dynamically. Once it’s set up during system boot, it
> > should remain unchanged.
>
> While it is well-suited for Android environments, it is not viable for
> Kubernetes production servers, where rebooting is highly disruptive.
> This limitation is precisely why we need to introduce dynamic toggles.
I agree with Barry; the switch isn't supposed to be a knob that is
turned on/off frequently. And I think in the long term we should just
identify the workloads where MGLRU doesn't work well, and fix MGLRU.
Having two LRUs in the kernel is already very odd.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
2026-03-02 7:43 ` Yafang Shao
2026-03-02 8:00 ` Kairui Song
@ 2026-03-02 8:03 ` Barry Song
2026-03-02 8:13 ` Yafang Shao
1 sibling, 1 reply; 22+ messages in thread
From: Barry Song @ 2026-03-02 8:03 UTC (permalink / raw)
To: Yafang Shao
Cc: lenohou, akpm, axelrasmussen, linux-kernel, linux-mm, weixugc,
wjl.linux, yuanchu, yuzhao
On Mon, Mar 2, 2026 at 3:43 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Mon, Mar 2, 2026 at 2:58 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Mon, Mar 2, 2026 at 1:50 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Sun, Mar 1, 2026 at 5:28 AM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Sun, Mar 1, 2026 at 12:10 AM Leno Hou <lenohou@gmail.com> wrote:
> > > > >
> > > > > When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
> > > > > condition exists between the state switching and the memory reclaim
> > > > > path. This can lead to unexpected cgroup OOM kills, even when plenty of
> > > > > reclaimable memory is available.
> > > > >
> > > > > *** Problem Description ***
> > > > >
> > > > > The issue arises from a "reclaim vacuum" during the transition:
> > > > >
> > > > > 1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
> > > > > false before the pages are drained from MGLRU lists back to
> > > > > traditional LRU lists.
> > > > > 2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
> > > > > and skip the MGLRU path.
> > > > > 3. However, these pages might not have reached the traditional LRU lists
> > > > > yet, or the changes are not yet visible to all CPUs due to a lack of
> > > > > synchronization.
> > > > > 4. get_scan_count() subsequently finds traditional LRU lists empty,
> > > > > concludes there is no reclaimable memory, and triggers an OOM kill.
> > > > >
> > > > > A similar race can occur during enablement, where the reclaimer sees
> > > > > the new state but the MGLRU lists haven't been populated via
> > > > > fill_evictable() yet.
> > > > >
> > > > > *** Solution ***
> > > > >
> > > > > Introduce a 'draining' state to bridge the gap during transitions:
> > > > >
> > > > > - Use smp_store_release() and smp_load_acquire() to ensure the visibility
> > > > > of 'enabled' and 'draining' flags across CPUs.
> > > > > - Modify shrink_lruvec() to allow a "joint reclaim" period. If an lruvec
> > > > > is in the 'draining' state, the reclaimer will attempt to scan MGLRU
> > > > > lists first, and then fall through to traditional LRU lists instead
> > > > > of returning early. This ensures that folios are visible to at least
> > > > > one reclaim path at any given time.
> > > > >
> > > > > *** Reproduction ***
> > > > >
> > > > > The issue was consistently reproduced on v6.1.157 and v6.18.3 using
> > > > > a high-pressure memory cgroup (v1) environment.
> > > > >
> > > > > Reproduction steps:
> > > > > 1. Create a 16GB memcg and populate it with 10GB file cache (5GB active)
> > > > > and 8GB active anonymous memory.
> > > > > 2. Toggle MGLRU state while performing new memory allocations to force
> > > > > direct reclaim.
> > > > >
> > > > > Reproduction script:
> > > > > ---
> > > > > #!/bin/bash
> > > > > # Fixed reproduction for memcg OOM during MGLRU toggle
> > > > > set -euo pipefail
> > > > >
> > > > > MGLRU_FILE="/sys/kernel/mm/lru_gen/enabled"
> > > > > CGROUP_PATH="/sys/fs/cgroup/memory/memcg_oom_test"
> > > > >
> > > > > # Switch MGLRU dynamically in the background
> > > > > switch_mglru() {
> > > > >     local orig_val=$(cat "$MGLRU_FILE")
> > > > >     if [[ "$orig_val" != "0x0000" ]]; then
> > > > >         echo n > "$MGLRU_FILE" &
> > > > >     else
> > > > >         echo y > "$MGLRU_FILE" &
> > > > >     fi
> > > > > }
> > > > >
> > > > > # Setup 16G memcg
> > > > > mkdir -p "$CGROUP_PATH"
> > > > > echo $((16 * 1024 * 1024 * 1024)) > "$CGROUP_PATH/memory.limit_in_bytes"
> > > > > echo $$ > "$CGROUP_PATH/cgroup.procs"
> > > > >
> > > > > # 1. Build memory pressure (File + Anon)
> > > > > dd if=/dev/urandom of=/tmp/test_file bs=1M count=10240
> > > > > dd if=/tmp/test_file of=/dev/null bs=1M # Warm up cache
> > > > >
> > > > > stress-ng --vm 1 --vm-bytes 8G --vm-keep -t 600 &
> > > > > sleep 5
> > > > >
> > > > > # 2. Trigger switch and concurrent allocation
> > > > > switch_mglru
> > > > > stress-ng --vm 1 --vm-bytes 2G --vm-populate --timeout 5s || echo "OOM Triggered"
> > > > >
> > > > > # Check OOM counter
> > > > > grep oom_kill "$CGROUP_PATH/memory.oom_control"
> > > > > ---
> > > > >
> > > > > Signed-off-by: Leno Hou <lenohou@gmail.com>
> > > > >
> > > > > ---
> > > > > To: linux-mm@kvack.org
> > > > > To: linux-kernel@vger.kernel.org
> > > > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > > > Cc: Axel Rasmussen <axelrasmussen@google.com>
> > > > > Cc: Yuanchu Xie <yuanchu@google.com>
> > > > > Cc: Wei Xu <weixugc@google.com>
> > > > > Cc: Barry Song <21cnbao@gmail.com>
> > > > > Cc: Jialing Wang <wjl.linux@gmail.com>
> > > > > Cc: Yafang Shao <laoar.shao@gmail.com>
> > > > > Cc: Yu Zhao <yuzhao@google.com>
> > > > > ---
> > > > > include/linux/mmzone.h | 2 ++
> > > > > mm/vmscan.c | 14 +++++++++++---
> > > > > 2 files changed, 13 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > > > index 7fb7331c5725..0648ce91dbc6 100644
> > > > > --- a/include/linux/mmzone.h
> > > > > +++ b/include/linux/mmzone.h
> > > > > @@ -509,6 +509,8 @@ struct lru_gen_folio {
> > > > >          atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
> > > > >          /* whether the multi-gen LRU is enabled */
> > > > >          bool enabled;
> > > > > +        /* whether the multi-gen LRU is draining to LRU */
> > > > > +        bool draining;
> > > > >          /* the memcg generation this lru_gen_folio belongs to */
> > > > >          u8 gen;
> > > > >          /* the list segment this lru_gen_folio belongs to */
> > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > index 06071995dacc..629a00681163 100644
> > > > > --- a/mm/vmscan.c
> > > > > +++ b/mm/vmscan.c
> > > > > @@ -5222,7 +5222,8 @@ static void lru_gen_change_state(bool enabled)
> > > > >                          VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
> > > > >                          VM_WARN_ON_ONCE(!state_is_valid(lruvec));
> > > > >
> > > > > -                        lruvec->lrugen.enabled = enabled;
> > > > > +                        smp_store_release(&lruvec->lrugen.enabled, enabled);
> > > > > +                        smp_store_release(&lruvec->lrugen.draining, true);
> > > > >
> > > > >                          while (!(enabled ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
> > > > >                                  spin_unlock_irq(&lruvec->lru_lock);
> > > > > @@ -5230,6 +5231,8 @@ static void lru_gen_change_state(bool enabled)
> > > > >                                  spin_lock_irq(&lruvec->lru_lock);
> > > > >                          }
> > > > >
> > > > > +                        smp_store_release(&lruvec->lrugen.draining, false);
> > > > > +
> > > > >                          spin_unlock_irq(&lruvec->lru_lock);
> > > > >                  }
> > > > >
> > > > > @@ -5813,10 +5816,15 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> > > > >          unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> > > > >          bool proportional_reclaim;
> > > > >          struct blk_plug plug;
> > > > > +        bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> > > > > +        bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
> > > > >
> > > > > -        if (lru_gen_enabled() && !root_reclaim(sc)) {
> > > > > +        if (lrugen_enabled || lru_draining && !root_reclaim(sc)) {
> > > > >                  lru_gen_shrink_lruvec(lruvec, sc);
> > > > > -                return;
> > > >
> > >
> > > Hello Barry,
> > >
> > > > Is it possible to simply wait for draining to finish instead of performing
> > > > an lru_gen/lru shrink while lru_gen is being disabled or enabled?
> > >
> > > This might introduce unexpected latency spikes during the waiting period.
> >
> > I assume latency is not a concern for a very rare
> > MGLRU on/off case. Do you require the switch to happen
> > with zero latency?
> > My main concern is the correctness of the code.
> >
> > Now the proposed patch is:
> >
> > + bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> > + bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
> >
> > Then choose MGLRU or active/inactive LRU based on
> > those values.
> >
> > However, nothing prevents those values from changing
> > after they are read. Even within the shrink path,
> > they can still change.
>
> If these values are changed during reclaim, the currently running
> reclaimer will continue to operate with the old settings, while any
> new reclaimer processes will adopt the new values. This approach
> should prevent any immediate issues, but the primary risk of this
> lockless method is the potential for a user to rapidly toggle the
> MGLRU feature, particularly during an intermediate state.
>
> >
> > So I think we need an rwsem or something similar here —
> > a read lock for shrink and a write lock for on/off. The
> > write lock should happen very rarely.
>
> We can introduce a lock-based mechanism in v2.
Honestly, the on/off toggle is quite fragile. For instance,
folio_check_references() is doing:
        if (lru_gen_enabled()) {
                if (!referenced_ptes)
                        return FOLIOREF_RECLAIM;

                return lru_gen_set_refs(folio) ? FOLIOREF_ACTIVATE :
                       FOLIOREF_KEEP;
        }
However, `lru_gen_enabled()` does not indicate the actual LRU
where the folio resides.
`lru_gen_enabled()` is called in many places, and while a dynamic
toggle is in flight it does not accurately reflect where folios are
placed. During the switch, many unexpected behaviors may occur.
>
> >
> > >
> > > >
> > > > Performing a shrink in an intermediate state may still involve a lot of
> > > > uncertainty, depending on how far the shrink has progressed and how much
> > > > remains in each side’s LRU?
> > >
> > > The workingset might not be reliable in this intermediate state.
> > > However, since switching MGLRU should not be a frequent operation in a
> > > production environment, I believe the workingset in this intermediate
> > > state should not be a concern. The only reason we would enable or
> > > disable MGLRU is if we find that certain workloads benefit from
> > > it—enabling it when it helps, and disabling it when it causes
> > > degradation. There should be no other scenario in which we would need
> > > to toggle MGLRU on or off.
> > >
> > > To identify which workloads can benefit from MGLRU, we must first
> > > ensure that switching it on or off is safe—which is precisely why we
> > > are proposing this patch. Once MGLRU is enabled in production, we can
> > > continue to improve it. Perhaps in the future, we can even implement a
> > > per-workload reclaim mechanism.
> >
> > To be honest, the on/off toggle is quite odd. If possible,
> > I’d prefer not to switch MGLRU or active/inactive
> > dynamically. Once it’s set up during system boot, it
> > should remain unchanged.
>
> While it is well-suited for Android environments, it is not viable for
> Kubernetes production servers, where rebooting is highly disruptive.
> This limitation is precisely why we need to introduce dynamic toggles.
Perhaps we really need to unify MGLRU with the active/inactive lists,
combining the benefits of both approaches. The dynamic toggle, as it
stands, is quite fragile.
A topic was suggested by Kairui here [1].
>
> >
> > If we want a per-workload LRU, this could be a good
> > place for eBPF to hook into folio enqueue, dequeue,
> > and scanning. There is a project related to this [1][2].
> >
> > // Policy function hooks
> > struct cache_ext_ops {
> >         s32 (*policy_init)(struct mem_cgroup *memcg);
> >         // Propose folios to evict
> >         void (*evict_folios)(struct eviction_ctx *ctx,
> >                              struct mem_cgroup *memcg);
> >         void (*folio_added)(struct folio *folio);
> >         void (*folio_accessed)(struct folio *folio);
> >         // Folio was removed: clean up metadata
> >         void (*folio_removed)(struct folio *folio);
> >         char name[CACHE_EXT_OPS_NAME_LEN];
> > };
> >
> > However, we would need a very strong and convincing
> > use case to justify it.
>
> Thanks for the info.
> We're actually already running a BPF-based reclaimer in production,
> but we don't have immediate plans to upstream or propose it just yet.
I know you are always far ahead of everyone else. I’m looking forward
to seeing your code and use cases when you are ready.
[1] https://lore.kernel.org/linux-mm/CAMgjq7BoekNjg-Ra3C8M7=8=75su38w=HD782T5E_cxyeCeH_g@mail.gmail.com/
Thanks
Barry
* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
2026-03-02 8:03 ` Barry Song
@ 2026-03-02 8:13 ` Yafang Shao
2026-03-02 8:20 ` Barry Song
0 siblings, 1 reply; 22+ messages in thread
From: Yafang Shao @ 2026-03-02 8:13 UTC (permalink / raw)
To: Barry Song
Cc: lenohou, akpm, axelrasmussen, linux-kernel, linux-mm, weixugc,
wjl.linux, yuanchu, yuzhao
On Mon, Mar 2, 2026 at 4:04 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Mon, Mar 2, 2026 at 3:43 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Mon, Mar 2, 2026 at 2:58 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Mon, Mar 2, 2026 at 1:50 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Sun, Mar 1, 2026 at 5:28 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > On Sun, Mar 1, 2026 at 12:10 AM Leno Hou <lenohou@gmail.com> wrote:
> > > > > >
> > > > > > When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
> > > > > > condition exists between the state switching and the memory reclaim
> > > > > > path. This can lead to unexpected cgroup OOM kills, even when plenty of
> > > > > > reclaimable memory is available.
> > > > > >
> > > > > > *** Problem Description ***
> > > > > >
> > > > > > The issue arises from a "reclaim vacuum" during the transition:
> > > > > >
> > > > > > 1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
> > > > > > false before the pages are drained from MGLRU lists back to
> > > > > > traditional LRU lists.
> > > > > > 2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
> > > > > > and skip the MGLRU path.
> > > > > > 3. However, these pages might not have reached the traditional LRU lists
> > > > > > yet, or the changes are not yet visible to all CPUs due to a lack of
> > > > > > synchronization.
> > > > > > 4. get_scan_count() subsequently finds traditional LRU lists empty,
> > > > > > concludes there is no reclaimable memory, and triggers an OOM kill.
> > > > > >
> > > > > > A similar race can occur during enablement, where the reclaimer sees
> > > > > > the new state but the MGLRU lists haven't been populated via
> > > > > > fill_evictable() yet.
> > > > > >
> > > > > > *** Solution ***
> > > > > >
> > > > > > Introduce a 'draining' state to bridge the gap during transitions:
> > > > > >
> > > > > > - Use smp_store_release() and smp_load_acquire() to ensure the visibility
> > > > > > of 'enabled' and 'draining' flags across CPUs.
> > > > > > - Modify shrink_lruvec() to allow a "joint reclaim" period. If an lruvec
> > > > > > is in the 'draining' state, the reclaimer will attempt to scan MGLRU
> > > > > > lists first, and then fall through to traditional LRU lists instead
> > > > > > of returning early. This ensures that folios are visible to at least
> > > > > > one reclaim path at any given time.
> > > > > >
> > > > > > *** Reproduction ***
> > > > > >
> > > > > > The issue was consistently reproduced on v6.1.157 and v6.18.3 using
> > > > > > a high-pressure memory cgroup (v1) environment.
> > > > > >
> > > > > > Reproduction steps:
> > > > > > 1. Create a 16GB memcg and populate it with 10GB file cache (5GB active)
> > > > > > and 8GB active anonymous memory.
> > > > > > 2. Toggle MGLRU state while performing new memory allocations to force
> > > > > > direct reclaim.
> > > > > >
> > > > > > Reproduction script:
> > > > > > ---
> > > > > > #!/bin/bash
> > > > > > # Fixed reproduction for memcg OOM during MGLRU toggle
> > > > > > set -euo pipefail
> > > > > >
> > > > > > MGLRU_FILE="/sys/kernel/mm/lru_gen/enabled"
> > > > > > CGROUP_PATH="/sys/fs/cgroup/memory/memcg_oom_test"
> > > > > >
> > > > > > # Switch MGLRU dynamically in the background
> > > > > > switch_mglru() {
> > > > > >     local orig_val=$(cat "$MGLRU_FILE")
> > > > > >     if [[ "$orig_val" != "0x0000" ]]; then
> > > > > >         echo n > "$MGLRU_FILE" &
> > > > > >     else
> > > > > >         echo y > "$MGLRU_FILE" &
> > > > > >     fi
> > > > > > }
> > > > > >
> > > > > > # Setup 16G memcg
> > > > > > mkdir -p "$CGROUP_PATH"
> > > > > > echo $((16 * 1024 * 1024 * 1024)) > "$CGROUP_PATH/memory.limit_in_bytes"
> > > > > > echo $$ > "$CGROUP_PATH/cgroup.procs"
> > > > > >
> > > > > > # 1. Build memory pressure (File + Anon)
> > > > > > dd if=/dev/urandom of=/tmp/test_file bs=1M count=10240
> > > > > > dd if=/tmp/test_file of=/dev/null bs=1M # Warm up cache
> > > > > >
> > > > > > stress-ng --vm 1 --vm-bytes 8G --vm-keep -t 600 &
> > > > > > sleep 5
> > > > > >
> > > > > > # 2. Trigger switch and concurrent allocation
> > > > > > switch_mglru
> > > > > > stress-ng --vm 1 --vm-bytes 2G --vm-populate --timeout 5s || echo "OOM Triggered"
> > > > > >
> > > > > > # Check OOM counter
> > > > > > grep oom_kill "$CGROUP_PATH/memory.oom_control"
> > > > > > ---
> > > > > >
> > > > > > Signed-off-by: Leno Hou <lenohou@gmail.com>
> > > > > >
> > > > > > ---
> > > > > > To: linux-mm@kvack.org
> > > > > > To: linux-kernel@vger.kernel.org
> > > > > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > > > > Cc: Axel Rasmussen <axelrasmussen@google.com>
> > > > > > Cc: Yuanchu Xie <yuanchu@google.com>
> > > > > > Cc: Wei Xu <weixugc@google.com>
> > > > > > Cc: Barry Song <21cnbao@gmail.com>
> > > > > > Cc: Jialing Wang <wjl.linux@gmail.com>
> > > > > > Cc: Yafang Shao <laoar.shao@gmail.com>
> > > > > > Cc: Yu Zhao <yuzhao@google.com>
> > > > > > ---
> > > > > > include/linux/mmzone.h | 2 ++
> > > > > > mm/vmscan.c | 14 +++++++++++---
> > > > > > 2 files changed, 13 insertions(+), 3 deletions(-)
> > > > > >
> > > > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > > > > index 7fb7331c5725..0648ce91dbc6 100644
> > > > > > --- a/include/linux/mmzone.h
> > > > > > +++ b/include/linux/mmzone.h
> > > > > > @@ -509,6 +509,8 @@ struct lru_gen_folio {
> > > > > >          atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
> > > > > >          /* whether the multi-gen LRU is enabled */
> > > > > >          bool enabled;
> > > > > > +        /* whether the multi-gen LRU is draining to LRU */
> > > > > > +        bool draining;
> > > > > >          /* the memcg generation this lru_gen_folio belongs to */
> > > > > >          u8 gen;
> > > > > >          /* the list segment this lru_gen_folio belongs to */
> > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > > index 06071995dacc..629a00681163 100644
> > > > > > --- a/mm/vmscan.c
> > > > > > +++ b/mm/vmscan.c
> > > > > > @@ -5222,7 +5222,8 @@ static void lru_gen_change_state(bool enabled)
> > > > > >                          VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
> > > > > >                          VM_WARN_ON_ONCE(!state_is_valid(lruvec));
> > > > > >
> > > > > > -                        lruvec->lrugen.enabled = enabled;
> > > > > > +                        smp_store_release(&lruvec->lrugen.enabled, enabled);
> > > > > > +                        smp_store_release(&lruvec->lrugen.draining, true);
> > > > > >
> > > > > >                          while (!(enabled ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
> > > > > >                                  spin_unlock_irq(&lruvec->lru_lock);
> > > > > > @@ -5230,6 +5231,8 @@ static void lru_gen_change_state(bool enabled)
> > > > > >                                  spin_lock_irq(&lruvec->lru_lock);
> > > > > >                          }
> > > > > >
> > > > > > +                        smp_store_release(&lruvec->lrugen.draining, false);
> > > > > > +
> > > > > >                          spin_unlock_irq(&lruvec->lru_lock);
> > > > > >                  }
> > > > > >
> > > > > > @@ -5813,10 +5816,15 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> > > > > >          unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> > > > > >          bool proportional_reclaim;
> > > > > >          struct blk_plug plug;
> > > > > > +        bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> > > > > > +        bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
> > > > > >
> > > > > > -        if (lru_gen_enabled() && !root_reclaim(sc)) {
> > > > > > +        if (lrugen_enabled || lru_draining && !root_reclaim(sc)) {
> > > > > >                  lru_gen_shrink_lruvec(lruvec, sc);
> > > > > > -                return;
> > > > >
> > > >
> > > > Hello Barry,
> > > >
> > > > > Is it possible to simply wait for draining to finish instead of performing
> > > > > an lru_gen/lru shrink while lru_gen is being disabled or enabled?
> > > >
> > > > This might introduce unexpected latency spikes during the waiting period.
> > >
> > > I assume latency is not a concern for a very rare
> > > MGLRU on/off case. Do you require the switch to happen
> > > with zero latency?
> > > My main concern is the correctness of the code.
> > >
> > > Now the proposed patch is:
> > >
> > > + bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> > > + bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
> > >
> > > Then choose MGLRU or active/inactive LRU based on
> > > those values.
> > >
> > > However, nothing prevents those values from changing
> > > after they are read. Even within the shrink path,
> > > they can still change.
> >
> > If these values are changed during reclaim, the currently running
> > reclaimer will continue to operate with the old settings, while any
> > new reclaimer processes will adopt the new values. This approach
> > should prevent any immediate issues, but the primary risk of this
> > lockless method is the potential for a user to rapidly toggle the
> > MGLRU feature, particularly during an intermediate state.
> >
> > >
> > > So I think we need an rwsem or something similar here —
> > > a read lock for shrink and a write lock for on/off. The
> > > write lock should happen very rarely.
> >
> > We can introduce a lock-based mechanism in v2.
>
> Honestly, the on/off toggle is quite fragile. For instance,
>
> folio_check_references() is doing:
>
>         if (lru_gen_enabled()) {
>                 if (!referenced_ptes)
>                         return FOLIOREF_RECLAIM;
>
>                 return lru_gen_set_refs(folio) ? FOLIOREF_ACTIVATE :
>                        FOLIOREF_KEEP;
>         }
>
> However, `lru_gen_enabled()` does not indicate the actual LRU
> where the folio resides.
>
> `lru_gen_enabled()` is called in many places, but in this case it does
> not accurately reflect where folios are placed if a dynamic toggle is
> active. During the switching, many unexpected behaviors may occur.
>
> >
> > >
> > > >
> > > > >
> > > > > Performing a shrink in an intermediate state may still involve a lot of
> > > > > uncertainty, depending on how far the shrink has progressed and how much
> > > > > remains in each side’s LRU?
> > > >
> > > > The workingset might not be reliable in this intermediate state.
> > > > However, since switching MGLRU should not be a frequent operation in a
> > > > production environment, I believe the workingset in this intermediate
> > > > state should not be a concern. The only reason we would enable or
> > > > disable MGLRU is if we find that certain workloads benefit from
> > > > it—enabling it when it helps, and disabling it when it causes
> > > > degradation. There should be no other scenario in which we would need
> > > > to toggle MGLRU on or off.
> > > >
> > > > To identify which workloads can benefit from MGLRU, we must first
> > > > ensure that switching it on or off is safe—which is precisely why we
> > > > are proposing this patch. Once MGLRU is enabled in production, we can
> > > > continue to improve it. Perhaps in the future, we can even implement a
> > > > per-workload reclaim mechanism.
> > >
> > > To be honest, the on/off toggle is quite odd. If possible,
> > > I’d prefer not to switch MGLRU or active/inactive
> > > dynamically. Once it’s set up during system boot, it
> > > should remain unchanged.
> >
> > While it is well-suited for Android environments, it is not viable for
> > Kubernetes production servers, where rebooting is highly disruptive.
> > This limitation is precisely why we need to introduce dynamic toggles.
>
> Perhaps we really need to unify MGLRU with the active/inactive lists,
> combining the benefits of both approaches. The dynamic toggle, as it
> stands, is quite fragile.
> A topic was suggested by Kairui here [1].
>
> >
> > >
> > > If we want a per-workload LRU, this could be a good
> > > place for eBPF to hook into folio enqueue, dequeue,
> > > and scanning. There is a project related to this [1][2].
> > >
> > > // Policy function hooks
> > > struct cache_ext_ops {
> > >         s32 (*policy_init)(struct mem_cgroup *memcg);
> > >         // Propose folios to evict
> > >         void (*evict_folios)(struct eviction_ctx *ctx,
> > >                              struct mem_cgroup *memcg);
> > >         void (*folio_added)(struct folio *folio);
> > >         void (*folio_accessed)(struct folio *folio);
> > >         // Folio was removed: clean up metadata
> > >         void (*folio_removed)(struct folio *folio);
> > >         char name[CACHE_EXT_OPS_NAME_LEN];
> > > };
> > >
> > > However, we would need a very strong and convincing
> > > use case to justify it.
> >
> > Thanks for the info.
> > We're actually already running a BPF-based reclaimer in production,
> > but we don't have immediate plans to upstream or propose it just yet.
>
> I know you are always far ahead of everyone else. I’m looking forward
> to seeing your code and use cases when you are ready.
Don't say it that way; that is not cooperative.
We've only deployed a limited BPF-based memcg async reclaimer
internally, and it is currently scoped to our own workloads.
--
Regards
Yafang
* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
2026-03-02 8:00 ` Kairui Song
@ 2026-03-02 8:15 ` Barry Song
2026-03-02 8:25 ` Yafang Shao
2026-03-02 16:26 ` Michal Hocko
2 siblings, 0 replies; 22+ messages in thread
From: Barry Song @ 2026-03-02 8:15 UTC (permalink / raw)
To: Kairui Song
Cc: Yafang Shao, lenohou, akpm, axelrasmussen, linux-kernel,
linux-mm, weixugc, wjl.linux, yuanchu, yuzhao
On Mon, Mar 2, 2026 at 4:00 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Mon, Mar 2, 2026 at 3:43 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Mon, Mar 2, 2026 at 2:58 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > I assume latency is not a concern for a very rare
> > > MGLRU on/off case. Do you require the switch to happen
> > > with zero latency?
> > > My main concern is the correctness of the code.
> > >
> > > Now the proposed patch is:
> > >
> > > + bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> > > + bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
> > >
> > > Then choose MGLRU or active/inactive LRU based on
> > > those values.
> > >
> > > However, nothing prevents those values from changing
> > > after they are read. Even within the shrink path,
> > > they can still change.
>
> Hi all,
>
> > If these values are changed during reclaim, the currently running
> > reclaimer will continue to operate with the old settings, while any
> > new reclaimer processes will adopt the new values. This approach
> > should prevent any immediate issues, but the primary risk of this
> > lockless method is the potential for a user to rapidly toggle the
> > MGLRU feature, particularly during an intermediate state.
> >
> > >
> > > So I think we need an rwsem or something similar here —
> > > a read lock for shrink and a write lock for on/off. The
> > > write lock should happen very rarely.
> >
> > We can introduce a lock-based mechanism in v2.
>
> I hope we don't need a lock here. Currently there is only a static
> key, this patch is already adding more branches, a lock will make
> things more complex and the shrinking path is quite performance
> sensitive.
I agree that the shrinking path is performance-sensitive. However, the
bottleneck occurs when we move folios out of the LRU, performing
reference checks by scanning PTEs with rmap, unmapping, and compressing
memory. I believe the overhead of either the branch or the read lock is
too small to noticeably affect shrink performance.
Thanks
Barry
* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
2026-03-02 8:13 ` Yafang Shao
@ 2026-03-02 8:20 ` Barry Song
0 siblings, 0 replies; 22+ messages in thread
From: Barry Song @ 2026-03-02 8:20 UTC (permalink / raw)
To: Yafang Shao
Cc: lenohou, akpm, axelrasmussen, linux-kernel, linux-mm, weixugc,
wjl.linux, yuanchu, yuzhao
On Mon, Mar 2, 2026 at 4:14 PM Yafang Shao <laoar.shao@gmail.com> wrote:
[...]
> > > > If we want a per-workload LRU, this could be a good
> > > > place for eBPF to hook into folio enqueue, dequeue,
> > > > and scanning. There is a project related to this [1][2].
> > > >
> > > > // Policy function hooks
> > > > struct cache_ext_ops {
> > > >         s32 (*policy_init)(struct mem_cgroup *memcg);
> > > >         // Propose folios to evict
> > > >         void (*evict_folios)(struct eviction_ctx *ctx,
> > > >                              struct mem_cgroup *memcg);
> > > >         void (*folio_added)(struct folio *folio);
> > > >         void (*folio_accessed)(struct folio *folio);
> > > >         // Folio was removed: clean up metadata
> > > >         void (*folio_removed)(struct folio *folio);
> > > >         char name[CACHE_EXT_OPS_NAME_LEN];
> > > > };
> > > >
> > > > However, we would need a very strong and convincing
> > > > use case to justify it.
> > >
> > > Thanks for the info.
> > > We're actually already running a BPF-based reclaimer in production,
> > > but we don't have immediate plans to upstream or propose it just yet.
> >
> > I know you are always far ahead of everyone else. I’m looking forward
> > to seeing your code and use cases when you are ready.
>
> Don't say it that way, that is not cooperative.
> We've only deployed a limited BPF-based memcg async reclaimer
> internally, and it is currently scoped to our own workloads.
>
Don’t be so sensitive:-) When I say you are far ahead, that’s exactly
what I mean. I truly admire your work on making LRU programmable.
That’s all.
I understand that you have only deployed a limited BPF-based case,
so you need more time before sharing it. I am not criticizing you for
not sharing it yet. Please don’t misunderstand me.
Best Regards
Barry
* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
2026-03-02 8:00 ` Kairui Song
2026-03-02 8:15 ` Barry Song
@ 2026-03-02 8:25 ` Yafang Shao
2026-03-02 9:20 ` Barry Song
2026-03-02 16:26 ` Michal Hocko
2 siblings, 1 reply; 22+ messages in thread
From: Yafang Shao @ 2026-03-02 8:25 UTC (permalink / raw)
To: Kairui Song
Cc: Barry Song, lenohou, akpm, axelrasmussen, linux-kernel, linux-mm,
weixugc, wjl.linux, yuanchu, yuzhao
On Mon, Mar 2, 2026 at 4:00 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Mon, Mar 2, 2026 at 3:43 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Mon, Mar 2, 2026 at 2:58 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > I assume latency is not a concern for a very rare
> > > MGLRU on/off case. Do you require the switch to happen
> > > with zero latency?
> > > My main concern is the correctness of the code.
> > >
> > > Now the proposed patch is:
> > >
> > > + bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> > > + bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
> > >
> > > Then choose MGLRU or active/inactive LRU based on
> > > those values.
> > >
> > > However, nothing prevents those values from changing
> > > after they are read. Even within the shrink path,
> > > they can still change.
>
> Hi all,
>
> > If these values are changed during reclaim, the currently running
> > reclaimer will continue to operate with the old settings, while any
> > new reclaimer processes will adopt the new values. This approach
> > should prevent any immediate issues, but the primary risk of this
> > lockless method is the potential for a user to rapidly toggle the
> > MGLRU feature, particularly during an intermediate state.
> >
> > >
> > > So I think we need an rwsem or something similar here —
> > > a read lock for shrink and a write lock for on/off. The
> > > write lock should happen very rarely.
> >
> > We can introduce a lock-based mechanism in v2.
>
> I hope we don't need a lock here. Currently there is only a static
> key, this patch is already adding more branches, a lock will make
> things more complex and the shrinking path is quite performance
> sensitive.
>
> > >
> > > To be honest, the on/off toggle is quite odd. If possible,
> > > I’d prefer not to switch MGLRU or active/inactive
> > > dynamically. Once it’s set up during system boot, it
> > > should remain unchanged.
> >
> > While it is well-suited for Android environments, it is not viable for
> > Kubernetes production servers, where rebooting is highly disruptive.
> > This limitation is precisely why we need to introduce dynamic toggles.
>
> I agree with Barry, the switch isn't supposed to be a knob to be
> turned on/off frequently. And I think in the long term we should just
> identify the workloads where MGLRU doesn't work well, and fix MGLRU.
The challenge we're currently facing is that we don't yet know which
workloads would benefit from it ;)
We do want to enable mglru on our production servers, but first we
need to address the risk of OOM during the switch—that's exactly why
we're proposing this patch.
> Having two LRUs in the kernel is already very odd.
It's difficult to completely move away from either one.
Looking forward to your work.
--
Regards
Yafang
* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
2026-03-02 8:25 ` Yafang Shao
@ 2026-03-02 9:20 ` Barry Song
2026-03-02 9:47 ` Kairui Song
0 siblings, 1 reply; 22+ messages in thread
From: Barry Song @ 2026-03-02 9:20 UTC (permalink / raw)
To: Yafang Shao
Cc: Kairui Song, lenohou, akpm, axelrasmussen, linux-kernel,
linux-mm, weixugc, wjl.linux, yuanchu, yuzhao
On Mon, Mar 2, 2026 at 4:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Mon, Mar 2, 2026 at 4:00 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Mon, Mar 2, 2026 at 3:43 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Mon, Mar 2, 2026 at 2:58 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > I assume latency is not a concern for a very rare
> > > > MGLRU on/off case. Do you require the switch to happen
> > > > with zero latency?
> > > > My main concern is the correctness of the code.
> > > >
> > > > Now the proposed patch is:
> > > >
> > > > + bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> > > > + bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
> > > >
> > > > Then choose MGLRU or active/inactive LRU based on
> > > > those values.
> > > >
> > > > However, nothing prevents those values from changing
> > > > after they are read. Even within the shrink path,
> > > > they can still change.
> >
> > Hi all,
> >
> > > If these values are changed during reclaim, the currently running
> > > reclaimer will continue to operate with the old settings, while any
> > > new reclaimer processes will adopt the new values. This approach
> > > should prevent any immediate issues, but the primary risk of this
> > > lockless method is the potential for a user to rapidly toggle the
> > > MGLRU feature, particularly during an intermediate state.
> > >
> > > >
> > > > So I think we need an rwsem or something similar here —
> > > > a read lock for shrink and a write lock for on/off. The
> > > > write lock should happen very rarely.
> > >
> > > We can introduce a lock-based mechanism in v2.
> >
> > I hope we don't need a lock here. Currently there is only a static
> > key, this patch is already adding more branches, a lock will make
> > things more complex and the shrinking path is quite performance
> > sensitive.
> >
> > > >
> > > > To be honest, the on/off toggle is quite odd. If possible,
> > > > I’d prefer not to switch MGLRU or active/inactive
> > > > dynamically. Once it’s set up during system boot, it
> > > > should remain unchanged.
> > >
> > > While it is well-suited for Android environments, it is not viable for
> > > Kubernetes production servers, where rebooting is highly disruptive.
> > > This limitation is precisely why we need to introduce dynamic toggles.
> >
> > I agree with Barry, the switch isn't supposed to be a knob to be
> > turned on/off frequently. And I think in the long term we should just
> > identify the workloads where MGLRU doesn't work well, and fix MGLRU.
>
> The challenge we're currently facing is that we don't yet know which
> workloads would benefit from it ;)
> We do want to enable mglru on our production servers, but first we
> need to address the risk of OOM during the switch—that's exactly why
> we're proposing this patch.
Nobody objects to your intention to fix it. I’m curious: to what
extent do we want to fix it? Do we aim to merely reduce the probability
of OOM and other mistakes, or do we want a complete fix that makes
the dynamic on/off fully safe?
Currently, many places appear fragile, mainly because
`lru_gen_enabled()` checks a global variable that doesn’t accurately
reflect where folios are during switching. A full fix might require
guarding the shrinking path against the switching path to prevent
simultaneous execution, which would add unnecessary complexity for a
rarely used "feature".
If our goal is only to reduce the probability of mistakes, I feel your
current patch may be fine, even though some race conditions
remain in principle.
Thanks
Barry
* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
2026-03-02 9:20 ` Barry Song
@ 2026-03-02 9:47 ` Kairui Song
2026-03-02 14:35 ` Yafang Shao
0 siblings, 1 reply; 22+ messages in thread
From: Kairui Song @ 2026-03-02 9:47 UTC (permalink / raw)
To: Barry Song, Yafang Shao
Cc: bingfangguo, lenohou, akpm, axelrasmussen, linux-kernel,
linux-mm, weixugc, wjl.linux, yuanchu, yuzhao
On Mon, Mar 2, 2026 at 5:20 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Mon, Mar 2, 2026 at 4:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > The challenge we're currently facing is that we don't yet know which
> > workloads would benefit from it ;)
> > We do want to enable mglru on our production servers, but first we
> > need to address the risk of OOM during the switch—that's exactly why
> > we're proposing this patch.
>
> Nobody objects to your intention to fix it. I’m curious: to what
> extent do we want to fix it? Do we aim to merely reduce the probability
> of OOM and other mistakes, or do we want a complete fix that makes
> the dynamic on/off fully safe?
Yeah, I'm glad that more people are trying MGLRU and improving it.
We also have a downstream fix for the OOM on switch issue, but that's
mostly a fallback in case MGLRU doesn't work well; our goal is
still to enable MGLRU as much as possible. Many issues have been
identified, and I'm willing to push and fix things upstream together.
I didn't consider the OOM on switch an upstream issue, though. But
to fix that we just used a schedule_timeout() when seeing that the lru
status differs from the global status, very close to what Barry
suggested, with some other tweaks.
Keeping reclaim running during the switch did result in some unexpected
behaviors, including OOM still occurring, just much less likely than
before: a typical TOCTOU problem when checking the lru's status.
Let me Cc Bingfang; maybe he can provide more detail.
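As a rough illustration of the backoff approach described above (all names are hypothetical, and a kernel version would use schedule_timeout() rather than nanosleep()): a reclaimer that sees the per-lruvec state out of sync with the global state retries briefly instead of concluding there is nothing to reclaim.

```c
/*
 * Rough userspace sketch of the "back off and recheck" idea. All
 * names are hypothetical; the actual downstream patch may differ.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

static atomic_bool global_lrugen_enabled = true;
static atomic_bool lruvec_lrugen_enabled = true;

/* Stand-in for schedule_timeout(): sleep roughly 1 ms. */
static void backoff(void)
{
	struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000000 };

	nanosleep(&ts, NULL);
}

/*
 * Returns true once the lruvec has caught up with the global state,
 * false if it never converged within max_retries; the caller then
 * falls back rather than triggering OOM on an apparently empty lruvec.
 */
static bool wait_for_switch(int max_retries)
{
	while (max_retries--) {
		if (atomic_load(&lruvec_lrugen_enabled) ==
		    atomic_load(&global_lrugen_enabled))
			return true;
		backoff();
	}
	return false;
}
```

As Kairui notes, this only narrows the window: the state can still flip right after the check, which is the TOCTOU residue mentioned above.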
* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
2026-03-02 9:47 ` Kairui Song
@ 2026-03-02 14:35 ` Yafang Shao
2026-03-02 17:51 ` Yuanchu Xie
0 siblings, 1 reply; 22+ messages in thread
From: Yafang Shao @ 2026-03-02 14:35 UTC (permalink / raw)
To: Kairui Song
Cc: Barry Song, bingfangguo, lenohou, akpm, axelrasmussen,
linux-kernel, linux-mm, weixugc, wjl.linux, yuanchu, yuzhao
On Mon, Mar 2, 2026 at 5:48 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Mon, Mar 2, 2026 at 5:20 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Mon, Mar 2, 2026 at 4:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > The challenge we're currently facing is that we don't yet know which
> > > workloads would benefit from it ;)
> > > We do want to enable mglru on our production servers, but first we
> > > need to address the risk of OOM during the switch—that's exactly why
> > > we're proposing this patch.
> >
> > Nobody objects to your intention to fix it. I’m curious: to what
> > extent do we want to fix it? Do we aim to merely reduce the probability
> > of OOM and other mistakes, or do we want a complete fix that makes
> > the dynamic on/off fully safe?
>
> Yeah, I'm glad that more people are trying MGLRU and improving it.
>
> We also have a downstream fix for the OOM on switch issue, but that's
> mostly a fallback in case MGLRU doesn't work well; our goal is
> still to enable MGLRU as much as possible,
Our goals are aligned.
Before enabling mglru, we must first ensure it won't cause OOM errors
across multiple servers. We propose fixing this because, during our
previous mglru enablement, many instances of a single service OOM'd
simultaneously—potentially leading to data loss for that service.
> many issues have been
> identified and I'm willing to push and fix things upstream together.
>
> I didn't consider the OOM on switch an upstream issue, though.
This is a serious upstream kernel bug that could lead to data loss. If
it is not recognized as such, the upstream kernel should consider
removing this dynamic toggle.
> But
> to fix that we just used a schedule_timeout when seeing the lru status
So your proposal is essentially something like this?
while (status) {
schedule_timeout(random_timeout);
}
> is different from the global status, very close to what Barry
> suggested, with some other tweaks.
>
> Keeping reclaim running during the switch did result in some unexpected
> behaviors, including OOM still occurring, just much less likely than
> before: a typical TOCTOU problem when checking the lru's status.
>
> Let me Cc BIngfang, maybe he can provide more detail.
Looking forward to your solution.
--
Regards
Yafang
* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
2026-03-02 8:00 ` Kairui Song
2026-03-02 8:15 ` Barry Song
2026-03-02 8:25 ` Yafang Shao
@ 2026-03-02 16:26 ` Michal Hocko
2 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2026-03-02 16:26 UTC (permalink / raw)
To: Kairui Song
Cc: Yafang Shao, Barry Song, lenohou, akpm, axelrasmussen,
linux-kernel, linux-mm, weixugc, wjl.linux, yuanchu, yuzhao
On Mon 02-03-26 16:00:03, Kairui Song wrote:
> On Mon, Mar 2, 2026 at 3:43 PM Yafang Shao <laoar.shao@gmail.com> wrote:
[...]
> > > To be honest, the on/off toggle is quite odd. If possible,
> > > I’d prefer not to switch MGLRU or active/inactive
> > > dynamically. Once it’s set up during system boot, it
> > > should remain unchanged.
> >
> > While it is well-suited for Android environments, it is not viable for
> > Kubernetes production servers, where rebooting is highly disruptive.
> > This limitation is precisely why we need to introduce dynamic toggles.
>
> I agree with Barry, the switch isn't supposed to be a knob to be
> turned on/off frequently.
Is there any actual use case, other than debugging, for switching the
reclaim implementation back and forth? In other words, do we really need
to care about this issue at all? Is the additional code worth it?
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
2026-03-02 14:35 ` Yafang Shao
@ 2026-03-02 17:51 ` Yuanchu Xie
0 siblings, 0 replies; 22+ messages in thread
From: Yuanchu Xie @ 2026-03-02 17:51 UTC (permalink / raw)
To: Yafang Shao
Cc: Kairui Song, Barry Song, bingfangguo, lenohou, akpm,
axelrasmussen, linux-kernel, linux-mm, weixugc, wjl.linux,
yuzhao
Hi Yafang,
On Mon, Mar 2, 2026 at 8:36 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Mon, Mar 2, 2026 at 5:48 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Mon, Mar 2, 2026 at 5:20 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Mon, Mar 2, 2026 at 4:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > The challenge we're currently facing is that we don't yet know which
> > > > workloads would benefit from it ;)
> > > > We do want to enable mglru on our production servers, but first we
> > > > need to address the risk of OOM during the switch—that's exactly why
> > > > we're proposing this patch.
> > >
> > > Nobody objects to your intention to fix it. I’m curious: to what
> > > extent do we want to fix it? Do we aim to merely reduce the probability
> > > of OOM and other mistakes, or do we want a complete fix that makes
> > > the dynamic on/off fully safe?
> >
> > Yeah, I'm glad that more people are trying MGLRU and improving it.
> >
> > We also have a downstream fix for the OOM on switch issue, but that's
> > mostly a fallback in case MGLRU doesn't work well; our goal is
> > still to enable MGLRU as much as possible,
>
> Our goals are aligned.
> Before enabling mglru, we must first ensure it won't cause OOM errors
> across multiple servers. We propose fixing this because, during our
> previous mglru enablement, many instances of a single service OOM'd
> simultaneously—potentially leading to data loss for that service.
Would it be possible to drain the jobs away from the machine before
switching LRUs? The MGLRU kill-switch could be improved, but making
the switch more or less "hitless" would require significant work. Is
the use case a one-time switch from active/inactive to MGLRU?
I do want to note that OOMs causing data loss is not really the kernel's fault.
Thanks,
Yuanchu
end of thread, other threads:[~2026-03-02 17:52 UTC | newest]
Thread overview: 22+ messages
2026-02-28 16:10 [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching Leno Hou
2026-02-28 18:58 ` Andrew Morton
2026-02-28 19:12 ` kernel test robot
2026-02-28 19:23 ` kernel test robot
2026-02-28 20:15 ` kernel test robot
2026-02-28 21:28 ` Barry Song
2026-02-28 22:41 ` Barry Song
2026-03-01 4:10 ` Barry Song
2026-03-02 5:50 ` Yafang Shao
2026-03-02 6:58 ` Barry Song
2026-03-02 7:43 ` Yafang Shao
2026-03-02 8:00 ` Kairui Song
2026-03-02 8:15 ` Barry Song
2026-03-02 8:25 ` Yafang Shao
2026-03-02 9:20 ` Barry Song
2026-03-02 9:47 ` Kairui Song
2026-03-02 14:35 ` Yafang Shao
2026-03-02 17:51 ` Yuanchu Xie
2026-03-02 16:26 ` Michal Hocko
2026-03-02 8:03 ` Barry Song
2026-03-02 8:13 ` Yafang Shao
2026-03-02 8:20 ` Barry Song