[PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
@ 2026-02-28 16:10 Leno Hou
  2026-02-28 18:58 ` Andrew Morton
                   ` (4 more replies)
  0 siblings, 5 replies; 27+ messages in thread
From: Leno Hou @ 2026-02-28 16:10 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Leno Hou, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Barry Song, Jialing Wang, Yafang Shao, Yu Zhao

When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
condition exists between the state switching and the memory reclaim
path. This can lead to unexpected cgroup OOM kills, even when plenty of
reclaimable memory is available.

*** Problem Description ***

The issue arises from a "reclaim vacuum" during the transition:

1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
   false before the pages are drained from MGLRU lists back to
   traditional LRU lists.
2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
   and skip the MGLRU path.
3. However, these pages might not have reached the traditional LRU lists
   yet, or the changes are not yet visible to all CPUs due to a lack of
   synchronization.
4. get_scan_count() subsequently finds traditional LRU lists empty,
   concludes there is no reclaimable memory, and triggers an OOM kill.

A similar race can occur during enablement, where the reclaimer sees
the new state but the MGLRU lists haven't been populated via
fill_evictable() yet.

*** Solution ***

Introduce a 'draining' state to bridge the gap during transitions:

- Use smp_store_release() and smp_load_acquire() to ensure the visibility
  of 'enabled' and 'draining' flags across CPUs.
- Modify shrink_lruvec() to allow a "joint reclaim" period. If an lruvec
  is in the 'draining' state, the reclaimer will attempt to scan MGLRU
  lists first, and then fall through to traditional LRU lists instead
  of returning early. This ensures that folios are visible to at least
  one reclaim path at any given time.

*** Reproduction ***

The issue was consistently reproduced on v6.1.157 and v6.18.3 using
a high-pressure memory cgroup (v1) environment.

Reproduction steps:
1. Create a 16GB memcg and populate it with 10GB file cache (5GB active)
   and 8GB active anonymous memory.
2. Toggle MGLRU state while performing new memory allocations to force
   direct reclaim.

Reproduction script:
---
#!/bin/bash
# Fixed reproduction for memcg OOM during MGLRU toggle
set -euo pipefail

MGLRU_FILE="/sys/kernel/mm/lru_gen/enabled"
CGROUP_PATH="/sys/fs/cgroup/memory/memcg_oom_test"

# Switch MGLRU dynamically in the background
switch_mglru() {
    local orig_val=$(cat "$MGLRU_FILE")
    if [[ "$orig_val" != "0x0000" ]]; then
        echo n > "$MGLRU_FILE" &
    else
        echo y > "$MGLRU_FILE" &
    fi
}

# Setup 16G memcg
mkdir -p "$CGROUP_PATH"
echo $((16 * 1024 * 1024 * 1024)) > "$CGROUP_PATH/memory.limit_in_bytes"
echo $$ > "$CGROUP_PATH/cgroup.procs"

# 1. Build memory pressure (File + Anon)
dd if=/dev/urandom of=/tmp/test_file bs=1M count=10240
dd if=/tmp/test_file of=/dev/null bs=1M # Warm up cache

stress-ng --vm 1 --vm-bytes 8G --vm-keep -t 600 &
sleep 5

# 2. Trigger switch and concurrent allocation
switch_mglru
stress-ng --vm 1 --vm-bytes 2G --vm-populate --timeout 5s || echo "OOM Triggered"

# Check OOM counter
grep oom_kill "$CGROUP_PATH/memory.oom_control"
---

Signed-off-by: Leno Hou <lenohou@gmail.com>

---
To: linux-mm@kvack.org
To: linux-kernel@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Barry Song <21cnbao@gmail.com>
Cc: Jialing Wang <wjl.linux@gmail.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
---
 include/linux/mmzone.h |  2 ++
 mm/vmscan.c            | 14 +++++++++++---
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7fb7331c5725..0648ce91dbc6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -509,6 +509,8 @@ struct lru_gen_folio {
 	atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
 	/* whether the multi-gen LRU is enabled */
 	bool enabled;
+	/* whether the multi-gen LRU is draining to LRU */
+	bool draining;
 	/* the memcg generation this lru_gen_folio belongs to */
 	u8 gen;
 	/* the list segment this lru_gen_folio belongs to */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 06071995dacc..629a00681163 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5222,7 +5222,8 @@ static void lru_gen_change_state(bool enabled)
 			VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
 			VM_WARN_ON_ONCE(!state_is_valid(lruvec));
 
-			lruvec->lrugen.enabled = enabled;
+			smp_store_release(&lruvec->lrugen.enabled, enabled);
+			smp_store_release(&lruvec->lrugen.draining, true);
 
 			while (!(enabled ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
 				spin_unlock_irq(&lruvec->lru_lock);
@@ -5230,6 +5231,8 @@ static void lru_gen_change_state(bool enabled)
 				spin_lock_irq(&lruvec->lru_lock);
 			}
 
+			smp_store_release(&lruvec->lrugen.draining, false);
+
 			spin_unlock_irq(&lruvec->lru_lock);
 		}
 
@@ -5813,10 +5816,15 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
 	bool proportional_reclaim;
 	struct blk_plug plug;
+	bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
+	bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
 
-	if (lru_gen_enabled() && !root_reclaim(sc)) {
+	if (lrugen_enabled || lru_draining && !root_reclaim(sc)) {
 		lru_gen_shrink_lruvec(lruvec, sc);
-		return;
+
+		if (!lru_draining)
+			return;
+
 	}
 
 	get_scan_count(lruvec, sc, nr);
-- 
2.52.0



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-02-28 16:10 [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching Leno Hou
@ 2026-02-28 18:58 ` Andrew Morton
  2026-02-28 19:12 ` kernel test robot
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 27+ messages in thread
From: Andrew Morton @ 2026-02-28 18:58 UTC (permalink / raw)
  To: Leno Hou
  Cc: linux-mm, linux-kernel, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Barry Song, Jialing Wang, Yafang Shao, Yu Zhao

On Sun,  1 Mar 2026 00:10:08 +0800 Leno Hou <lenohou@gmail.com> wrote:

> When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
> condition exists between the state switching and the memory reclaim
> path. This can lead to unexpected cgroup OOM kills, even when plenty of
> reclaimable memory is available.
> 
> ...
>

Nice description, thanks.  I'll queue this for testing while we await
comments.

> 
> Reproduction script:
> ---

Please avoid using the ^---$ separator in changelogs - it means "end of
changelog text"!

> Signed-off-by: Leno Hou <lenohou@gmail.com>
> 
> ---

Ditto.




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-02-28 16:10 [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching Leno Hou
  2026-02-28 18:58 ` Andrew Morton
@ 2026-02-28 19:12 ` kernel test robot
  2026-02-28 19:23 ` kernel test robot
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 27+ messages in thread
From: kernel test robot @ 2026-02-28 19:12 UTC (permalink / raw)
  To: Leno Hou, linux-mm, linux-kernel
  Cc: oe-kbuild-all, Leno Hou, Andrew Morton,
	Linux Memory Management List, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Barry Song, Jialing Wang, Yafang Shao, Yu Zhao

Hi Leno,

kernel test robot noticed the following build errors:

[auto build test ERROR on v7.0-rc1]
[also build test ERROR on linus/master next-20260227]
[cannot apply to akpm-mm/mm-everything]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Leno-Hou/mm-mglru-fix-cgroup-OOM-during-MGLRU-state-switching/20260301-001148
base:   v7.0-rc1
patch link:    https://lore.kernel.org/r/20260228161008.707-1-lenohou%40gmail.com
patch subject: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
config: x86_64-randconfig-001-20260301 (https://download.01.org/0day-ci/archive/20260301/202603010315.rTOWjv41-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260301/202603010315.rTOWjv41-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603010315.rTOWjv41-lkp@intel.com/

All error/warnings (new ones prefixed by >>):

   In file included from include/asm-generic/bitops/generic-non-atomic.h:7,
                    from include/linux/bitops.h:28,
                    from include/linux/thread_info.h:27,
                    from include/linux/spinlock.h:60,
                    from include/linux/mmzone.h:8,
                    from include/linux/gfp.h:7,
                    from include/linux/mm.h:8,
                    from mm/vmscan.c:15:
   mm/vmscan.c: In function 'shrink_lruvec':
>> mm/vmscan.c:5785:55: error: 'struct lruvec' has no member named 'lrugen'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                                                       ^~
   arch/x86/include/asm/barrier.h:68:17: note: in definition of macro '__smp_load_acquire'
      68 |         typeof(*p) ___p1 = READ_ONCE(*p);                               \
         |                 ^
   mm/vmscan.c:5785:31: note: in expansion of macro 'smp_load_acquire'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                               ^~~~~~~~~~~~~~~~
   In file included from <command-line>:
>> mm/vmscan.c:5785:55: error: 'struct lruvec' has no member named 'lrugen'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                                                       ^~
   include/linux/compiler_types.h:686:23: note: in definition of macro '__compiletime_assert'
     686 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:706:9: note: in expansion of macro '_compiletime_assert'
     706 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |         ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |                            ^~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/x86/include/asm/barrier.h:68:28: note: in expansion of macro 'READ_ONCE'
      68 |         typeof(*p) ___p1 = READ_ONCE(*p);                               \
         |                            ^~~~~~~~~
   include/asm-generic/barrier.h:176:29: note: in expansion of macro '__smp_load_acquire'
     176 | #define smp_load_acquire(p) __smp_load_acquire(p)
         |                             ^~~~~~~~~~~~~~~~~~
   mm/vmscan.c:5785:31: note: in expansion of macro 'smp_load_acquire'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                               ^~~~~~~~~~~~~~~~
>> mm/vmscan.c:5785:55: error: 'struct lruvec' has no member named 'lrugen'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                                                       ^~
   include/linux/compiler_types.h:686:23: note: in definition of macro '__compiletime_assert'
     686 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:706:9: note: in expansion of macro '_compiletime_assert'
     706 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |         ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |                            ^~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/x86/include/asm/barrier.h:68:28: note: in expansion of macro 'READ_ONCE'
      68 |         typeof(*p) ___p1 = READ_ONCE(*p);                               \
         |                            ^~~~~~~~~
   include/asm-generic/barrier.h:176:29: note: in expansion of macro '__smp_load_acquire'
     176 | #define smp_load_acquire(p) __smp_load_acquire(p)
         |                             ^~~~~~~~~~~~~~~~~~
   mm/vmscan.c:5785:31: note: in expansion of macro 'smp_load_acquire'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                               ^~~~~~~~~~~~~~~~
>> mm/vmscan.c:5785:55: error: 'struct lruvec' has no member named 'lrugen'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                                                       ^~
   include/linux/compiler_types.h:686:23: note: in definition of macro '__compiletime_assert'
     686 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:706:9: note: in expansion of macro '_compiletime_assert'
     706 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |         ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |                            ^~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/x86/include/asm/barrier.h:68:28: note: in expansion of macro 'READ_ONCE'
      68 |         typeof(*p) ___p1 = READ_ONCE(*p);                               \
         |                            ^~~~~~~~~
   include/asm-generic/barrier.h:176:29: note: in expansion of macro '__smp_load_acquire'
     176 | #define smp_load_acquire(p) __smp_load_acquire(p)
         |                             ^~~~~~~~~~~~~~~~~~
   mm/vmscan.c:5785:31: note: in expansion of macro 'smp_load_acquire'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                               ^~~~~~~~~~~~~~~~
>> mm/vmscan.c:5785:55: error: 'struct lruvec' has no member named 'lrugen'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                                                       ^~
   include/linux/compiler_types.h:686:23: note: in definition of macro '__compiletime_assert'
     686 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:706:9: note: in expansion of macro '_compiletime_assert'
     706 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |         ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |                            ^~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/x86/include/asm/barrier.h:68:28: note: in expansion of macro 'READ_ONCE'
      68 |         typeof(*p) ___p1 = READ_ONCE(*p);                               \
         |                            ^~~~~~~~~
   include/asm-generic/barrier.h:176:29: note: in expansion of macro '__smp_load_acquire'
     176 | #define smp_load_acquire(p) __smp_load_acquire(p)
         |                             ^~~~~~~~~~~~~~~~~~
   mm/vmscan.c:5785:31: note: in expansion of macro 'smp_load_acquire'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                               ^~~~~~~~~~~~~~~~
>> mm/vmscan.c:5785:55: error: 'struct lruvec' has no member named 'lrugen'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                                                       ^~
   include/linux/compiler_types.h:686:23: note: in definition of macro '__compiletime_assert'
     686 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:706:9: note: in expansion of macro '_compiletime_assert'
     706 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |         ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/x86/include/asm/barrier.h:68:28: note: in expansion of macro 'READ_ONCE'
      68 |         typeof(*p) ___p1 = READ_ONCE(*p);                               \
         |                            ^~~~~~~~~
   include/asm-generic/barrier.h:176:29: note: in expansion of macro '__smp_load_acquire'
     176 | #define smp_load_acquire(p) __smp_load_acquire(p)
         |                             ^~~~~~~~~~~~~~~~~~
   mm/vmscan.c:5785:31: note: in expansion of macro 'smp_load_acquire'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                               ^~~~~~~~~~~~~~~~
>> mm/vmscan.c:5785:55: error: 'struct lruvec' has no member named 'lrugen'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                                                       ^~
   include/linux/compiler_types.h:642:53: note: in definition of macro '__unqual_scalar_typeof'
     642 | #define __unqual_scalar_typeof(x) __typeof_unqual__(x)
         |                                                     ^
   include/asm-generic/rwonce.h:50:9: note: in expansion of macro '__READ_ONCE'
      50 |         __READ_ONCE(x);                                                 \
         |         ^~~~~~~~~~~
   arch/x86/include/asm/barrier.h:68:28: note: in expansion of macro 'READ_ONCE'
      68 |         typeof(*p) ___p1 = READ_ONCE(*p);                               \
         |                            ^~~~~~~~~
   include/asm-generic/barrier.h:176:29: note: in expansion of macro '__smp_load_acquire'
     176 | #define smp_load_acquire(p) __smp_load_acquire(p)
         |                             ^~~~~~~~~~~~~~~~~~
   mm/vmscan.c:5785:31: note: in expansion of macro 'smp_load_acquire'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                               ^~~~~~~~~~~~~~~~
   In file included from ./arch/x86/include/generated/asm/rwonce.h:1,
                    from include/linux/compiler.h:372,
                    from include/linux/static_call_types.h:7,
                    from arch/x86/include/asm/bug.h:141,
                    from include/linux/bug.h:5,
                    from include/linux/mmdebug.h:5,
                    from include/linux/mm.h:7:
>> mm/vmscan.c:5785:55: error: 'struct lruvec' has no member named 'lrugen'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                                                       ^~
   include/asm-generic/rwonce.h:44:73: note: in definition of macro '__READ_ONCE'
      44 | #define __READ_ONCE(x)  (*(const volatile __unqual_scalar_typeof(x) *)&(x))
         |                                                                         ^
   arch/x86/include/asm/barrier.h:68:28: note: in expansion of macro 'READ_ONCE'
      68 |         typeof(*p) ___p1 = READ_ONCE(*p);                               \
         |                            ^~~~~~~~~
   include/asm-generic/barrier.h:176:29: note: in expansion of macro '__smp_load_acquire'
     176 | #define smp_load_acquire(p) __smp_load_acquire(p)
         |                             ^~~~~~~~~~~~~~~~~~
   mm/vmscan.c:5785:31: note: in expansion of macro 'smp_load_acquire'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                               ^~~~~~~~~~~~~~~~
>> mm/vmscan.c:5785:55: error: 'struct lruvec' has no member named 'lrugen'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                                                       ^~
   include/linux/compiler_types.h:686:23: note: in definition of macro '__compiletime_assert'
     686 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:706:9: note: in expansion of macro '_compiletime_assert'
     706 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/linux/compiler_types.h:709:9: note: in expansion of macro 'compiletime_assert'
     709 |         compiletime_assert(__native_word(t),                            \
         |         ^~~~~~~~~~~~~~~~~~
   include/linux/compiler_types.h:709:28: note: in expansion of macro '__native_word'
     709 |         compiletime_assert(__native_word(t),                            \
         |                            ^~~~~~~~~~~~~
   arch/x86/include/asm/barrier.h:69:9: note: in expansion of macro 'compiletime_assert_atomic_type'
      69 |         compiletime_assert_atomic_type(*p);                             \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/asm-generic/barrier.h:176:29: note: in expansion of macro '__smp_load_acquire'
     176 | #define smp_load_acquire(p) __smp_load_acquire(p)
         |                             ^~~~~~~~~~~~~~~~~~
   mm/vmscan.c:5785:31: note: in expansion of macro 'smp_load_acquire'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                               ^~~~~~~~~~~~~~~~
>> mm/vmscan.c:5785:55: error: 'struct lruvec' has no member named 'lrugen'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                                                       ^~
   include/linux/compiler_types.h:686:23: note: in definition of macro '__compiletime_assert'
     686 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:706:9: note: in expansion of macro '_compiletime_assert'
     706 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/linux/compiler_types.h:709:9: note: in expansion of macro 'compiletime_assert'
     709 |         compiletime_assert(__native_word(t),                            \
         |         ^~~~~~~~~~~~~~~~~~
   include/linux/compiler_types.h:709:28: note: in expansion of macro '__native_word'
     709 |         compiletime_assert(__native_word(t),                            \
         |                            ^~~~~~~~~~~~~
   arch/x86/include/asm/barrier.h:69:9: note: in expansion of macro 'compiletime_assert_atomic_type'
      69 |         compiletime_assert_atomic_type(*p);                             \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/asm-generic/barrier.h:176:29: note: in expansion of macro '__smp_load_acquire'
     176 | #define smp_load_acquire(p) __smp_load_acquire(p)
         |                             ^~~~~~~~~~~~~~~~~~
   mm/vmscan.c:5785:31: note: in expansion of macro 'smp_load_acquire'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                               ^~~~~~~~~~~~~~~~
>> mm/vmscan.c:5785:55: error: 'struct lruvec' has no member named 'lrugen'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                                                       ^~
   include/linux/compiler_types.h:686:23: note: in definition of macro '__compiletime_assert'
     686 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:706:9: note: in expansion of macro '_compiletime_assert'
     706 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/linux/compiler_types.h:709:9: note: in expansion of macro 'compiletime_assert'
     709 |         compiletime_assert(__native_word(t),                            \
         |         ^~~~~~~~~~~~~~~~~~
   include/linux/compiler_types.h:709:28: note: in expansion of macro '__native_word'
     709 |         compiletime_assert(__native_word(t),                            \
         |                            ^~~~~~~~~~~~~
   arch/x86/include/asm/barrier.h:69:9: note: in expansion of macro 'compiletime_assert_atomic_type'
      69 |         compiletime_assert_atomic_type(*p);                             \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/asm-generic/barrier.h:176:29: note: in expansion of macro '__smp_load_acquire'
     176 | #define smp_load_acquire(p) __smp_load_acquire(p)
         |                             ^~~~~~~~~~~~~~~~~~
   mm/vmscan.c:5785:31: note: in expansion of macro 'smp_load_acquire'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                               ^~~~~~~~~~~~~~~~
>> mm/vmscan.c:5785:55: error: 'struct lruvec' has no member named 'lrugen'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                                                       ^~
   include/linux/compiler_types.h:686:23: note: in definition of macro '__compiletime_assert'
     686 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:706:9: note: in expansion of macro '_compiletime_assert'
     706 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/linux/compiler_types.h:709:9: note: in expansion of macro 'compiletime_assert'
     709 |         compiletime_assert(__native_word(t),                            \
         |         ^~~~~~~~~~~~~~~~~~
   include/linux/compiler_types.h:709:28: note: in expansion of macro '__native_word'
     709 |         compiletime_assert(__native_word(t),                            \
         |                            ^~~~~~~~~~~~~
   arch/x86/include/asm/barrier.h:69:9: note: in expansion of macro 'compiletime_assert_atomic_type'
      69 |         compiletime_assert_atomic_type(*p);                             \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/asm-generic/barrier.h:176:29: note: in expansion of macro '__smp_load_acquire'
     176 | #define smp_load_acquire(p) __smp_load_acquire(p)
         |                             ^~~~~~~~~~~~~~~~~~
   mm/vmscan.c:5785:31: note: in expansion of macro 'smp_load_acquire'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                               ^~~~~~~~~~~~~~~~
   mm/vmscan.c:5786:53: error: 'struct lruvec' has no member named 'lrugen'
    5786 |         bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
         |                                                     ^~
   arch/x86/include/asm/barrier.h:68:17: note: in definition of macro '__smp_load_acquire'
      68 |         typeof(*p) ___p1 = READ_ONCE(*p);                               \
         |                 ^
   mm/vmscan.c:5786:29: note: in expansion of macro 'smp_load_acquire'
    5786 |         bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
         |                             ^~~~~~~~~~~~~~~~
   mm/vmscan.c:5786:53: error: 'struct lruvec' has no member named 'lrugen'
    5786 |         bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
         |                                                     ^~
   include/linux/compiler_types.h:686:23: note: in definition of macro '__compiletime_assert'
     686 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:706:9: note: in expansion of macro '_compiletime_assert'
     706 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |         ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |                            ^~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/x86/include/asm/barrier.h:68:28: note: in expansion of macro 'READ_ONCE'
      68 |         typeof(*p) ___p1 = READ_ONCE(*p);                               \
         |                            ^~~~~~~~~
   include/asm-generic/barrier.h:176:29: note: in expansion of macro '__smp_load_acquire'
     176 | #define smp_load_acquire(p) __smp_load_acquire(p)
         |                             ^~~~~~~~~~~~~~~~~~
   mm/vmscan.c:5786:29: note: in expansion of macro 'smp_load_acquire'
    5786 |         bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
         |                             ^~~~~~~~~~~~~~~~
   mm/vmscan.c:5786:53: error: 'struct lruvec' has no member named 'lrugen'
    5786 |         bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
         |                                                     ^~
   include/linux/compiler_types.h:686:23: note: in definition of macro '__compiletime_assert'
     686 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:706:9: note: in expansion of macro '_compiletime_assert'
     706 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |         ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |                            ^~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:9: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |         compiletime_assert_rwonce_type(x);                              \
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/x86/include/asm/barrier.h:68:28: note: in expansion of macro 'READ_ONCE'
      68 |         typeof(*p) ___p1 = READ_ONCE(*p);                               \
         |                            ^~~~~~~~~
   include/asm-generic/barrier.h:176:29: note: in expansion of macro '__smp_load_acquire'
     176 | #define smp_load_acquire(p) __smp_load_acquire(p)
         |                             ^~~~~~~~~~~~~~~~~~
   mm/vmscan.c:5786:29: note: in expansion of macro 'smp_load_acquire'
    5786 |         bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
         |                             ^~~~~~~~~~~~~~~~
   mm/vmscan.c:5786:53: error: 'struct lruvec' has no member named 'lrugen'
    5786 |         bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
         |                                                     ^~
   include/linux/compiler_types.h:686:23: note: in definition of macro '__compiletime_assert'
     686 |                 if (!(condition))                                       \
         |                       ^~~~~~~~~
   include/linux/compiler_types.h:706:9: note: in expansion of macro '_compiletime_assert'
     706 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:9: note: in expansion of macro 'compiletime_assert'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
         |         ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:28: note: in expansion of macro '__native_word'
      36 |         compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
..


vim +5785 mm/vmscan.c

  5774	
  5775	static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
  5776	{
  5777		unsigned long nr[NR_LRU_LISTS];
  5778		unsigned long targets[NR_LRU_LISTS];
  5779		unsigned long nr_to_scan;
  5780		enum lru_list lru;
  5781		unsigned long nr_reclaimed = 0;
  5782		unsigned long nr_to_reclaim = sc->nr_to_reclaim;
  5783		bool proportional_reclaim;
  5784		struct blk_plug plug;
> 5785		bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
  5786		bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
  5787	
> 5788		if (lrugen_enabled || lru_draining && !root_reclaim(sc)) {
  5789			lru_gen_shrink_lruvec(lruvec, sc);
  5790	
  5791			if (!lru_draining)
  5792				return;
  5793	
  5794		}
  5795	
  5796		get_scan_count(lruvec, sc, nr);
  5797	
  5798		/* Record the original scan target for proportional adjustments later */
  5799		memcpy(targets, nr, sizeof(nr));
  5800	
  5801		/*
  5802		 * Global reclaiming within direct reclaim at DEF_PRIORITY is a normal
  5803		 * event that can occur when there is little memory pressure e.g.
  5804		 * multiple streaming readers/writers. Hence, we do not abort scanning
  5805		 * when the requested number of pages are reclaimed when scanning at
  5806		 * DEF_PRIORITY on the assumption that the fact we are direct
  5807		 * reclaiming implies that kswapd is not keeping up and it is best to
  5808		 * do a batch of work at once. For memcg reclaim one check is made to
  5809		 * abort proportional reclaim if either the file or anon lru has already
  5810		 * dropped to zero at the first pass.
  5811		 */
  5812		proportional_reclaim = (!cgroup_reclaim(sc) && !current_is_kswapd() &&
  5813					sc->priority == DEF_PRIORITY);
  5814	
  5815		blk_start_plug(&plug);
  5816		while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
  5817						nr[LRU_INACTIVE_FILE]) {
  5818			unsigned long nr_anon, nr_file, percentage;
  5819			unsigned long nr_scanned;
  5820	
  5821			for_each_evictable_lru(lru) {
  5822				if (nr[lru]) {
  5823					nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
  5824					nr[lru] -= nr_to_scan;
  5825	
  5826					nr_reclaimed += shrink_list(lru, nr_to_scan,
  5827								    lruvec, sc);
  5828				}
  5829			}
  5830	
  5831			cond_resched();
  5832	
  5833			if (nr_reclaimed < nr_to_reclaim || proportional_reclaim)
  5834				continue;
  5835	
  5836			/*
  5837			 * For kswapd and memcg, reclaim at least the number of pages
  5838			 * requested. Ensure that the anon and file LRUs are scanned
  5839			 * proportionally what was requested by get_scan_count(). We
  5840			 * stop reclaiming one LRU and reduce the amount scanning
  5841			 * proportional to the original scan target.
  5842			 */
  5843			nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE];
  5844			nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON];
  5845	
  5846			/*
  5847			 * It's just vindictive to attack the larger once the smaller
  5848			 * has gone to zero.  And given the way we stop scanning the
  5849			 * smaller below, this makes sure that we only make one nudge
  5850			 * towards proportionality once we've got nr_to_reclaim.
  5851			 */
  5852			if (!nr_file || !nr_anon)
  5853				break;
  5854	
  5855			if (nr_file > nr_anon) {
  5856				unsigned long scan_target = targets[LRU_INACTIVE_ANON] +
  5857							targets[LRU_ACTIVE_ANON] + 1;
  5858				lru = LRU_BASE;
  5859				percentage = nr_anon * 100 / scan_target;
  5860			} else {
  5861				unsigned long scan_target = targets[LRU_INACTIVE_FILE] +
  5862							targets[LRU_ACTIVE_FILE] + 1;
  5863				lru = LRU_FILE;
  5864				percentage = nr_file * 100 / scan_target;
  5865			}
  5866	
  5867			/* Stop scanning the smaller of the LRU */
  5868			nr[lru] = 0;
  5869			nr[lru + LRU_ACTIVE] = 0;
  5870	
  5871			/*
  5872			 * Recalculate the other LRU scan count based on its original
  5873			 * scan target and the percentage scanning already complete
  5874			 */
  5875			lru = (lru == LRU_FILE) ? LRU_BASE : LRU_FILE;
  5876			nr_scanned = targets[lru] - nr[lru];
  5877			nr[lru] = targets[lru] * (100 - percentage) / 100;
  5878			nr[lru] -= min(nr[lru], nr_scanned);
  5879	
  5880			lru += LRU_ACTIVE;
  5881			nr_scanned = targets[lru] - nr[lru];
  5882			nr[lru] = targets[lru] * (100 - percentage) / 100;
  5883			nr[lru] -= min(nr[lru], nr_scanned);
  5884		}
  5885		blk_finish_plug(&plug);
  5886		sc->nr_reclaimed += nr_reclaimed;
  5887	
  5888		/*
  5889		 * Even if we did not try to evict anon pages at all, we want to
  5890		 * rebalance the anon lru active/inactive ratio.
  5891		 */
  5892		if (can_age_anon_pages(lruvec, sc) &&
  5893		    inactive_is_low(lruvec, LRU_INACTIVE_ANON))
  5894			shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
  5895					   sc, LRU_ACTIVE_ANON);
  5896	}
  5897	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-02-28 16:10 [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching Leno Hou
  2026-02-28 18:58 ` Andrew Morton
  2026-02-28 19:12 ` kernel test robot
@ 2026-02-28 19:23 ` kernel test robot
  2026-02-28 20:15 ` kernel test robot
  2026-02-28 21:28 ` Barry Song
  4 siblings, 0 replies; 27+ messages in thread
From: kernel test robot @ 2026-02-28 19:23 UTC (permalink / raw)
  To: Leno Hou, linux-mm, linux-kernel
  Cc: llvm, oe-kbuild-all, Leno Hou, Andrew Morton,
	Linux Memory Management List, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Barry Song, Jialing Wang, Yafang Shao, Yu Zhao

Hi Leno,

kernel test robot noticed the following build warnings:

[auto build test WARNING on v7.0-rc1]
[also build test WARNING on linus/master next-20260227]
[cannot apply to akpm-mm/mm-everything]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Leno-Hou/mm-mglru-fix-cgroup-OOM-during-MGLRU-state-switching/20260301-001148
base:   v7.0-rc1
patch link:    https://lore.kernel.org/r/20260228161008.707-1-lenohou%40gmail.com
patch subject: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
config: um-defconfig (https://download.01.org/0day-ci/archive/20260301/202603010300.t6GYRWjK-lkp@intel.com/config)
compiler: clang version 23.0.0git (https://github.com/llvm/llvm-project 9a109fbb6e184ec9bcce10615949f598f4c974a9)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260301/202603010300.t6GYRWjK-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603010300.t6GYRWjK-lkp@intel.com/

All warnings (new ones prefixed by >>):

   mm/vmscan.c:5785:50: error: no member named 'lrugen' in 'struct lruvec'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                                                 ~~~~~~  ^
   mm/vmscan.c:5785:50: error: no member named 'lrugen' in 'struct lruvec'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                                                 ~~~~~~  ^
   mm/vmscan.c:5785:50: error: no member named 'lrugen' in 'struct lruvec'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                                                 ~~~~~~  ^
   mm/vmscan.c:5785:50: error: no member named 'lrugen' in 'struct lruvec'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                                                 ~~~~~~  ^
   mm/vmscan.c:5785:50: error: no member named 'lrugen' in 'struct lruvec'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                                                 ~~~~~~  ^
   mm/vmscan.c:5785:50: error: no member named 'lrugen' in 'struct lruvec'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                                                 ~~~~~~  ^
   mm/vmscan.c:5785:50: error: no member named 'lrugen' in 'struct lruvec'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                                                 ~~~~~~  ^
   mm/vmscan.c:5785:50: error: no member named 'lrugen' in 'struct lruvec'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                                                 ~~~~~~  ^
   mm/vmscan.c:5785:50: error: no member named 'lrugen' in 'struct lruvec'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                                                 ~~~~~~  ^
   mm/vmscan.c:5786:48: error: no member named 'lrugen' in 'struct lruvec'
    5786 |         bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
         |                                               ~~~~~~  ^
   mm/vmscan.c:5786:48: error: no member named 'lrugen' in 'struct lruvec'
    5786 |         bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
         |                                               ~~~~~~  ^
   mm/vmscan.c:5786:48: error: no member named 'lrugen' in 'struct lruvec'
    5786 |         bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
         |                                               ~~~~~~  ^
   mm/vmscan.c:5786:48: error: no member named 'lrugen' in 'struct lruvec'
    5786 |         bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
         |                                               ~~~~~~  ^
   mm/vmscan.c:5786:48: error: no member named 'lrugen' in 'struct lruvec'
    5786 |         bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
         |                                               ~~~~~~  ^
   mm/vmscan.c:5786:48: error: no member named 'lrugen' in 'struct lruvec'
    5786 |         bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
         |                                               ~~~~~~  ^
   mm/vmscan.c:5786:48: error: no member named 'lrugen' in 'struct lruvec'
    5786 |         bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
         |                                               ~~~~~~  ^
   mm/vmscan.c:5786:48: error: no member named 'lrugen' in 'struct lruvec'
    5786 |         bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
         |                                               ~~~~~~  ^
   mm/vmscan.c:5786:48: error: no member named 'lrugen' in 'struct lruvec'
    5786 |         bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
         |                                               ~~~~~~  ^
>> mm/vmscan.c:5788:37: warning: '&&' within '||' [-Wlogical-op-parentheses]
    5788 |         if (lrugen_enabled || lru_draining && !root_reclaim(sc)) {
         |                            ~~ ~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~
   mm/vmscan.c:5788:37: note: place parentheses around the '&&' expression to silence this warning
    5788 |         if (lrugen_enabled || lru_draining && !root_reclaim(sc)) {
         |                                            ^                   
         |                               (                                )
   1 warning and 18 errors generated.


vim +5788 mm/vmscan.c

  5774	
  5775	static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
  5776	{
  5777		unsigned long nr[NR_LRU_LISTS];
  5778		unsigned long targets[NR_LRU_LISTS];
  5779		unsigned long nr_to_scan;
  5780		enum lru_list lru;
  5781		unsigned long nr_reclaimed = 0;
  5782		unsigned long nr_to_reclaim = sc->nr_to_reclaim;
  5783		bool proportional_reclaim;
  5784		struct blk_plug plug;
  5785		bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
  5786		bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
  5787	
> 5788		if (lrugen_enabled || lru_draining && !root_reclaim(sc)) {
  5789			lru_gen_shrink_lruvec(lruvec, sc);
  5790	
  5791			if (!lru_draining)
  5792				return;
  5793	
  5794		}
  5795	
  5796		get_scan_count(lruvec, sc, nr);
  5797	
  5798		/* Record the original scan target for proportional adjustments later */
  5799		memcpy(targets, nr, sizeof(nr));
  5800	
  5801		/*
  5802		 * Global reclaiming within direct reclaim at DEF_PRIORITY is a normal
  5803		 * event that can occur when there is little memory pressure e.g.
  5804		 * multiple streaming readers/writers. Hence, we do not abort scanning
  5805		 * when the requested number of pages are reclaimed when scanning at
  5806		 * DEF_PRIORITY on the assumption that the fact we are direct
  5807		 * reclaiming implies that kswapd is not keeping up and it is best to
  5808		 * do a batch of work at once. For memcg reclaim one check is made to
  5809		 * abort proportional reclaim if either the file or anon lru has already
  5810		 * dropped to zero at the first pass.
  5811		 */
  5812		proportional_reclaim = (!cgroup_reclaim(sc) && !current_is_kswapd() &&
  5813					sc->priority == DEF_PRIORITY);
  5814	
  5815		blk_start_plug(&plug);
  5816		while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
  5817						nr[LRU_INACTIVE_FILE]) {
  5818			unsigned long nr_anon, nr_file, percentage;
  5819			unsigned long nr_scanned;
  5820	
  5821			for_each_evictable_lru(lru) {
  5822				if (nr[lru]) {
  5823					nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
  5824					nr[lru] -= nr_to_scan;
  5825	
  5826					nr_reclaimed += shrink_list(lru, nr_to_scan,
  5827								    lruvec, sc);
  5828				}
  5829			}
  5830	
  5831			cond_resched();
  5832	
  5833			if (nr_reclaimed < nr_to_reclaim || proportional_reclaim)
  5834				continue;
  5835	
  5836			/*
  5837			 * For kswapd and memcg, reclaim at least the number of pages
  5838			 * requested. Ensure that the anon and file LRUs are scanned
  5839			 * proportionally what was requested by get_scan_count(). We
  5840			 * stop reclaiming one LRU and reduce the amount scanning
  5841			 * proportional to the original scan target.
  5842			 */
  5843			nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE];
  5844			nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON];
  5845	
  5846			/*
  5847			 * It's just vindictive to attack the larger once the smaller
  5848			 * has gone to zero.  And given the way we stop scanning the
  5849			 * smaller below, this makes sure that we only make one nudge
  5850			 * towards proportionality once we've got nr_to_reclaim.
  5851			 */
  5852			if (!nr_file || !nr_anon)
  5853				break;
  5854	
  5855			if (nr_file > nr_anon) {
  5856				unsigned long scan_target = targets[LRU_INACTIVE_ANON] +
  5857							targets[LRU_ACTIVE_ANON] + 1;
  5858				lru = LRU_BASE;
  5859				percentage = nr_anon * 100 / scan_target;
  5860			} else {
  5861				unsigned long scan_target = targets[LRU_INACTIVE_FILE] +
  5862							targets[LRU_ACTIVE_FILE] + 1;
  5863				lru = LRU_FILE;
  5864				percentage = nr_file * 100 / scan_target;
  5865			}
  5866	
  5867			/* Stop scanning the smaller of the LRU */
  5868			nr[lru] = 0;
  5869			nr[lru + LRU_ACTIVE] = 0;
  5870	
  5871			/*
  5872			 * Recalculate the other LRU scan count based on its original
  5873			 * scan target and the percentage scanning already complete
  5874			 */
  5875			lru = (lru == LRU_FILE) ? LRU_BASE : LRU_FILE;
  5876			nr_scanned = targets[lru] - nr[lru];
  5877			nr[lru] = targets[lru] * (100 - percentage) / 100;
  5878			nr[lru] -= min(nr[lru], nr_scanned);
  5879	
  5880			lru += LRU_ACTIVE;
  5881			nr_scanned = targets[lru] - nr[lru];
  5882			nr[lru] = targets[lru] * (100 - percentage) / 100;
  5883			nr[lru] -= min(nr[lru], nr_scanned);
  5884		}
  5885		blk_finish_plug(&plug);
  5886		sc->nr_reclaimed += nr_reclaimed;
  5887	
  5888		/*
  5889		 * Even if we did not try to evict anon pages at all, we want to
  5890		 * rebalance the anon lru active/inactive ratio.
  5891		 */
  5892		if (can_age_anon_pages(lruvec, sc) &&
  5893		    inactive_is_low(lruvec, LRU_INACTIVE_ANON))
  5894			shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
  5895					   sc, LRU_ACTIVE_ANON);
  5896	}
  5897	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-02-28 16:10 [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching Leno Hou
                   ` (2 preceding siblings ...)
  2026-02-28 19:23 ` kernel test robot
@ 2026-02-28 20:15 ` kernel test robot
  2026-02-28 21:28 ` Barry Song
  4 siblings, 0 replies; 27+ messages in thread
From: kernel test robot @ 2026-02-28 20:15 UTC (permalink / raw)
  To: Leno Hou, linux-mm, linux-kernel
  Cc: llvm, oe-kbuild-all, Leno Hou, Andrew Morton,
	Linux Memory Management List, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Barry Song, Jialing Wang, Yafang Shao, Yu Zhao

Hi Leno,

kernel test robot noticed the following build errors:

[auto build test ERROR on v7.0-rc1]
[also build test ERROR on linus/master next-20260227]
[cannot apply to akpm-mm/mm-everything]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Leno-Hou/mm-mglru-fix-cgroup-OOM-during-MGLRU-state-switching/20260301-001148
base:   v7.0-rc1
patch link:    https://lore.kernel.org/r/20260228161008.707-1-lenohou%40gmail.com
patch subject: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
config: um-defconfig (https://download.01.org/0day-ci/archive/20260301/202603010435.MBtvBCTp-lkp@intel.com/config)
compiler: clang version 23.0.0git (https://github.com/llvm/llvm-project 9a109fbb6e184ec9bcce10615949f598f4c974a9)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260301/202603010435.MBtvBCTp-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603010435.MBtvBCTp-lkp@intel.com/

All errors (new ones prefixed by >>):

>> mm/vmscan.c:5785:50: error: no member named 'lrugen' in 'struct lruvec'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                                                 ~~~~~~  ^
>> mm/vmscan.c:5785:50: error: no member named 'lrugen' in 'struct lruvec'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                                                 ~~~~~~  ^
>> mm/vmscan.c:5785:50: error: no member named 'lrugen' in 'struct lruvec'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                                                 ~~~~~~  ^
>> mm/vmscan.c:5785:50: error: no member named 'lrugen' in 'struct lruvec'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                                                 ~~~~~~  ^
>> mm/vmscan.c:5785:50: error: no member named 'lrugen' in 'struct lruvec'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                                                 ~~~~~~  ^
>> mm/vmscan.c:5785:50: error: no member named 'lrugen' in 'struct lruvec'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                                                 ~~~~~~  ^
>> mm/vmscan.c:5785:50: error: no member named 'lrugen' in 'struct lruvec'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                                                 ~~~~~~  ^
>> mm/vmscan.c:5785:50: error: no member named 'lrugen' in 'struct lruvec'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                                                 ~~~~~~  ^
>> mm/vmscan.c:5785:50: error: no member named 'lrugen' in 'struct lruvec'
    5785 |         bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
         |                                                 ~~~~~~  ^
   mm/vmscan.c:5786:48: error: no member named 'lrugen' in 'struct lruvec'
    5786 |         bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
         |                                               ~~~~~~  ^
   mm/vmscan.c:5786:48: error: no member named 'lrugen' in 'struct lruvec'
    5786 |         bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
         |                                               ~~~~~~  ^
   mm/vmscan.c:5786:48: error: no member named 'lrugen' in 'struct lruvec'
    5786 |         bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
         |                                               ~~~~~~  ^
   mm/vmscan.c:5786:48: error: no member named 'lrugen' in 'struct lruvec'
    5786 |         bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
         |                                               ~~~~~~  ^
   mm/vmscan.c:5786:48: error: no member named 'lrugen' in 'struct lruvec'
    5786 |         bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
         |                                               ~~~~~~  ^
   mm/vmscan.c:5786:48: error: no member named 'lrugen' in 'struct lruvec'
    5786 |         bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
         |                                               ~~~~~~  ^
   mm/vmscan.c:5786:48: error: no member named 'lrugen' in 'struct lruvec'
    5786 |         bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
         |                                               ~~~~~~  ^
   mm/vmscan.c:5786:48: error: no member named 'lrugen' in 'struct lruvec'
    5786 |         bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
         |                                               ~~~~~~  ^
   mm/vmscan.c:5786:48: error: no member named 'lrugen' in 'struct lruvec'
    5786 |         bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
         |                                               ~~~~~~  ^
   mm/vmscan.c:5788:37: warning: '&&' within '||' [-Wlogical-op-parentheses]
    5788 |         if (lrugen_enabled || lru_draining && !root_reclaim(sc)) {
         |                            ~~ ~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~
   mm/vmscan.c:5788:37: note: place parentheses around the '&&' expression to silence this warning
    5788 |         if (lrugen_enabled || lru_draining && !root_reclaim(sc)) {
         |                                            ^                   
         |                               (                                )
   1 warning and 18 errors generated.


vim +5785 mm/vmscan.c

  5774	
  5775	static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
  5776	{
  5777		unsigned long nr[NR_LRU_LISTS];
  5778		unsigned long targets[NR_LRU_LISTS];
  5779		unsigned long nr_to_scan;
  5780		enum lru_list lru;
  5781		unsigned long nr_reclaimed = 0;
  5782		unsigned long nr_to_reclaim = sc->nr_to_reclaim;
  5783		bool proportional_reclaim;
  5784		struct blk_plug plug;
> 5785		bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
  5786		bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
  5787	
  5788		if (lrugen_enabled || lru_draining && !root_reclaim(sc)) {
  5789			lru_gen_shrink_lruvec(lruvec, sc);
  5790	
  5791			if (!lru_draining)
  5792				return;
  5793	
  5794		}
  5795	
  5796		get_scan_count(lruvec, sc, nr);
  5797	
  5798		/* Record the original scan target for proportional adjustments later */
  5799		memcpy(targets, nr, sizeof(nr));
  5800	
  5801		/*
  5802		 * Global reclaiming within direct reclaim at DEF_PRIORITY is a normal
  5803		 * event that can occur when there is little memory pressure e.g.
  5804		 * multiple streaming readers/writers. Hence, we do not abort scanning
  5805		 * when the requested number of pages are reclaimed when scanning at
  5806		 * DEF_PRIORITY on the assumption that the fact we are direct
  5807		 * reclaiming implies that kswapd is not keeping up and it is best to
  5808		 * do a batch of work at once. For memcg reclaim one check is made to
  5809		 * abort proportional reclaim if either the file or anon lru has already
  5810		 * dropped to zero at the first pass.
  5811		 */
  5812		proportional_reclaim = (!cgroup_reclaim(sc) && !current_is_kswapd() &&
  5813					sc->priority == DEF_PRIORITY);
  5814	
  5815		blk_start_plug(&plug);
  5816		while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
  5817						nr[LRU_INACTIVE_FILE]) {
  5818			unsigned long nr_anon, nr_file, percentage;
  5819			unsigned long nr_scanned;
  5820	
  5821			for_each_evictable_lru(lru) {
  5822				if (nr[lru]) {
  5823					nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
  5824					nr[lru] -= nr_to_scan;
  5825	
  5826					nr_reclaimed += shrink_list(lru, nr_to_scan,
  5827								    lruvec, sc);
  5828				}
  5829			}
  5830	
  5831			cond_resched();
  5832	
  5833			if (nr_reclaimed < nr_to_reclaim || proportional_reclaim)
  5834				continue;
  5835	
  5836			/*
  5837			 * For kswapd and memcg, reclaim at least the number of pages
  5838			 * requested. Ensure that the anon and file LRUs are scanned
  5839			 * proportionally what was requested by get_scan_count(). We
  5840			 * stop reclaiming one LRU and reduce the amount scanning
  5841			 * proportional to the original scan target.
  5842			 */
  5843			nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE];
  5844			nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON];
  5845	
  5846			/*
  5847			 * It's just vindictive to attack the larger once the smaller
  5848			 * has gone to zero.  And given the way we stop scanning the
  5849			 * smaller below, this makes sure that we only make one nudge
  5850			 * towards proportionality once we've got nr_to_reclaim.
  5851			 */
  5852			if (!nr_file || !nr_anon)
  5853				break;
  5854	
  5855			if (nr_file > nr_anon) {
  5856				unsigned long scan_target = targets[LRU_INACTIVE_ANON] +
  5857							targets[LRU_ACTIVE_ANON] + 1;
  5858				lru = LRU_BASE;
  5859				percentage = nr_anon * 100 / scan_target;
  5860			} else {
  5861				unsigned long scan_target = targets[LRU_INACTIVE_FILE] +
  5862							targets[LRU_ACTIVE_FILE] + 1;
  5863				lru = LRU_FILE;
  5864				percentage = nr_file * 100 / scan_target;
  5865			}
  5866	
  5867			/* Stop scanning the smaller of the LRU */
  5868			nr[lru] = 0;
  5869			nr[lru + LRU_ACTIVE] = 0;
  5870	
  5871			/*
  5872			 * Recalculate the other LRU scan count based on its original
  5873			 * scan target and the percentage scanning already complete
  5874			 */
  5875			lru = (lru == LRU_FILE) ? LRU_BASE : LRU_FILE;
  5876			nr_scanned = targets[lru] - nr[lru];
  5877			nr[lru] = targets[lru] * (100 - percentage) / 100;
  5878			nr[lru] -= min(nr[lru], nr_scanned);
  5879	
  5880			lru += LRU_ACTIVE;
  5881			nr_scanned = targets[lru] - nr[lru];
  5882			nr[lru] = targets[lru] * (100 - percentage) / 100;
  5883			nr[lru] -= min(nr[lru], nr_scanned);
  5884		}
  5885		blk_finish_plug(&plug);
  5886		sc->nr_reclaimed += nr_reclaimed;
  5887	
  5888		/*
  5889		 * Even if we did not try to evict anon pages at all, we want to
  5890		 * rebalance the anon lru active/inactive ratio.
  5891		 */
  5892		if (can_age_anon_pages(lruvec, sc) &&
  5893		    inactive_is_low(lruvec, LRU_INACTIVE_ANON))
  5894			shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
  5895					   sc, LRU_ACTIVE_ANON);
  5896	}
  5897	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-02-28 16:10 [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching Leno Hou
                   ` (3 preceding siblings ...)
  2026-02-28 20:15 ` kernel test robot
@ 2026-02-28 21:28 ` Barry Song
  2026-02-28 22:41   ` Barry Song
  2026-03-02  5:50   ` Yafang Shao
  4 siblings, 2 replies; 27+ messages in thread
From: Barry Song @ 2026-02-28 21:28 UTC (permalink / raw)
  To: lenohou
  Cc: 21cnbao, akpm, axelrasmussen, laoar.shao, linux-kernel, linux-mm,
	weixugc, wjl.linux, yuanchu, yuzhao

On Sun, Mar 1, 2026 at 12:10 AM Leno Hou <lenohou@gmail.com> wrote:
>
> When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
> condition exists between the state switching and the memory reclaim
> path. This can lead to unexpected cgroup OOM kills, even when plenty of
> reclaimable memory is available.
>
> *** Problem Description ***
>
> The issue arises from a "reclaim vacuum" during the transition:
>
> 1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
>    false before the pages are drained from MGLRU lists back to
>    traditional LRU lists.
> 2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
>    and skip the MGLRU path.
> 3. However, these pages might not have reached the traditional LRU lists
>    yet, or the changes are not yet visible to all CPUs due to a lack of
>    synchronization.
> 4. get_scan_count() subsequently finds traditional LRU lists empty,
>    concludes there is no reclaimable memory, and triggers an OOM kill.
>
> A similar race can occur during enablement, where the reclaimer sees
> the new state but the MGLRU lists haven't been populated via
> fill_evictable() yet.
>
> *** Solution ***
>
> Introduce a 'draining' state to bridge the gap during transitions:
>
> - Use smp_store_release() and smp_load_acquire() to ensure the visibility
>   of 'enabled' and 'draining' flags across CPUs.
> - Modify shrink_lruvec() to allow a "joint reclaim" period. If an lruvec
>   is in the 'draining' state, the reclaimer will attempt to scan MGLRU
>   lists first, and then fall through to traditional LRU lists instead
>   of returning early. This ensures that folios are visible to at least
>   one reclaim path at any given time.
>
> *** Reproduction ***
>
> The issue was consistently reproduced on v6.1.157 and v6.18.3 using
> a high-pressure memory cgroup (v1) environment.
>
> Reproduction steps:
> 1. Create a 16GB memcg and populate it with 10GB file cache (5GB active)
>    and 8GB active anonymous memory.
> 2. Toggle MGLRU state while performing new memory allocations to force
>    direct reclaim.
>
> Reproduction script:
> ---
> #!/bin/bash
> # Fixed reproduction for memcg OOM during MGLRU toggle
> set -euo pipefail
>
> MGLRU_FILE="/sys/kernel/mm/lru_gen/enabled"
> CGROUP_PATH="/sys/fs/cgroup/memory/memcg_oom_test"
>
> # Switch MGLRU dynamically in the background
> switch_mglru() {
>     local orig_val=$(cat "$MGLRU_FILE")
>     if [[ "$orig_val" != "0x0000" ]]; then
>         echo n > "$MGLRU_FILE" &
>     else
>         echo y > "$MGLRU_FILE" &
>     fi
> }
>
> # Setup 16G memcg
> mkdir -p "$CGROUP_PATH"
> echo $((16 * 1024 * 1024 * 1024)) > "$CGROUP_PATH/memory.limit_in_bytes"
> echo $$ > "$CGROUP_PATH/cgroup.procs"
>
> # 1. Build memory pressure (File + Anon)
> dd if=/dev/urandom of=/tmp/test_file bs=1M count=10240
> dd if=/tmp/test_file of=/dev/null bs=1M # Warm up cache
>
> stress-ng --vm 1 --vm-bytes 8G --vm-keep -t 600 &
> sleep 5
>
> # 2. Trigger switch and concurrent allocation
> switch_mglru
> stress-ng --vm 1 --vm-bytes 2G --vm-populate --timeout 5s || echo "OOM Triggered"
>
> # Check OOM counter
> grep oom_kill "$CGROUP_PATH/memory.oom_control"
> ---
>
> Signed-off-by: Leno Hou <lenohou@gmail.com>
>
> ---
> To: linux-mm@kvack.org
> To: linux-kernel@vger.kernel.org
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Yuanchu Xie <yuanchu@google.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: Barry Song <21cnbao@gmail.com>
> Cc: Jialing Wang <wjl.linux@gmail.com>
> Cc: Yafang Shao <laoar.shao@gmail.com>
> Cc: Yu Zhao <yuzhao@google.com>
> ---
>  include/linux/mmzone.h |  2 ++
>  mm/vmscan.c            | 14 +++++++++++---
>  2 files changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 7fb7331c5725..0648ce91dbc6 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -509,6 +509,8 @@ struct lru_gen_folio {
>         atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
>         /* whether the multi-gen LRU is enabled */
>         bool enabled;
> +       /* whether the multi-gen LRU is draining to LRU */
> +       bool draining;
>         /* the memcg generation this lru_gen_folio belongs to */
>         u8 gen;
>         /* the list segment this lru_gen_folio belongs to */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 06071995dacc..629a00681163 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -5222,7 +5222,8 @@ static void lru_gen_change_state(bool enabled)
>                         VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
>                         VM_WARN_ON_ONCE(!state_is_valid(lruvec));
>
> -                       lruvec->lrugen.enabled = enabled;
> +                       smp_store_release(&lruvec->lrugen.enabled, enabled);
> +                       smp_store_release(&lruvec->lrugen.draining, true);
>
>                         while (!(enabled ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
>                                 spin_unlock_irq(&lruvec->lru_lock);
> @@ -5230,6 +5231,8 @@ static void lru_gen_change_state(bool enabled)
>                                 spin_lock_irq(&lruvec->lru_lock);
>                         }
>
> +                       smp_store_release(&lruvec->lrugen.draining, false);
> +
>                         spin_unlock_irq(&lruvec->lru_lock);
>                 }
>
> @@ -5813,10 +5816,15 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>         unsigned long nr_to_reclaim = sc->nr_to_reclaim;
>         bool proportional_reclaim;
>         struct blk_plug plug;
> +       bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> +       bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
>
> -       if (lru_gen_enabled() && !root_reclaim(sc)) {
> +       if (lrugen_enabled || lru_draining && !root_reclaim(sc)) {
>                 lru_gen_shrink_lruvec(lruvec, sc);
> -               return;

Is it possible to simply wait for draining to finish instead of performing
an lru_gen/lru shrink while lru_gen is being disabled or enabled?

Performing a shrink in an intermediate state may still involve a lot of
uncertainty, depending on how far the shrink has progressed and how much
remains in each side’s LRU？

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3e51190a55e4..ba306e986050 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -509,6 +509,8 @@ struct lru_gen_folio {
 	atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
 	/* whether the multi-gen LRU is enabled */
 	bool enabled;
+	/* whether the multi-gen LRU is switching from/to active/inactive LRU */
+	bool switching;
 	/* the memcg generation this lru_gen_folio belongs to */
 	u8 gen;
 	/* the list segment this lru_gen_folio belongs to */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0fc9373e8251..60fc611067c7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5196,6 +5196,7 @@ static void lru_gen_change_state(bool enabled)
 			VM_WARN_ON_ONCE(!state_is_valid(lruvec));
 
 			lruvec->lrugen.enabled = enabled;
+			smp_store_release(&lruvec->lrugen.switching, true);
 
 			while (!(enabled ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
 				spin_unlock_irq(&lruvec->lru_lock);
@@ -5203,6 +5204,8 @@ static void lru_gen_change_state(bool enabled)
 				spin_lock_irq(&lruvec->lru_lock);
 			}
 
+			smp_store_release(&lruvec->lrugen.switching, false);
+
 			spin_unlock_irq(&lruvec->lru_lock);
 		}
 
@@ -5780,6 +5783,10 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	bool proportional_reclaim;
 	struct blk_plug plug;
 
+#ifdef CONFIG_LRU_GEN
+	while (smp_load_acquire(&lruvec->lrugen.switching))
+		schedule_timeout_uninterruptible(HZ/100);
+#endif
 	if (lru_gen_enabled() && !root_reclaim(sc)) {
 		lru_gen_shrink_lruvec(lruvec, sc);
 		return;
-- 

> +
> +               if (!lru_draining)
> +                       return;
> +
>         }
>
>         get_scan_count(lruvec, sc, nr);
> --
> 2.52.0
>

Thanks
Barry


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-02-28 21:28 ` Barry Song
@ 2026-02-28 22:41   ` Barry Song
  2026-03-01  4:10     ` Barry Song
  2026-03-02  5:50   ` Yafang Shao
  1 sibling, 1 reply; 27+ messages in thread
From: Barry Song @ 2026-02-28 22:41 UTC (permalink / raw)
  To: lenohou
  Cc: akpm, axelrasmussen, laoar.shao, linux-kernel, linux-mm, weixugc,
	wjl.linux, yuanchu, yuzhao

On Sun, Mar 1, 2026 at 5:28 AM Barry Song <21cnbao@gmail.com> wrote:
[...]
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 3e51190a55e4..ba306e986050 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -509,6 +509,8 @@ struct lru_gen_folio {
>         atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
>         /* whether the multi-gen LRU is enabled */
>         bool enabled;
> +       /* whether the multi-gen LRU is switching from/to active/inactive LRU */
> +       bool switching;
>         /* the memcg generation this lru_gen_folio belongs to */
>         u8 gen;
>         /* the list segment this lru_gen_folio belongs to */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0fc9373e8251..60fc611067c7 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -5196,6 +5196,7 @@ static void lru_gen_change_state(bool enabled)
>                         VM_WARN_ON_ONCE(!state_is_valid(lruvec));
>
>                         lruvec->lrugen.enabled = enabled;
> +                       smp_store_release(&lruvec->lrugen.switching, true);

Sorry, I actually meant:

 +                       smp_store_release(&lruvec->lrugen.switching, true);
                         lruvec->lrugen.enabled = enabled;

But I guess we could still hit a race condition in extreme cases—switching
MGLRU on or off as frequently as possible. The only reliable way is to check
enabled during shrinking while holding the lruvec’s lock.

Thanks
Barry


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-02-28 22:41   ` Barry Song
@ 2026-03-01  4:10     ` Barry Song
  0 siblings, 0 replies; 27+ messages in thread
From: Barry Song @ 2026-03-01  4:10 UTC (permalink / raw)
  To: lenohou
  Cc: 21cnbao, akpm, axelrasmussen, laoar.shao, linux-kernel, linux-mm,
	weixugc, wjl.linux, yuanchu, yuzhao

On Sun, Mar 1, 2026 at 6:41 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sun, Mar 1, 2026 at 5:28 AM Barry Song <21cnbao@gmail.com> wrote:
> [...]
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 3e51190a55e4..ba306e986050 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -509,6 +509,8 @@ struct lru_gen_folio {
> >         atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
> >         /* whether the multi-gen LRU is enabled */
> >         bool enabled;
> > +       /* whether the multi-gen LRU is switching from/to active/inactive LRU */
> > +       bool switching;
> >         /* the memcg generation this lru_gen_folio belongs to */
> >         u8 gen;
> >         /* the list segment this lru_gen_folio belongs to */
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 0fc9373e8251..60fc611067c7 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -5196,6 +5196,7 @@ static void lru_gen_change_state(bool enabled)
> >                         VM_WARN_ON_ONCE(!state_is_valid(lruvec));
> >
> >                         lruvec->lrugen.enabled = enabled;
> > +                       smp_store_release(&lruvec->lrugen.switching, true);
>
> Sorry, I actually meant:
>
>  +                       smp_store_release(&lruvec->lrugen.switching, true);
>                          lruvec->lrugen.enabled = enabled;
>
> But I guess we could still hit a race condition in extreme cases—switching
> MGLRU on or off as frequently as possible. The only reliable way is to check
> enabled during shrinking while holding the lruvec’s lock.

Sorry, I was talking to myself.... Since the switching and the 'enabled'
state are not inherently serialized with shrink_lruvec(), their values
can change at any time, leading to race conditions.

Therefore, I believe the only safe approach is:
1. Do not allow enabling or disabling MGLRU on an lruvec while
   shrink_lruvec() is running.
2. Do not allow shrink_lruvec() to run while MGLRU is being enabled
   or disabled on that lruvec.

Something like the following:

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3e51190a55e4..c4b07159577e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -509,6 +509,7 @@ struct lru_gen_folio {
 	atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
 	/* whether the multi-gen LRU is enabled */
 	bool enabled;
+	struct rw_semaphore switch_lock;
 	/* the memcg generation this lru_gen_folio belongs to */
 	u8 gen;
 	/* the list segment this lru_gen_folio belongs to */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0fc9373e8251..aadf1e7c31cf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5190,6 +5190,7 @@ static void lru_gen_change_state(bool enabled)
 		for_each_node(nid) {
 			struct lruvec *lruvec = get_lruvec(memcg, nid);
 
+			down_write(&lruvec->lrugen.switch_lock);
 			spin_lock_irq(&lruvec->lru_lock);
 
 			VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
@@ -5204,6 +5205,7 @@ static void lru_gen_change_state(bool enabled)
 			}
 
 			spin_unlock_irq(&lruvec->lru_lock);
+			up_write(&lruvec->lrugen.switch_lock);
 		}
 
 		cond_resched();
@@ -5680,6 +5682,7 @@ void lru_gen_init_lruvec(struct lruvec *lruvec)
 
 	lrugen->max_seq = MIN_NR_GENS + 1;
 	lrugen->enabled = lru_gen_enabled();
+	init_rwsem(&lrugen->switch_lock);
 
 	for (i = 0; i <= MIN_NR_GENS + 1; i++)
 		lrugen->timestamps[i] = jiffies;
@@ -5780,10 +5783,14 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	bool proportional_reclaim;
 	struct blk_plug plug;
 
-	if (lru_gen_enabled() && !root_reclaim(sc)) {
+#ifdef CONFIG_LRU_GEN
+	down_read(&lruvec->lrugen.switch_lock);
+	if (lruvec->lrugen.enabled && !root_reclaim(sc)) {
 		lru_gen_shrink_lruvec(lruvec, sc);
+		up_read(&lruvec->lrugen.switch_lock);
 		return;
 	}
+#endif
 
 	get_scan_count(lruvec, sc, nr);
 
@@ -5885,6 +5892,9 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	    inactive_is_low(lruvec, LRU_INACTIVE_ANON))
 		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
 				   sc, LRU_ACTIVE_ANON);
+#ifdef CONFIG_LRU_GEN
+	up_read(&lruvec->lrugen.switch_lock);
+#endif
 }
 
 /* Use reclaim/compaction for costly allocs or under memory pressure */
-- 

Thanks
Barry



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-02-28 21:28 ` Barry Song
  2026-02-28 22:41   ` Barry Song
@ 2026-03-02  5:50   ` Yafang Shao
  2026-03-02  6:58     ` Barry Song
  1 sibling, 1 reply; 27+ messages in thread
From: Yafang Shao @ 2026-03-02  5:50 UTC (permalink / raw)
  To: Barry Song
  Cc: lenohou, akpm, axelrasmussen, linux-kernel, linux-mm, weixugc,
	wjl.linux, yuanchu, yuzhao

On Sun, Mar 1, 2026 at 5:28 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sun, Mar 1, 2026 at 12:10 AM Leno Hou <lenohou@gmail.com> wrote:
> >
> > When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
> > condition exists between the state switching and the memory reclaim
> > path. This can lead to unexpected cgroup OOM kills, even when plenty of
> > reclaimable memory is available.
> >
> > *** Problem Description ***
> >
> > The issue arises from a "reclaim vacuum" during the transition:
> >
> > 1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
> >    false before the pages are drained from MGLRU lists back to
> >    traditional LRU lists.
> > 2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
> >    and skip the MGLRU path.
> > 3. However, these pages might not have reached the traditional LRU lists
> >    yet, or the changes are not yet visible to all CPUs due to a lack of
> >    synchronization.
> > 4. get_scan_count() subsequently finds traditional LRU lists empty,
> >    concludes there is no reclaimable memory, and triggers an OOM kill.
> >
> > A similar race can occur during enablement, where the reclaimer sees
> > the new state but the MGLRU lists haven't been populated via
> > fill_evictable() yet.
> >
> > *** Solution ***
> >
> > Introduce a 'draining' state to bridge the gap during transitions:
> >
> > - Use smp_store_release() and smp_load_acquire() to ensure the visibility
> >   of 'enabled' and 'draining' flags across CPUs.
> > - Modify shrink_lruvec() to allow a "joint reclaim" period. If an lruvec
> >   is in the 'draining' state, the reclaimer will attempt to scan MGLRU
> >   lists first, and then fall through to traditional LRU lists instead
> >   of returning early. This ensures that folios are visible to at least
> >   one reclaim path at any given time.
> >
> > *** Reproduction ***
> >
> > The issue was consistently reproduced on v6.1.157 and v6.18.3 using
> > a high-pressure memory cgroup (v1) environment.
> >
> > Reproduction steps:
> > 1. Create a 16GB memcg and populate it with 10GB file cache (5GB active)
> >    and 8GB active anonymous memory.
> > 2. Toggle MGLRU state while performing new memory allocations to force
> >    direct reclaim.
> >
> > Reproduction script:
> > ---
> > #!/bin/bash
> > # Fixed reproduction for memcg OOM during MGLRU toggle
> > set -euo pipefail
> >
> > MGLRU_FILE="/sys/kernel/mm/lru_gen/enabled"
> > CGROUP_PATH="/sys/fs/cgroup/memory/memcg_oom_test"
> >
> > # Switch MGLRU dynamically in the background
> > switch_mglru() {
> >     local orig_val=$(cat "$MGLRU_FILE")
> >     if [[ "$orig_val" != "0x0000" ]]; then
> >         echo n > "$MGLRU_FILE" &
> >     else
> >         echo y > "$MGLRU_FILE" &
> >     fi
> > }
> >
> > # Setup 16G memcg
> > mkdir -p "$CGROUP_PATH"
> > echo $((16 * 1024 * 1024 * 1024)) > "$CGROUP_PATH/memory.limit_in_bytes"
> > echo $$ > "$CGROUP_PATH/cgroup.procs"
> >
> > # 1. Build memory pressure (File + Anon)
> > dd if=/dev/urandom of=/tmp/test_file bs=1M count=10240
> > dd if=/tmp/test_file of=/dev/null bs=1M # Warm up cache
> >
> > stress-ng --vm 1 --vm-bytes 8G --vm-keep -t 600 &
> > sleep 5
> >
> > # 2. Trigger switch and concurrent allocation
> > switch_mglru
> > stress-ng --vm 1 --vm-bytes 2G --vm-populate --timeout 5s || echo "OOM Triggered"
> >
> > # Check OOM counter
> > grep oom_kill "$CGROUP_PATH/memory.oom_control"
> > ---
> >
> > Signed-off-by: Leno Hou <lenohou@gmail.com>
> >
> > ---
> > To: linux-mm@kvack.org
> > To: linux-kernel@vger.kernel.org
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Axel Rasmussen <axelrasmussen@google.com>
> > Cc: Yuanchu Xie <yuanchu@google.com>
> > Cc: Wei Xu <weixugc@google.com>
> > Cc: Barry Song <21cnbao@gmail.com>
> > Cc: Jialing Wang <wjl.linux@gmail.com>
> > Cc: Yafang Shao <laoar.shao@gmail.com>
> > Cc: Yu Zhao <yuzhao@google.com>
> > ---
> >  include/linux/mmzone.h |  2 ++
> >  mm/vmscan.c            | 14 +++++++++++---
> >  2 files changed, 13 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 7fb7331c5725..0648ce91dbc6 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -509,6 +509,8 @@ struct lru_gen_folio {
> >         atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
> >         /* whether the multi-gen LRU is enabled */
> >         bool enabled;
> > +       /* whether the multi-gen LRU is draining to LRU */
> > +       bool draining;
> >         /* the memcg generation this lru_gen_folio belongs to */
> >         u8 gen;
> >         /* the list segment this lru_gen_folio belongs to */
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 06071995dacc..629a00681163 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -5222,7 +5222,8 @@ static void lru_gen_change_state(bool enabled)
> >                         VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
> >                         VM_WARN_ON_ONCE(!state_is_valid(lruvec));
> >
> > -                       lruvec->lrugen.enabled = enabled;
> > +                       smp_store_release(&lruvec->lrugen.enabled, enabled);
> > +                       smp_store_release(&lruvec->lrugen.draining, true);
> >
> >                         while (!(enabled ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
> >                                 spin_unlock_irq(&lruvec->lru_lock);
> > @@ -5230,6 +5231,8 @@ static void lru_gen_change_state(bool enabled)
> >                                 spin_lock_irq(&lruvec->lru_lock);
> >                         }
> >
> > +                       smp_store_release(&lruvec->lrugen.draining, false);
> > +
> >                         spin_unlock_irq(&lruvec->lru_lock);
> >                 }
> >
> > @@ -5813,10 +5816,15 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> >         unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> >         bool proportional_reclaim;
> >         struct blk_plug plug;
> > +       bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> > +       bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
> >
> > -       if (lru_gen_enabled() && !root_reclaim(sc)) {
> > +       if (lrugen_enabled || lru_draining && !root_reclaim(sc)) {
> >                 lru_gen_shrink_lruvec(lruvec, sc);
> > -               return;
>

Hello Barry,

> Is it possible to simply wait for draining to finish instead of performing
> an lru_gen/lru shrink while lru_gen is being disabled or enabled?

This might introduce unexpected latency spikes during the waiting period.

>
> Performing a shrink in an intermediate state may still involve a lot of
> uncertainty, depending on how far the shrink has progressed and how much
> remains in each side’s LRU？

The workingset might not be reliable in this intermediate state.
However, since switching MGLRU should not be a frequent operation in a
production environment, I believe the workingset in this intermediate
state should not be a concern. The only reason we would enable or
disable MGLRU is if we find that certain workloads benefit from
it—enabling it when it helps, and disabling it when it causes
degradation. There should be no other scenario in which we would need
to toggle MGLRU on or off.

To identify which workloads can benefit from MGLRU, we must first
ensure that switching it on or off is safe—which is precisely why we
are proposing this patch. Once MGLRU is enabled in production, we can
continue to improve it. Perhaps in the future, we can even implement a
per-workload reclaim mechanism.


--
Regards
Yafang


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-03-02  5:50   ` Yafang Shao
@ 2026-03-02  6:58     ` Barry Song
  2026-03-02  7:43       ` Yafang Shao
  0 siblings, 1 reply; 27+ messages in thread
From: Barry Song @ 2026-03-02  6:58 UTC (permalink / raw)
  To: Yafang Shao
  Cc: lenohou, akpm, axelrasmussen, linux-kernel, linux-mm, weixugc,
	wjl.linux, yuanchu, yuzhao

On Mon, Mar 2, 2026 at 1:50 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Sun, Mar 1, 2026 at 5:28 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sun, Mar 1, 2026 at 12:10 AM Leno Hou <lenohou@gmail.com> wrote:
> > >
> > > When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
> > > condition exists between the state switching and the memory reclaim
> > > path. This can lead to unexpected cgroup OOM kills, even when plenty of
> > > reclaimable memory is available.
> > >
> > > *** Problem Description ***
> > >
> > > The issue arises from a "reclaim vacuum" during the transition:
> > >
> > > 1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
> > >    false before the pages are drained from MGLRU lists back to
> > >    traditional LRU lists.
> > > 2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
> > >    and skip the MGLRU path.
> > > 3. However, these pages might not have reached the traditional LRU lists
> > >    yet, or the changes are not yet visible to all CPUs due to a lack of
> > >    synchronization.
> > > 4. get_scan_count() subsequently finds traditional LRU lists empty,
> > >    concludes there is no reclaimable memory, and triggers an OOM kill.
> > >
> > > A similar race can occur during enablement, where the reclaimer sees
> > > the new state but the MGLRU lists haven't been populated via
> > > fill_evictable() yet.
> > >
> > > *** Solution ***
> > >
> > > Introduce a 'draining' state to bridge the gap during transitions:
> > >
> > > - Use smp_store_release() and smp_load_acquire() to ensure the visibility
> > >   of 'enabled' and 'draining' flags across CPUs.
> > > - Modify shrink_lruvec() to allow a "joint reclaim" period. If an lruvec
> > >   is in the 'draining' state, the reclaimer will attempt to scan MGLRU
> > >   lists first, and then fall through to traditional LRU lists instead
> > >   of returning early. This ensures that folios are visible to at least
> > >   one reclaim path at any given time.
> > >
> > > *** Reproduction ***
> > >
> > > The issue was consistently reproduced on v6.1.157 and v6.18.3 using
> > > a high-pressure memory cgroup (v1) environment.
> > >
> > > Reproduction steps:
> > > 1. Create a 16GB memcg and populate it with 10GB file cache (5GB active)
> > >    and 8GB active anonymous memory.
> > > 2. Toggle MGLRU state while performing new memory allocations to force
> > >    direct reclaim.
> > >
> > > Reproduction script:
> > > ---
> > > #!/bin/bash
> > > # Fixed reproduction for memcg OOM during MGLRU toggle
> > > set -euo pipefail
> > >
> > > MGLRU_FILE="/sys/kernel/mm/lru_gen/enabled"
> > > CGROUP_PATH="/sys/fs/cgroup/memory/memcg_oom_test"
> > >
> > > # Switch MGLRU dynamically in the background
> > > switch_mglru() {
> > >     local orig_val=$(cat "$MGLRU_FILE")
> > >     if [[ "$orig_val" != "0x0000" ]]; then
> > >         echo n > "$MGLRU_FILE" &
> > >     else
> > >         echo y > "$MGLRU_FILE" &
> > >     fi
> > > }
> > >
> > > # Setup 16G memcg
> > > mkdir -p "$CGROUP_PATH"
> > > echo $((16 * 1024 * 1024 * 1024)) > "$CGROUP_PATH/memory.limit_in_bytes"
> > > echo $$ > "$CGROUP_PATH/cgroup.procs"
> > >
> > > # 1. Build memory pressure (File + Anon)
> > > dd if=/dev/urandom of=/tmp/test_file bs=1M count=10240
> > > dd if=/tmp/test_file of=/dev/null bs=1M # Warm up cache
> > >
> > > stress-ng --vm 1 --vm-bytes 8G --vm-keep -t 600 &
> > > sleep 5
> > >
> > > # 2. Trigger switch and concurrent allocation
> > > switch_mglru
> > > stress-ng --vm 1 --vm-bytes 2G --vm-populate --timeout 5s || echo "OOM Triggered"
> > >
> > > # Check OOM counter
> > > grep oom_kill "$CGROUP_PATH/memory.oom_control"
> > > ---
> > >
> > > Signed-off-by: Leno Hou <lenohou@gmail.com>
> > >
> > > ---
> > > To: linux-mm@kvack.org
> > > To: linux-kernel@vger.kernel.org
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: Axel Rasmussen <axelrasmussen@google.com>
> > > Cc: Yuanchu Xie <yuanchu@google.com>
> > > Cc: Wei Xu <weixugc@google.com>
> > > Cc: Barry Song <21cnbao@gmail.com>
> > > Cc: Jialing Wang <wjl.linux@gmail.com>
> > > Cc: Yafang Shao <laoar.shao@gmail.com>
> > > Cc: Yu Zhao <yuzhao@google.com>
> > > ---
> > >  include/linux/mmzone.h |  2 ++
> > >  mm/vmscan.c            | 14 +++++++++++---
> > >  2 files changed, 13 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index 7fb7331c5725..0648ce91dbc6 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -509,6 +509,8 @@ struct lru_gen_folio {
> > >         atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
> > >         /* whether the multi-gen LRU is enabled */
> > >         bool enabled;
> > > +       /* whether the multi-gen LRU is draining to LRU */
> > > +       bool draining;
> > >         /* the memcg generation this lru_gen_folio belongs to */
> > >         u8 gen;
> > >         /* the list segment this lru_gen_folio belongs to */
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 06071995dacc..629a00681163 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -5222,7 +5222,8 @@ static void lru_gen_change_state(bool enabled)
> > >                         VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
> > >                         VM_WARN_ON_ONCE(!state_is_valid(lruvec));
> > >
> > > -                       lruvec->lrugen.enabled = enabled;
> > > +                       smp_store_release(&lruvec->lrugen.enabled, enabled);
> > > +                       smp_store_release(&lruvec->lrugen.draining, true);
> > >
> > >                         while (!(enabled ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
> > >                                 spin_unlock_irq(&lruvec->lru_lock);
> > > @@ -5230,6 +5231,8 @@ static void lru_gen_change_state(bool enabled)
> > >                                 spin_lock_irq(&lruvec->lru_lock);
> > >                         }
> > >
> > > +                       smp_store_release(&lruvec->lrugen.draining, false);
> > > +
> > >                         spin_unlock_irq(&lruvec->lru_lock);
> > >                 }
> > >
> > > @@ -5813,10 +5816,15 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> > >         unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> > >         bool proportional_reclaim;
> > >         struct blk_plug plug;
> > > +       bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> > > +       bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
> > >
> > > -       if (lru_gen_enabled() && !root_reclaim(sc)) {
> > > +       if (lrugen_enabled || lru_draining && !root_reclaim(sc)) {
> > >                 lru_gen_shrink_lruvec(lruvec, sc);
> > > -               return;
> >
>
> Hello Barry,
>
> > Is it possible to simply wait for draining to finish instead of performing
> > an lru_gen/lru shrink while lru_gen is being disabled or enabled?
>
> This might introduce unexpected latency spikes during the waiting period.

I assume latency is not a concern for a very rare
MGLRU on/off case. Do you require the switch to happen
with zero latency?
My main concern is the correctness of the code.

Now the proposed patch is:

+       bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
+       bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);

Then choose MGLRU or active/inactive LRU based on
those values.

However, nothing prevents those values from changing
after they are read. Even within the shrink path,
they can still change.

So I think we need an rwsem or something similar here —
a read lock for shrink and a write lock for on/off. The
write lock should happen very rarely.

>
> >
> > Performing a shrink in an intermediate state may still involve a lot of
> > uncertainty, depending on how far the shrink has progressed and how much
> > remains in each side’s LRU？
>
> The workingset might not be reliable in this intermediate state.
> However, since switching MGLRU should not be a frequent operation in a
> production environment, I believe the workingset in this intermediate
> state should not be a concern. The only reason we would enable or
> disable MGLRU is if we find that certain workloads benefit from
> it—enabling it when it helps, and disabling it when it causes
> degradation. There should be no other scenario in which we would need
> to toggle MGLRU on or off.
>
> To identify which workloads can benefit from MGLRU, we must first
> ensure that switching it on or off is safe—which is precisely why we
> are proposing this patch. Once MGLRU is enabled in production, we can
> continue to improve it. Perhaps in the future, we can even implement a
> per-workload reclaim mechanism.

To be honest, the on/off toggle is quite odd. If possible,
I’d prefer not to switch MGLRU or active/inactive
dynamically. Once it’s set up during system boot, it
should remain unchanged.

If we want a per-workload LRU, this could be a good
place for eBPF to hook into folio enqueue, dequeue,
and scanning. There is a project related to this [1][2].

// Policy function hooks
struct cache_ext_ops {
       s32 (*policy_init)(struct mem_cgroup *memcg);
       // Propose folios to evict
       void (*evict_folios)(struct eviction_ctx *ctx,
                 struct mem_cgroup *memcg);
       void (*folio_added)(struct folio *folio);
       void (*folio_accessed)(struct folio *folio);
       // Folio was removed: clean up metadata
       void (*folio_removed)(struct folio *folio);
       char name[CACHE_EXT_OPS_NAME_LEN];
};

However, we would need a very strong and convincing
user case to justify it.

[1] https://dl.acm.org/doi/pdf/10.1145/3731569.3764820
[2] https://github.com/cache-ext/cache_ext

Thanks
Barry


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-03-02  6:58     ` Barry Song
@ 2026-03-02  7:43       ` Yafang Shao
  2026-03-02  8:00         ` Kairui Song
  2026-03-02  8:03         ` Barry Song
  0 siblings, 2 replies; 27+ messages in thread
From: Yafang Shao @ 2026-03-02  7:43 UTC (permalink / raw)
  To: Barry Song
  Cc: lenohou, akpm, axelrasmussen, linux-kernel, linux-mm, weixugc,
	wjl.linux, yuanchu, yuzhao

On Mon, Mar 2, 2026 at 2:58 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Mon, Mar 2, 2026 at 1:50 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Sun, Mar 1, 2026 at 5:28 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Sun, Mar 1, 2026 at 12:10 AM Leno Hou <lenohou@gmail.com> wrote:
> > > >
> > > > When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
> > > > condition exists between the state switching and the memory reclaim
> > > > path. This can lead to unexpected cgroup OOM kills, even when plenty of
> > > > reclaimable memory is available.
> > > >
> > > > *** Problem Description ***
> > > >
> > > > The issue arises from a "reclaim vacuum" during the transition:
> > > >
> > > > 1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
> > > >    false before the pages are drained from MGLRU lists back to
> > > >    traditional LRU lists.
> > > > 2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
> > > >    and skip the MGLRU path.
> > > > 3. However, these pages might not have reached the traditional LRU lists
> > > >    yet, or the changes are not yet visible to all CPUs due to a lack of
> > > >    synchronization.
> > > > 4. get_scan_count() subsequently finds traditional LRU lists empty,
> > > >    concludes there is no reclaimable memory, and triggers an OOM kill.
> > > >
> > > > A similar race can occur during enablement, where the reclaimer sees
> > > > the new state but the MGLRU lists haven't been populated via
> > > > fill_evictable() yet.
> > > >
> > > > *** Solution ***
> > > >
> > > > Introduce a 'draining' state to bridge the gap during transitions:
> > > >
> > > > - Use smp_store_release() and smp_load_acquire() to ensure the visibility
> > > >   of 'enabled' and 'draining' flags across CPUs.
> > > > - Modify shrink_lruvec() to allow a "joint reclaim" period. If an lruvec
> > > >   is in the 'draining' state, the reclaimer will attempt to scan MGLRU
> > > >   lists first, and then fall through to traditional LRU lists instead
> > > >   of returning early. This ensures that folios are visible to at least
> > > >   one reclaim path at any given time.
> > > >
> > > > *** Reproduction ***
> > > >
> > > > The issue was consistently reproduced on v6.1.157 and v6.18.3 using
> > > > a high-pressure memory cgroup (v1) environment.
> > > >
> > > > Reproduction steps:
> > > > 1. Create a 16GB memcg and populate it with 10GB file cache (5GB active)
> > > >    and 8GB active anonymous memory.
> > > > 2. Toggle MGLRU state while performing new memory allocations to force
> > > >    direct reclaim.
> > > >
> > > > Reproduction script:
> > > > ---
> > > > #!/bin/bash
> > > > # Fixed reproduction for memcg OOM during MGLRU toggle
> > > > set -euo pipefail
> > > >
> > > > MGLRU_FILE="/sys/kernel/mm/lru_gen/enabled"
> > > > CGROUP_PATH="/sys/fs/cgroup/memory/memcg_oom_test"
> > > >
> > > > # Switch MGLRU dynamically in the background
> > > > switch_mglru() {
> > > >     local orig_val=$(cat "$MGLRU_FILE")
> > > >     if [[ "$orig_val" != "0x0000" ]]; then
> > > >         echo n > "$MGLRU_FILE" &
> > > >     else
> > > >         echo y > "$MGLRU_FILE" &
> > > >     fi
> > > > }
> > > >
> > > > # Setup 16G memcg
> > > > mkdir -p "$CGROUP_PATH"
> > > > echo $((16 * 1024 * 1024 * 1024)) > "$CGROUP_PATH/memory.limit_in_bytes"
> > > > echo $$ > "$CGROUP_PATH/cgroup.procs"
> > > >
> > > > # 1. Build memory pressure (File + Anon)
> > > > dd if=/dev/urandom of=/tmp/test_file bs=1M count=10240
> > > > dd if=/tmp/test_file of=/dev/null bs=1M # Warm up cache
> > > >
> > > > stress-ng --vm 1 --vm-bytes 8G --vm-keep -t 600 &
> > > > sleep 5
> > > >
> > > > # 2. Trigger switch and concurrent allocation
> > > > switch_mglru
> > > > stress-ng --vm 1 --vm-bytes 2G --vm-populate --timeout 5s || echo "OOM Triggered"
> > > >
> > > > # Check OOM counter
> > > > grep oom_kill "$CGROUP_PATH/memory.oom_control"
> > > > ---
> > > >
> > > > Signed-off-by: Leno Hou <lenohou@gmail.com>
> > > >
> > > > ---
> > > > To: linux-mm@kvack.org
> > > > To: linux-kernel@vger.kernel.org
> > > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > > Cc: Axel Rasmussen <axelrasmussen@google.com>
> > > > Cc: Yuanchu Xie <yuanchu@google.com>
> > > > Cc: Wei Xu <weixugc@google.com>
> > > > Cc: Barry Song <21cnbao@gmail.com>
> > > > Cc: Jialing Wang <wjl.linux@gmail.com>
> > > > Cc: Yafang Shao <laoar.shao@gmail.com>
> > > > Cc: Yu Zhao <yuzhao@google.com>
> > > > ---
> > > >  include/linux/mmzone.h |  2 ++
> > > >  mm/vmscan.c            | 14 +++++++++++---
> > > >  2 files changed, 13 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > > index 7fb7331c5725..0648ce91dbc6 100644
> > > > --- a/include/linux/mmzone.h
> > > > +++ b/include/linux/mmzone.h
> > > > @@ -509,6 +509,8 @@ struct lru_gen_folio {
> > > >         atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
> > > >         /* whether the multi-gen LRU is enabled */
> > > >         bool enabled;
> > > > +       /* whether the multi-gen LRU is draining to LRU */
> > > > +       bool draining;
> > > >         /* the memcg generation this lru_gen_folio belongs to */
> > > >         u8 gen;
> > > >         /* the list segment this lru_gen_folio belongs to */
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 06071995dacc..629a00681163 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -5222,7 +5222,8 @@ static void lru_gen_change_state(bool enabled)
> > > >                         VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
> > > >                         VM_WARN_ON_ONCE(!state_is_valid(lruvec));
> > > >
> > > > -                       lruvec->lrugen.enabled = enabled;
> > > > +                       smp_store_release(&lruvec->lrugen.enabled, enabled);
> > > > +                       smp_store_release(&lruvec->lrugen.draining, true);
> > > >
> > > >                         while (!(enabled ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
> > > >                                 spin_unlock_irq(&lruvec->lru_lock);
> > > > @@ -5230,6 +5231,8 @@ static void lru_gen_change_state(bool enabled)
> > > >                                 spin_lock_irq(&lruvec->lru_lock);
> > > >                         }
> > > >
> > > > +                       smp_store_release(&lruvec->lrugen.draining, false);
> > > > +
> > > >                         spin_unlock_irq(&lruvec->lru_lock);
> > > >                 }
> > > >
> > > > @@ -5813,10 +5816,15 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> > > >         unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> > > >         bool proportional_reclaim;
> > > >         struct blk_plug plug;
> > > > +       bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> > > > +       bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
> > > >
> > > > -       if (lru_gen_enabled() && !root_reclaim(sc)) {
> > > > +       if (lrugen_enabled || lru_draining && !root_reclaim(sc)) {
> > > >                 lru_gen_shrink_lruvec(lruvec, sc);
> > > > -               return;
> > >
> >
> > Hello Barry,
> >
> > > Is it possible to simply wait for draining to finish instead of performing
> > > an lru_gen/lru shrink while lru_gen is being disabled or enabled?
> >
> > This might introduce unexpected latency spikes during the waiting period.
>
> I assume latency is not a concern for a very rare
> MGLRU on/off case. Do you require the switch to happen
> with zero latency?
> My main concern is the correctness of the code.
>
> Now the proposed patch is:
>
> +       bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> +       bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
>
> Then choose MGLRU or active/inactive LRU based on
> those values.
>
> However, nothing prevents those values from changing
> after they are read. Even within the shrink path,
> they can still change.

If these values are changed during reclaim, the currently running
reclaimer will continue to operate with the old settings, while any
new reclaimer processes will adopt the new values. This approach
should prevent any immediate issues, but the primary risk of this
lockless method is the potential for a user to rapidly toggle the
MGLRU feature, particularly during an intermediate state.

>
> So I think we need an rwsem or something similar here —
> a read lock for shrink and a write lock for on/off. The
> write lock should happen very rarely.

We can introduce a lock-based mechanism in v2.

>
> >
> > >
> > > Performing a shrink in an intermediate state may still involve a lot of
> > > uncertainty, depending on how far the shrink has progressed and how much
> > > remains in each side’s LRU？
> >
> > The workingset might not be reliable in this intermediate state.
> > However, since switching MGLRU should not be a frequent operation in a
> > production environment, I believe the workingset in this intermediate
> > state should not be a concern. The only reason we would enable or
> > disable MGLRU is if we find that certain workloads benefit from
> > it—enabling it when it helps, and disabling it when it causes
> > degradation. There should be no other scenario in which we would need
> > to toggle MGLRU on or off.
> >
> > To identify which workloads can benefit from MGLRU, we must first
> > ensure that switching it on or off is safe—which is precisely why we
> > are proposing this patch. Once MGLRU is enabled in production, we can
> > continue to improve it. Perhaps in the future, we can even implement a
> > per-workload reclaim mechanism.
>
> To be honest, the on/off toggle is quite odd. If possible,
> I’d prefer not to switch MGLRU or active/inactive
> dynamically. Once it’s set up during system boot, it
> should remain unchanged.

While it is well-suited for Android environments, it is not viable for
Kubernetes production servers, where rebooting is highly disruptive.
This limitation is precisely why we need to introduce dynamic toggles.

>
> If we want a per-workload LRU, this could be a good
> place for eBPF to hook into folio enqueue, dequeue,
> and scanning. There is a project related to this [1][2].
>
> // Policy function hooks
> struct cache_ext_ops {
>        s32 (*policy_init)(struct mem_cgroup *memcg);
>        // Propose folios to evict
>        void (*evict_folios)(struct eviction_ctx *ctx,
>                  struct mem_cgroup *memcg);
>        void (*folio_added)(struct folio *folio);
>        void (*folio_accessed)(struct folio *folio);
>        // Folio was removed: clean up metadata
>        void (*folio_removed)(struct folio *folio);
>        char name[CACHE_EXT_OPS_NAME_LEN];
> };
>
> However, we would need a very strong and convincing
> user case to justify it.

Thanks for the info.
We're actually already running a BPF-based reclaimer in production,
but we don't have immediate plans to upstream or propose it just yet.

>
> [1] https://dl.acm.org/doi/pdf/10.1145/3731569.3764820
> [2] https://github.com/cache-ext/cache_ext
>
> Thanks
> Barry



-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-03-02  7:43       ` Yafang Shao
@ 2026-03-02  8:00         ` Kairui Song
  2026-03-02  8:15           ` Barry Song
                             ` (2 more replies)
  2026-03-02  8:03         ` Barry Song
  1 sibling, 3 replies; 27+ messages in thread
From: Kairui Song @ 2026-03-02  8:00 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Barry Song, lenohou, akpm, axelrasmussen, linux-kernel, linux-mm,
	weixugc, wjl.linux, yuanchu, yuzhao

On Mon, Mar 2, 2026 at 3:43 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Mon, Mar 2, 2026 at 2:58 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > I assume latency is not a concern for a very rare
> > MGLRU on/off case. Do you require the switch to happen
> > with zero latency?
> > My main concern is the correctness of the code.
> >
> > Now the proposed patch is:
> >
> > +       bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> > +       bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
> >
> > Then choose MGLRU or active/inactive LRU based on
> > those values.
> >
> > However, nothing prevents those values from changing
> > after they are read. Even within the shrink path,
> > they can still change.

Hi all,

> If these values are changed during reclaim, the currently running
> reclaimer will continue to operate with the old settings, while any
> new reclaimer processes will adopt the new values. This approach
> should prevent any immediate issues, but the primary risk of this
> lockless method is the potential for a user to rapidly toggle the
> MGLRU feature, particularly during an intermediate state.
>
> >
> > So I think we need an rwsem or something similar here —
> > a read lock for shrink and a write lock for on/off. The
> > write lock should happen very rarely.
>
> We can introduce a lock-based mechanism in v2.

I hope we don't need a lock here. Currently there is only a static
key, this patch is already adding more branches, a lock will make
things more complex and the shrinking path is quite performance
sensitive.

> >
> > To be honest, the on/off toggle is quite odd. If possible,
> > I’d prefer not to switch MGLRU or active/inactive
> > dynamically. Once it’s set up during system boot, it
> > should remain unchanged.
>
> While it is well-suited for Android environments, it is not viable for
> Kubernetes production servers, where rebooting is highly disruptive.
> This limitation is precisely why we need to introduce dynamic toggles.

I agree with Barry, the switch isn't supposed to be a knob to be
turned on/off frequently. And I think in the long term we should just
identify the workloads where MGLRU doesn't work well, and fix MGLRU.
Having two LRUs in the kernel is already very odd.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-03-02  8:00         ` Kairui Song
@ 2026-03-02  8:15           ` Barry Song
  2026-03-02  8:25           ` Yafang Shao
  2026-03-02 16:26           ` Michal Hocko
  2 siblings, 0 replies; 27+ messages in thread
From: Barry Song @ 2026-03-02  8:15 UTC (permalink / raw)
  To: Kairui Song
  Cc: Yafang Shao, lenohou, akpm, axelrasmussen, linux-kernel,
	linux-mm, weixugc, wjl.linux, yuanchu, yuzhao

On Mon, Mar 2, 2026 at 4:00 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Mon, Mar 2, 2026 at 3:43 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Mon, Mar 2, 2026 at 2:58 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > I assume latency is not a concern for a very rare
> > > MGLRU on/off case. Do you require the switch to happen
> > > with zero latency?
> > > My main concern is the correctness of the code.
> > >
> > > Now the proposed patch is:
> > >
> > > +       bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> > > +       bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
> > >
> > > Then choose MGLRU or active/inactive LRU based on
> > > those values.
> > >
> > > However, nothing prevents those values from changing
> > > after they are read. Even within the shrink path,
> > > they can still change.
>
> Hi all,
>
> > If these values are changed during reclaim, the currently running
> > reclaimer will continue to operate with the old settings, while any
> > new reclaimer processes will adopt the new values. This approach
> > should prevent any immediate issues, but the primary risk of this
> > lockless method is the potential for a user to rapidly toggle the
> > MGLRU feature, particularly during an intermediate state.
> >
> > >
> > > So I think we need an rwsem or something similar here —
> > > a read lock for shrink and a write lock for on/off. The
> > > write lock should happen very rarely.
> >
> > We can introduce a lock-based mechanism in v2.
>
> I hope we don't need a lock here. Currently there is only a static
> key, this patch is already adding more branches, a lock will make
> things more complex and the shrinking path is quite performance
> sensitive.

I agree that the shrinking path is performance-sensitive. However, the
bottleneck occurs when we move folios out of the LRU, performing
reference checks by scanning PTEs with rmap, unmapping, and compressing
memory. I believe that either the branch or the readlock is too small
to noticeably affect shrink performance.

Thanks
Barry


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-03-02  8:00         ` Kairui Song
  2026-03-02  8:15           ` Barry Song
@ 2026-03-02  8:25           ` Yafang Shao
  2026-03-02  9:20             ` Barry Song
  2026-03-02 16:26           ` Michal Hocko
  2 siblings, 1 reply; 27+ messages in thread
From: Yafang Shao @ 2026-03-02  8:25 UTC (permalink / raw)
  To: Kairui Song
  Cc: Barry Song, lenohou, akpm, axelrasmussen, linux-kernel, linux-mm,
	weixugc, wjl.linux, yuanchu, yuzhao

On Mon, Mar 2, 2026 at 4:00 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Mon, Mar 2, 2026 at 3:43 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Mon, Mar 2, 2026 at 2:58 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > I assume latency is not a concern for a very rare
> > > MGLRU on/off case. Do you require the switch to happen
> > > with zero latency?
> > > My main concern is the correctness of the code.
> > >
> > > Now the proposed patch is:
> > >
> > > +       bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> > > +       bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
> > >
> > > Then choose MGLRU or active/inactive LRU based on
> > > those values.
> > >
> > > However, nothing prevents those values from changing
> > > after they are read. Even within the shrink path,
> > > they can still change.
>
> Hi all,
>
> > If these values are changed during reclaim, the currently running
> > reclaimer will continue to operate with the old settings, while any
> > new reclaimer processes will adopt the new values. This approach
> > should prevent any immediate issues, but the primary risk of this
> > lockless method is the potential for a user to rapidly toggle the
> > MGLRU feature, particularly during an intermediate state.
> >
> > >
> > > So I think we need an rwsem or something similar here —
> > > a read lock for shrink and a write lock for on/off. The
> > > write lock should happen very rarely.
> >
> > We can introduce a lock-based mechanism in v2.
>
> I hope we don't need a lock here. Currently there is only a static
> key, this patch is already adding more branches, a lock will make
> things more complex and the shrinking path is quite performance
> sensitive.
>
> > >
> > > To be honest, the on/off toggle is quite odd. If possible,
> > > I’d prefer not to switch MGLRU or active/inactive
> > > dynamically. Once it’s set up during system boot, it
> > > should remain unchanged.
> >
> > While it is well-suited for Android environments, it is not viable for
> > Kubernetes production servers, where rebooting is highly disruptive.
> > This limitation is precisely why we need to introduce dynamic toggles.
>
> I agree with Barry, the switch isn't supposed to be a knob to be
> turned on/off frequently. And I think in the long term we should just
> identify the workloads where MGLRU doesn't work well, and fix MGLRU.

The challenge we're currently facing is that we don't yet know which
workloads would benefit from it ;)
We do want to enable mglru on our production servers, but first we
need to address the risk of OOM during the switch—that's exactly why
we're proposing this patch.


> Having two LRUs in the kernel is already very odd.

It's difficult to completely move away from either one.
Looking forward to your work.

--
Regards
Yafang


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-03-02  8:25           ` Yafang Shao
@ 2026-03-02  9:20             ` Barry Song
  2026-03-02  9:47               ` Kairui Song
  0 siblings, 1 reply; 27+ messages in thread
From: Barry Song @ 2026-03-02  9:20 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Kairui Song, lenohou, akpm, axelrasmussen, linux-kernel,
	linux-mm, weixugc, wjl.linux, yuanchu, yuzhao

On Mon, Mar 2, 2026 at 4:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Mon, Mar 2, 2026 at 4:00 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Mon, Mar 2, 2026 at 3:43 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Mon, Mar 2, 2026 at 2:58 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > I assume latency is not a concern for a very rare
> > > > MGLRU on/off case. Do you require the switch to happen
> > > > with zero latency?
> > > > My main concern is the correctness of the code.
> > > >
> > > > Now the proposed patch is:
> > > >
> > > > +       bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> > > > +       bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
> > > >
> > > > Then choose MGLRU or active/inactive LRU based on
> > > > those values.
> > > >
> > > > However, nothing prevents those values from changing
> > > > after they are read. Even within the shrink path,
> > > > they can still change.
> >
> > Hi all,
> >
> > > If these values are changed during reclaim, the currently running
> > > reclaimer will continue to operate with the old settings, while any
> > > new reclaimer processes will adopt the new values. This approach
> > > should prevent any immediate issues, but the primary risk of this
> > > lockless method is the potential for a user to rapidly toggle the
> > > MGLRU feature, particularly during an intermediate state.
> > >
> > > >
> > > > So I think we need an rwsem or something similar here —
> > > > a read lock for shrink and a write lock for on/off. The
> > > > write lock should happen very rarely.
> > >
> > > We can introduce a lock-based mechanism in v2.
> >
> > I hope we don't need a lock here. Currently there is only a static
> > key, this patch is already adding more branches, a lock will make
> > things more complex and the shrinking path is quite performance
> > sensitive.
> >
> > > >
> > > > To be honest, the on/off toggle is quite odd. If possible,
> > > > I’d prefer not to switch MGLRU or active/inactive
> > > > dynamically. Once it’s set up during system boot, it
> > > > should remain unchanged.
> > >
> > > While it is well-suited for Android environments, it is not viable for
> > > Kubernetes production servers, where rebooting is highly disruptive.
> > > This limitation is precisely why we need to introduce dynamic toggles.
> >
> > I agree with Barry, the switch isn't supposed to be a knob to be
> > turned on/off frequently. And I think in the long term we should just
> > identify the workloads where MGLRU doesn't work well, and fix MGLRU.
>
> The challenge we're currently facing is that we don't yet know which
> workloads would benefit from it ;)
> We do want to enable mglru on our production servers, but first we
> need to address the risk of OOM during the switch—that's exactly why
> we're proposing this patch.

Nobody objects to your intention to fix it. I’m curious: to what
extent do we want to fix it? Do we aim to merely reduce the probability
of OOM and other mistakes, or do we want a complete fix that makes
the dynamic on/off fully safe?

Currently, many places appear fragile, mainly because
`lru_gen_enabled()` checks a global variable that doesn’t accurately
reflect where folios are during switching. A full fix might require
guarding the shrinking path against the switching path to prevent
simultaneous execution, which would add unnecessary complexity for a
rarely used "feature".

If our goal is only to reduce the probability of mistakes, I feel your
current patch may be fine, even though some race conditions
remain in principle.

Thanks
Barry


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-03-02  9:20             ` Barry Song
@ 2026-03-02  9:47               ` Kairui Song
  2026-03-02 14:35                 ` Yafang Shao
  0 siblings, 1 reply; 27+ messages in thread
From: Kairui Song @ 2026-03-02  9:47 UTC (permalink / raw)
  To: Barry Song, Yafang Shao
  Cc: bingfangguo, lenohou, akpm, axelrasmussen, linux-kernel,
	linux-mm, weixugc, wjl.linux, yuanchu, yuzhao

On Mon, Mar 2, 2026 at 5:20 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Mon, Mar 2, 2026 at 4:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > The challenge we're currently facing is that we don't yet know which
> > workloads would benefit from it ;)
> > We do want to enable mglru on our production servers, but first we
> > need to address the risk of OOM during the switch—that's exactly why
> > we're proposing this patch.
>
> Nobody objects to your intention to fix it. I’m curious: to what
> extent do we want to fix it? Do we aim to merely reduce the probability
> of OOM and other mistakes, or do we want a complete fix that makes
> the dynamic on/off fully safe?

Yeah, I'm glad that more people are trying MGLRU and improving it.

We also have an downstream fix for the OOM on switch issue, but that's
mostly as a fallback in case MGLRU doesn't work well, our goal is
still try to enable MGLRU as much as possible, many issues have been
identified and I'm willing to push and fix things upstream together.

I didn't consider the the OOM on switch an upstream issue though. But
to fix that we just used a schedule_timeout when seeing the lru status
is different from the global status, very close to what Barry
suggested, with some other tweaks.

Keep doing the reclaim during the switch did result in some unexpected
behaviors, including OOM still occurring, just much more unlikely than
before. Like a typical TOCTOU problem for checking the lru's status.

Let me Cc BIngfang, maybe he can provide more detail.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-03-02  9:47               ` Kairui Song
@ 2026-03-02 14:35                 ` Yafang Shao
  2026-03-02 17:51                   ` Yuanchu Xie
  0 siblings, 1 reply; 27+ messages in thread
From: Yafang Shao @ 2026-03-02 14:35 UTC (permalink / raw)
  To: Kairui Song
  Cc: Barry Song, bingfangguo, lenohou, akpm, axelrasmussen,
	linux-kernel, linux-mm, weixugc, wjl.linux, yuanchu, yuzhao

On Mon, Mar 2, 2026 at 5:48 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Mon, Mar 2, 2026 at 5:20 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Mon, Mar 2, 2026 at 4:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > The challenge we're currently facing is that we don't yet know which
> > > workloads would benefit from it ;)
> > > We do want to enable mglru on our production servers, but first we
> > > need to address the risk of OOM during the switch—that's exactly why
> > > we're proposing this patch.
> >
> > Nobody objects to your intention to fix it. I’m curious: to what
> > extent do we want to fix it? Do we aim to merely reduce the probability
> > of OOM and other mistakes, or do we want a complete fix that makes
> > the dynamic on/off fully safe?
>
> Yeah, I'm glad that more people are trying MGLRU and improving it.
>
> We also have an downstream fix for the OOM on switch issue, but that's
> mostly as a fallback in case MGLRU doesn't work well, our goal is
> still try to enable MGLRU as much as possible,

Our goals are aligned.
Before enabling mglru, we must first ensure it won't cause OOM errors
across multiple servers. We propose fixing this because, during our
previous mglru enablement, many instances of a single service OOM'd
simultaneously—potentially leading to data loss for that service.

> many issues have been
> identified and I'm willing to push and fix things upstream together.
>
> I didn't consider the the OOM on switch an upstream issue though.

This is a serious upstream kernel bug that could lead to data loss. If
it is not recognized as such, the upstream kernel should consider
removing this dynamic toggle.

> But
> to fix that we just used a schedule_timeout when seeing the lru status

So your proposal is essentially something like this?

    while (status) {
         schedule_timeout(random_timeout);
    }

> is different from the global status, very close to what Barry
> suggested, with some other tweaks.
>
> Keep doing the reclaim during the switch did result in some unexpected
> behaviors, including OOM still occurring, just much more unlikely than
> before. Like a typical TOCTOU problem for checking the lru's status.
>
> Let me Cc BIngfang, maybe he can provide more detail.

Looking forward to your solution.

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-03-02 14:35                 ` Yafang Shao
@ 2026-03-02 17:51                   ` Yuanchu Xie
  2026-03-03  1:34                     ` Barry Song
  0 siblings, 1 reply; 27+ messages in thread
From: Yuanchu Xie @ 2026-03-02 17:51 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Kairui Song, Barry Song, bingfangguo, lenohou, akpm,
	axelrasmussen, linux-kernel, linux-mm, weixugc, wjl.linux,
	yuzhao

Hi Yafang,

On Mon, Mar 2, 2026 at 8:36 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Mon, Mar 2, 2026 at 5:48 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Mon, Mar 2, 2026 at 5:20 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Mon, Mar 2, 2026 at 4:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > The challenge we're currently facing is that we don't yet know which
> > > > workloads would benefit from it ;)
> > > > We do want to enable mglru on our production servers, but first we
> > > > need to address the risk of OOM during the switch—that's exactly why
> > > > we're proposing this patch.
> > >
> > > Nobody objects to your intention to fix it. I’m curious: to what
> > > extent do we want to fix it? Do we aim to merely reduce the probability
> > > of OOM and other mistakes, or do we want a complete fix that makes
> > > the dynamic on/off fully safe?
> >
> > Yeah, I'm glad that more people are trying MGLRU and improving it.
> >
> > We also have an downstream fix for the OOM on switch issue, but that's
> > mostly as a fallback in case MGLRU doesn't work well, our goal is
> > still try to enable MGLRU as much as possible,
>
> Our goals are aligned.
> Before enabling mglru, we must first ensure it won't cause OOM errors
> across multiple servers. We propose fixing this because, during our
> previous mglru enablement, many instances of a single service OOM'd
> simultaneously—potentially leading to data loss for that service.

Would it be possible to drain the jobs away from the machine before
switching LRUs? The MGLRU kill-switch could be improved, but making
the switch more or less "hitless" would require significant work. Is
the use case a one-time switch from active/inactive to MGLRU?
I do want to note that OOMs causing data loss is not really the kernel's fault.

Thanks,
Yuanchu


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-03-02 17:51                   ` Yuanchu Xie
@ 2026-03-03  1:34                     ` Barry Song
  2026-03-03  1:40                       ` Axel Rasmussen
  0 siblings, 1 reply; 27+ messages in thread
From: Barry Song @ 2026-03-03  1:34 UTC (permalink / raw)
  To: Yuanchu Xie
  Cc: Yafang Shao, Kairui Song, bingfangguo, lenohou, akpm,
	axelrasmussen, linux-kernel, linux-mm, weixugc, wjl.linux,
	yuzhao

On Tue, Mar 3, 2026 at 1:52 AM Yuanchu Xie <yuanchu@google.com> wrote:
>
> Hi Yafang,
>
> On Mon, Mar 2, 2026 at 8:36 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Mon, Mar 2, 2026 at 5:48 PM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > On Mon, Mar 2, 2026 at 5:20 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Mon, Mar 2, 2026 at 4:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > >
> > > > > The challenge we're currently facing is that we don't yet know which
> > > > > workloads would benefit from it ;)
> > > > > We do want to enable mglru on our production servers, but first we
> > > > > need to address the risk of OOM during the switch—that's exactly why
> > > > > we're proposing this patch.
> > > >
> > > > Nobody objects to your intention to fix it. I’m curious: to what
> > > > extent do we want to fix it? Do we aim to merely reduce the probability
> > > > of OOM and other mistakes, or do we want a complete fix that makes
> > > > the dynamic on/off fully safe?
> > >
> > > Yeah, I'm glad that more people are trying MGLRU and improving it.
> > >
> > > We also have an downstream fix for the OOM on switch issue, but that's
> > > mostly as a fallback in case MGLRU doesn't work well, our goal is
> > > still try to enable MGLRU as much as possible,
> >
> > Our goals are aligned.
> > Before enabling mglru, we must first ensure it won't cause OOM errors
> > across multiple servers. We propose fixing this because, during our
> > previous mglru enablement, many instances of a single service OOM'd
> > simultaneously—potentially leading to data loss for that service.
>
> Would it be possible to drain the jobs away from the machine before
> switching LRUs? The MGLRU kill-switch could be improved, but making
> the switch more or less "hitless" would require significant work. Is
> the use case a one-time switch from active/inactive to MGLRU?

I guess the point is that if upstream provides a sysctl to
toggle MGLRU on and off, then that sysctl should actually
work as intended. Otherwise, it would be better to remove
it.

Based on the previous discussion, we have two options:

1. Reduce the likelihood of OOM and other errors.
This could be achieved either by applying Leno's patch,
which suggests shrinking both MGLRU and active/inactive
lists during switching, or by making shrink_lruvec wait
until the switching is complete via schedule_timeout().

Note that there is no guarantee the switching state
won’t change during shrink_lruvec.

2. Ensure that shrinking and switching do not occur
simultaneously by using something like an rwsem —
shrinking can proceed in parallel under the read
lock, while the (rare) switching path takes the
write lock.

If we want to keep the toggle, we could at least make a
small change to reduce the likelihood of mistakes?

> I do want to note that OOMs causing data loss is not really the kernel's fault.
>

Best Regards
Barry


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-03-03  1:34                     ` Barry Song
@ 2026-03-03  1:40                       ` Axel Rasmussen
  2026-03-03  2:43                         ` Yafang Shao
  0 siblings, 1 reply; 27+ messages in thread
From: Axel Rasmussen @ 2026-03-03  1:40 UTC (permalink / raw)
  To: Barry Song
  Cc: Yuanchu Xie, Yafang Shao, Kairui Song, bingfangguo, lenohou,
	akpm, linux-kernel, linux-mm, weixugc, wjl.linux, yuzhao

On Mon, Mar 2, 2026 at 5:34 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Mar 3, 2026 at 1:52 AM Yuanchu Xie <yuanchu@google.com> wrote:
> >
> > Hi Yafang,
> >
> > On Mon, Mar 2, 2026 at 8:36 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Mon, Mar 2, 2026 at 5:48 PM Kairui Song <ryncsn@gmail.com> wrote:
> > > >
> > > > On Mon, Mar 2, 2026 at 5:20 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > On Mon, Mar 2, 2026 at 4:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > >
> > > > > > The challenge we're currently facing is that we don't yet know which
> > > > > > workloads would benefit from it ;)
> > > > > > We do want to enable mglru on our production servers, but first we
> > > > > > need to address the risk of OOM during the switch—that's exactly why
> > > > > > we're proposing this patch.
> > > > >
> > > > > Nobody objects to your intention to fix it. I’m curious: to what
> > > > > extent do we want to fix it? Do we aim to merely reduce the probability
> > > > > of OOM and other mistakes, or do we want a complete fix that makes
> > > > > the dynamic on/off fully safe?
> > > >
> > > > Yeah, I'm glad that more people are trying MGLRU and improving it.
> > > >
> > > > We also have an downstream fix for the OOM on switch issue, but that's
> > > > mostly as a fallback in case MGLRU doesn't work well, our goal is
> > > > still try to enable MGLRU as much as possible,
> > >
> > > Our goals are aligned.
> > > Before enabling mglru, we must first ensure it won't cause OOM errors
> > > across multiple servers. We propose fixing this because, during our
> > > previous mglru enablement, many instances of a single service OOM'd
> > > simultaneously—potentially leading to data loss for that service.
> >
> > Would it be possible to drain the jobs away from the machine before
> > switching LRUs? The MGLRU kill-switch could be improved, but making
> > the switch more or less "hitless" would require significant work. Is
> > the use case a one-time switch from active/inactive to MGLRU?
>
> I guess the point is that if upstream provides a sysctl to
> toggle MGLRU on and off, then that sysctl should actually
> work as intended. Otherwise, it would be better to remove
> it.

I think the problem is the requirements are not well specified. :)

Is it enough for the knob to function well on idle systems? Or does it
need to function "ideally" under all conceivable workloads / stress?
Also how do we define "ideally" - is a stray OOM kill acceptable or
not? Is that preferable to waiting on the switch / drain to complete
during reclaim or not? Reasonable users could disagree.

>
> Based on the previous discussion, we have two options:
>
> 1. Reduce the likelihood of OOM and other errors.
> This could be achieved either by applying Leno's patch,
> which suggests shrinking both MGLRU and active/inactive
> lists during switching, or by making shrink_lruvec wait
> until the switching is complete via schedule_timeout().
>
> Note that there is no guarantee the switching state
> won’t change during shrink_lruvec.
>
> 2. Ensure that shrinking and switching do not occur
> simultaneously by using something like an rwsem —
> shrinking can proceed in parallel under the read
> lock, while the (rare) switching path takes the
> write lock.
>
> If we want to keep the toggle, we could at least make a
> small change to reduce the likelihood of mistakes?
>
> > I do want to note that OOMs causing data loss is not really the kernel's fault.
> >
>
> Best Regards
> Barry


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-03-03  1:40                       ` Axel Rasmussen
@ 2026-03-03  2:43                         ` Yafang Shao
  2026-03-03  8:27                           ` Bingfang Guo
  0 siblings, 1 reply; 27+ messages in thread
From: Yafang Shao @ 2026-03-03  2:43 UTC (permalink / raw)
  To: Axel Rasmussen
  Cc: Barry Song, Yuanchu Xie, Kairui Song, bingfangguo, lenohou, akpm,
	linux-kernel, linux-mm, weixugc, wjl.linux, yuzhao

On Tue, Mar 3, 2026 at 9:40 AM Axel Rasmussen <axelrasmussen@google.com> wrote:
>
> On Mon, Mar 2, 2026 at 5:34 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Tue, Mar 3, 2026 at 1:52 AM Yuanchu Xie <yuanchu@google.com> wrote:
> > >
> > > Hi Yafang,
> > >
> > > On Mon, Mar 2, 2026 at 8:36 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Mon, Mar 2, 2026 at 5:48 PM Kairui Song <ryncsn@gmail.com> wrote:
> > > > >
> > > > > On Mon, Mar 2, 2026 at 5:20 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > On Mon, Mar 2, 2026 at 4:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > > >
> > > > > > > The challenge we're currently facing is that we don't yet know which
> > > > > > > workloads would benefit from it ;)
> > > > > > > We do want to enable mglru on our production servers, but first we
> > > > > > > need to address the risk of OOM during the switch—that's exactly why
> > > > > > > we're proposing this patch.
> > > > > >
> > > > > > Nobody objects to your intention to fix it. I’m curious: to what
> > > > > > extent do we want to fix it? Do we aim to merely reduce the probability
> > > > > > of OOM and other mistakes, or do we want a complete fix that makes
> > > > > > the dynamic on/off fully safe?
> > > > >
> > > > > Yeah, I'm glad that more people are trying MGLRU and improving it.
> > > > >
> > > > > We also have an downstream fix for the OOM on switch issue, but that's
> > > > > mostly as a fallback in case MGLRU doesn't work well, our goal is
> > > > > still try to enable MGLRU as much as possible,
> > > >
> > > > Our goals are aligned.
> > > > Before enabling mglru, we must first ensure it won't cause OOM errors
> > > > across multiple servers. We propose fixing this because, during our
> > > > previous mglru enablement, many instances of a single service OOM'd
> > > > simultaneously—potentially leading to data loss for that service.
> > >
> > > Would it be possible to drain the jobs away from the machine before
> > > switching LRUs? The MGLRU kill-switch could be improved, but making
> > > the switch more or less "hitless" would require significant work. Is
> > > the use case a one-time switch from active/inactive to MGLRU?
> >
> > I guess the point is that if upstream provides a sysctl to
> > toggle MGLRU on and off, then that sysctl should actually
> > work as intended. Otherwise, it would be better to remove
> > it.
>
> I think the problem is the requirements are not well specified. :)

We are planning to enable MGLRU across our large server fleet. During
a previous enablement attempt, we observed multiple instances of a
single service experiencing OOM errors simultaneously, which led to
unexpected user data loss. Despite this, we remain committed to
rolling out MGLRU to more production servers, with the critical
requirement of avoiding OOM events during the transition.

Given the scale of our fleet, it is not feasible to enable MGLRU on
servers one by one while continuously monitoring for OOM occurrences.
Therefore, we need to modify the kernel to minimize the risk of OOM
errors during the enablement process.


>
> Is it enough for the knob to function well on idle systems? Or does it
> need to function "ideally" under all conceivable workloads / stress?
> Also how do we define "ideally" - is a stray OOM kill acceptable or
> not? Is that preferable to waiting on the switch / drain to complete
> during reclaim or not? Reasonable users could disagree.


-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-03-03  2:43                         ` Yafang Shao
@ 2026-03-03  8:27                           ` Bingfang Guo
  0 siblings, 0 replies; 27+ messages in thread
From: Bingfang Guo @ 2026-03-03  8:27 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Axel Rasmussen, Barry Song, Yuanchu Xie, Kairui Song,
	BINGFANG GUO, lenohou, akpm, linux-kernel, linux-mm, weixugc,
	wjl.linux, yuzhao

Resend due to missing In-Reply-To header.

Hi all. Thanks for inviting me to the discussion. I'm glad to join you and share
my ideas and findings with you.

On Mon, Mar 2, 2026 at 4:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Mon, Mar 2, 2026 at 4:00 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Mon, Mar 2, 2026 at 3:43 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Mon, Mar 2, 2026 at 2:58 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > I assume latency is not a concern for a very rare
> > > > MGLRU on/off case. Do you require the switch to happen
> > > > with zero latency?
> > > > My main concern is the correctness of the code.
> > > >
> > > > Now the proposed patch is:
> > > >
> > > > +       bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> > > > +       bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
> > > >
> > > > Then choose MGLRU or active/inactive LRU based on
> > > > those values.
> > > >
> > > > However, nothing prevents those values from changing
> > > > after they are read. Even within the shrink path,
> > > > they can still change.
> >
> > Hi all,
> >
> > > If these values are changed during reclaim, the currently running
> > > reclaimer will continue to operate with the old settings, while any
> > > new reclaimer processes will adopt the new values. This approach
> > > should prevent any immediate issues, but the primary risk of this
> > > lockless method is the potential for a user to rapidly toggle the
> > > MGLRU feature, particularly during an intermediate state.
> > >
> > > >
> > > > So I think we need an rwsem or something similar here —
> > > > a read lock for shrink and a write lock for on/off. The
> > > > write lock should happen very rarely.
> > >
> > > We can introduce a lock-based mechanism in v2.
> >
> > I hope we don't need a lock here. Currently there is only a static
> > key, this patch is already adding more branches, a lock will make
> > things more complex and the shrinking path is quite performance
> > sensitive.
> >
> > > >
> > > > To be honest, the on/off toggle is quite odd. If possible,
> > > > I’d prefer not to switch MGLRU or active/inactive
> > > > dynamically. Once it’s set up during system boot, it
> > > > should remain unchanged.
> > >
> > > While it is well-suited for Android environments, it is not viable for
> > > Kubernetes production servers, where rebooting is highly disruptive.
> > > This limitation is precisely why we need to introduce dynamic toggles.
> >
> > I agree with Barry, the switch isn't supposed to be a knob to be
> > turned on/off frequently. And I think in the long term we should just
> > identify the workloads where MGLRU doesn't work well, and fix MGLRU.
>
> The challenge we're currently facing is that we don't yet know which
> workloads would benefit from it ;)
> We do want to enable mglru on our production servers, but first we
> need to address the risk of OOM during the switch—that's exactly why
> we're proposing this patch.

Yes. I believe our long term target is to integrate the two LRU implementations.
But for now, it's important to keep this dynamic toggling feature and make it
robust and work well. So if users are willing to try the new LRU algorithm, they
are free to enable it after system boots for testing, and disable it if they run
into some trouble without worrying about OOM and other problems. Therefore, we
can have more users and potentially expose more problems related to MGLRU and
fix them.


On Mon, Mar 3, 2026 at 1:34 AM Barry Sone <21cnbao@gmail.com> wrote:
> 2. Ensure that shrinking and switching do not occur
> simultaneously by using something like an rwsem —
> shrinking can proceed in parallel under the read
> lock, while the (rare) switching path takes the
> write lock.

In my opinion, completely banning others from reclaming seems to demand more
than needed. We have many huge servers with services running in enourmous memcg.
In such case, waiting for the draining to complete may take so long (tens of
seconds for example) that the service get many timeout failures. But there's
high chance that reclaimers can still reclaim enough even if the draning is not
completed. So maybe we can have concurrent reclaming and state switch draining?



Regarding the discussion, I would like to propose a slightly different approach
that is already in use in production. The proposal mainly focuses on two
practical considerations:

1. State switching is a rare operation. So we should not penalize the normal
reclaim path or introduce more locks for this rare case.
2. We should avoid introducing long latency spikes during production state
transitions (e.g., switching on live machines).

The downstream solution is very similar to a combination of all your proposals,
but with some radical attempts to try to avoid sleeping therefore reduce lags
from waiting as much as possible. At the same time, we keep a last-chance wait
to prevent early OOMs from happening.

We use a static key to indicate that the state change is in progress. All
operations are encapsulated in that slow path so no extra overhead for the
normal path. If we are in the draining process, we first try reclaiming from
where lrugen->enabled says we are, if we still have a lot of retry times left.
With no retries left, simply wait until the we pass the race window.

--
Thanks
Bingfang

--
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 614ccf39fe3f..d7ff7a6ed088 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2652,6 +2652,43 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
 
 #ifdef CONFIG_LRU_GEN
 
+DEFINE_STATIC_KEY_FALSE(lru_gen_draining);
+
+static inline bool lru_gen_is_draining(void)
+{
+       return static_branch_unlikely(&lru_gen_draining);
+}
+
+/*
+ * Lazily wait for the draining thread to finish if it's running.
+ *
+ * Return: whether we'd like to reclaim from multi-gen LRU.
+ */
+static inline bool lru_gen_draining_wait(struct lruvec *lruvec, struct scan_control *sc)
+{
+       bool global_enabled = lru_gen_enabled();
+
+       /* Try reclaiming from the current LRU first */
+       if (sc->priority > DEF_PRIORITY / 2)
+               return READ_ONCE(lruvec->lrugen.enabled);
+
+       /* Oops, try from the other side... */
+       if (sc->priority > 1)
+               return global_enabled;
+
+       /*
+        * If we see lrugen.enabled is consistent here, when we get the lru
+        * spinlock, the migrating thread will have filled the lruvec with some
+        * pages, so we can continue without waiting.
+        */
+       while (global_enabled ^ READ_ONCE(lruvec->lrugen.enabled)) {
+               /* Not switching this one yet. Wait for a while. */
+               schedule_timeout_uninterruptible(1);
+       }
+
+       return global_enabled;
+}
+
 #ifdef CONFIG_LRU_GEN_ENABLED
 DEFINE_STATIC_KEY_ARRAY_TRUE(lru_gen_caps, NR_LRU_GEN_CAPS);
 #define get_cap(cap)   static_branch_likely(&lru_gen_caps[cap])
@@ -5171,6 +5208,8 @@ static void lru_gen_change_state(bool enabled)
        if (enabled == lru_gen_enabled())
                goto unlock;
 
+       static_branch_enable_cpuslocked(&lru_gen_draining);
+
        if (enabled)
                static_branch_enable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]);
        else
@@ -5201,6 +5240,9 @@ static void lru_gen_change_state(bool enabled)
 
                cond_resched();
        } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+       static_branch_disable_cpuslocked(&lru_gen_draining);
+
 unlock:
        mutex_unlock(&state_mutex);
        put_online_mems();
@@ -5752,6 +5794,16 @@ late_initcall(init_lru_gen);
 
 #else /* !CONFIG_LRU_GEN */
 
+static inline bool lru_gen_is_draining(void)
+{
+       return false;
+}
+
+static inline bool shrink_lruvec_draining(struct lruvec *lruvec, struct scan_control *sc)
+{
+       return false;
+}
+
 static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 {
        BUILD_BUG();
@@ -5780,7 +5832,10 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
        bool proportional_reclaim;
        struct blk_plug plug;
 
-       if (lru_gen_enabled() && !root_reclaim(sc)) {
+       if (lru_gen_is_draining() && lru_gen_draining_wait(lruvec, sc)) {
+               lru_gen_shrink_lruvec(lruvec, sc);
+               return;
+       } else if (lru_gen_enabled() && !root_reclaim(sc)) {
                lru_gen_shrink_lruvec(lruvec, sc);
                return;
        }

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-03-02  8:00         ` Kairui Song
  2026-03-02  8:15           ` Barry Song
  2026-03-02  8:25           ` Yafang Shao
@ 2026-03-02 16:26           ` Michal Hocko
  2 siblings, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2026-03-02 16:26 UTC (permalink / raw)
  To: Kairui Song
  Cc: Yafang Shao, Barry Song, lenohou, akpm, axelrasmussen,
	linux-kernel, linux-mm, weixugc, wjl.linux, yuanchu, yuzhao

On Mon 02-03-26 16:00:03, Kairui Song wrote:
> On Mon, Mar 2, 2026 at 3:43 PM Yafang Shao <laoar.shao@gmail.com> wrote:
[...]
> > > To be honest, the on/off toggle is quite odd. If possible,
> > > I’d prefer not to switch MGLRU or active/inactive
> > > dynamically. Once it’s set up during system boot, it
> > > should remain unchanged.
> >
> > While it is well-suited for Android environments, it is not viable for
> > Kubernetes production servers, where rebooting is highly disruptive.
> > This limitation is precisely why we need to introduce dynamic toggles.
> 
> I agree with Barry, the switch isn't supposed to be a knob to be
> turned on/off frequently. 

Is there any actual usecase other than debugging to switch the reclaim
plementation back and forth? In other words do we really need to care
about this issue at all? Is the additional code worth it?

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-03-02  7:43       ` Yafang Shao
  2026-03-02  8:00         ` Kairui Song
@ 2026-03-02  8:03         ` Barry Song
  2026-03-02  8:13           ` Yafang Shao
  1 sibling, 1 reply; 27+ messages in thread
From: Barry Song @ 2026-03-02  8:03 UTC (permalink / raw)
  To: Yafang Shao
  Cc: lenohou, akpm, axelrasmussen, linux-kernel, linux-mm, weixugc,
	wjl.linux, yuanchu, yuzhao

On Mon, Mar 2, 2026 at 3:43 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Mon, Mar 2, 2026 at 2:58 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Mon, Mar 2, 2026 at 1:50 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Sun, Mar 1, 2026 at 5:28 AM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Sun, Mar 1, 2026 at 12:10 AM Leno Hou <lenohou@gmail.com> wrote:
> > > > >
> > > > > When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
> > > > > condition exists between the state switching and the memory reclaim
> > > > > path. This can lead to unexpected cgroup OOM kills, even when plenty of
> > > > > reclaimable memory is available.
> > > > >
> > > > > *** Problem Description ***
> > > > >
> > > > > The issue arises from a "reclaim vacuum" during the transition:
> > > > >
> > > > > 1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
> > > > >    false before the pages are drained from MGLRU lists back to
> > > > >    traditional LRU lists.
> > > > > 2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
> > > > >    and skip the MGLRU path.
> > > > > 3. However, these pages might not have reached the traditional LRU lists
> > > > >    yet, or the changes are not yet visible to all CPUs due to a lack of
> > > > >    synchronization.
> > > > > 4. get_scan_count() subsequently finds traditional LRU lists empty,
> > > > >    concludes there is no reclaimable memory, and triggers an OOM kill.
> > > > >
> > > > > A similar race can occur during enablement, where the reclaimer sees
> > > > > the new state but the MGLRU lists haven't been populated via
> > > > > fill_evictable() yet.
> > > > >
> > > > > *** Solution ***
> > > > >
> > > > > Introduce a 'draining' state to bridge the gap during transitions:
> > > > >
> > > > > - Use smp_store_release() and smp_load_acquire() to ensure the visibility
> > > > >   of 'enabled' and 'draining' flags across CPUs.
> > > > > - Modify shrink_lruvec() to allow a "joint reclaim" period. If an lruvec
> > > > >   is in the 'draining' state, the reclaimer will attempt to scan MGLRU
> > > > >   lists first, and then fall through to traditional LRU lists instead
> > > > >   of returning early. This ensures that folios are visible to at least
> > > > >   one reclaim path at any given time.
> > > > >
> > > > > *** Reproduction ***
> > > > >
> > > > > The issue was consistently reproduced on v6.1.157 and v6.18.3 using
> > > > > a high-pressure memory cgroup (v1) environment.
> > > > >
> > > > > Reproduction steps:
> > > > > 1. Create a 16GB memcg and populate it with 10GB file cache (5GB active)
> > > > >    and 8GB active anonymous memory.
> > > > > 2. Toggle MGLRU state while performing new memory allocations to force
> > > > >    direct reclaim.
> > > > >
> > > > > Reproduction script:
> > > > > ---
> > > > > #!/bin/bash
> > > > > # Fixed reproduction for memcg OOM during MGLRU toggle
> > > > > set -euo pipefail
> > > > >
> > > > > MGLRU_FILE="/sys/kernel/mm/lru_gen/enabled"
> > > > > CGROUP_PATH="/sys/fs/cgroup/memory/memcg_oom_test"
> > > > >
> > > > > # Switch MGLRU dynamically in the background
> > > > > switch_mglru() {
> > > > >     local orig_val=$(cat "$MGLRU_FILE")
> > > > >     if [[ "$orig_val" != "0x0000" ]]; then
> > > > >         echo n > "$MGLRU_FILE" &
> > > > >     else
> > > > >         echo y > "$MGLRU_FILE" &
> > > > >     fi
> > > > > }
> > > > >
> > > > > # Setup 16G memcg
> > > > > mkdir -p "$CGROUP_PATH"
> > > > > echo $((16 * 1024 * 1024 * 1024)) > "$CGROUP_PATH/memory.limit_in_bytes"
> > > > > echo $$ > "$CGROUP_PATH/cgroup.procs"
> > > > >
> > > > > # 1. Build memory pressure (File + Anon)
> > > > > dd if=/dev/urandom of=/tmp/test_file bs=1M count=10240
> > > > > dd if=/tmp/test_file of=/dev/null bs=1M # Warm up cache
> > > > >
> > > > > stress-ng --vm 1 --vm-bytes 8G --vm-keep -t 600 &
> > > > > sleep 5
> > > > >
> > > > > # 2. Trigger switch and concurrent allocation
> > > > > switch_mglru
> > > > > stress-ng --vm 1 --vm-bytes 2G --vm-populate --timeout 5s || echo "OOM Triggered"
> > > > >
> > > > > # Check OOM counter
> > > > > grep oom_kill "$CGROUP_PATH/memory.oom_control"
> > > > > ---
> > > > >
> > > > > Signed-off-by: Leno Hou <lenohou@gmail.com>
> > > > >
> > > > > ---
> > > > > To: linux-mm@kvack.org
> > > > > To: linux-kernel@vger.kernel.org
> > > > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > > > Cc: Axel Rasmussen <axelrasmussen@google.com>
> > > > > Cc: Yuanchu Xie <yuanchu@google.com>
> > > > > Cc: Wei Xu <weixugc@google.com>
> > > > > Cc: Barry Song <21cnbao@gmail.com>
> > > > > Cc: Jialing Wang <wjl.linux@gmail.com>
> > > > > Cc: Yafang Shao <laoar.shao@gmail.com>
> > > > > Cc: Yu Zhao <yuzhao@google.com>
> > > > > ---
> > > > >  include/linux/mmzone.h |  2 ++
> > > > >  mm/vmscan.c            | 14 +++++++++++---
> > > > >  2 files changed, 13 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > > > index 7fb7331c5725..0648ce91dbc6 100644
> > > > > --- a/include/linux/mmzone.h
> > > > > +++ b/include/linux/mmzone.h
> > > > > @@ -509,6 +509,8 @@ struct lru_gen_folio {
> > > > >         atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
> > > > >         /* whether the multi-gen LRU is enabled */
> > > > >         bool enabled;
> > > > > +       /* whether the multi-gen LRU is draining to LRU */
> > > > > +       bool draining;
> > > > >         /* the memcg generation this lru_gen_folio belongs to */
> > > > >         u8 gen;
> > > > >         /* the list segment this lru_gen_folio belongs to */
> > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > index 06071995dacc..629a00681163 100644
> > > > > --- a/mm/vmscan.c
> > > > > +++ b/mm/vmscan.c
> > > > > @@ -5222,7 +5222,8 @@ static void lru_gen_change_state(bool enabled)
> > > > >                         VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
> > > > >                         VM_WARN_ON_ONCE(!state_is_valid(lruvec));
> > > > >
> > > > > -                       lruvec->lrugen.enabled = enabled;
> > > > > +                       smp_store_release(&lruvec->lrugen.enabled, enabled);
> > > > > +                       smp_store_release(&lruvec->lrugen.draining, true);
> > > > >
> > > > >                         while (!(enabled ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
> > > > >                                 spin_unlock_irq(&lruvec->lru_lock);
> > > > > @@ -5230,6 +5231,8 @@ static void lru_gen_change_state(bool enabled)
> > > > >                                 spin_lock_irq(&lruvec->lru_lock);
> > > > >                         }
> > > > >
> > > > > +                       smp_store_release(&lruvec->lrugen.draining, false);
> > > > > +
> > > > >                         spin_unlock_irq(&lruvec->lru_lock);
> > > > >                 }
> > > > >
> > > > > @@ -5813,10 +5816,15 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> > > > >         unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> > > > >         bool proportional_reclaim;
> > > > >         struct blk_plug plug;
> > > > > +       bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> > > > > +       bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
> > > > >
> > > > > -       if (lru_gen_enabled() && !root_reclaim(sc)) {
> > > > > +       if (lrugen_enabled || lru_draining && !root_reclaim(sc)) {
> > > > >                 lru_gen_shrink_lruvec(lruvec, sc);
> > > > > -               return;
> > > >
> > >
> > > Hello Barry,
> > >
> > > > Is it possible to simply wait for draining to finish instead of performing
> > > > an lru_gen/lru shrink while lru_gen is being disabled or enabled?
> > >
> > > This might introduce unexpected latency spikes during the waiting period.
> >
> > I assume latency is not a concern for a very rare
> > MGLRU on/off case. Do you require the switch to happen
> > with zero latency?
> > My main concern is the correctness of the code.
> >
> > Now the proposed patch is:
> >
> > +       bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> > +       bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
> >
> > Then choose MGLRU or active/inactive LRU based on
> > those values.
> >
> > However, nothing prevents those values from changing
> > after they are read. Even within the shrink path,
> > they can still change.
>
> If these values are changed during reclaim, the currently running
> reclaimer will continue to operate with the old settings, while any
> new reclaimer processes will adopt the new values. This approach
> should prevent any immediate issues, but the primary risk of this
> lockless method is the potential for a user to rapidly toggle the
> MGLRU feature, particularly during an intermediate state.
>
> >
> > So I think we need an rwsem or something similar here —
> > a read lock for shrink and a write lock for on/off. The
> > write lock should happen very rarely.
>
> We can introduce a lock-based mechanism in v2.

Honestly, the on/off toggle is quite fragile. For instance,

folio_check_references() is doing:

        if (lru_gen_enabled()) {
                if (!referenced_ptes)
                        return FOLIOREF_RECLAIM;

                return lru_gen_set_refs(folio) ? FOLIOREF_ACTIVATE :
FOLIOREF_KEEP;
        }

However, `lru_gen_enabled()` does not indicate the actual LRU
where the folio resides.

`lru_gen_enabled()` is called in many places, but in this case it does
not accurately reflect where folios are placed if a dynamic toggle is
active. During the switching, many unexpected behaviors may occur.

>
> >
> > >
> > > >
> > > > Performing a shrink in an intermediate state may still involve a lot of
> > > > uncertainty, depending on how far the shrink has progressed and how much
> > > > remains in each side’s LRU？
> > >
> > > The workingset might not be reliable in this intermediate state.
> > > However, since switching MGLRU should not be a frequent operation in a
> > > production environment, I believe the workingset in this intermediate
> > > state should not be a concern. The only reason we would enable or
> > > disable MGLRU is if we find that certain workloads benefit from
> > > it—enabling it when it helps, and disabling it when it causes
> > > degradation. There should be no other scenario in which we would need
> > > to toggle MGLRU on or off.
> > >
> > > To identify which workloads can benefit from MGLRU, we must first
> > > ensure that switching it on or off is safe—which is precisely why we
> > > are proposing this patch. Once MGLRU is enabled in production, we can
> > > continue to improve it. Perhaps in the future, we can even implement a
> > > per-workload reclaim mechanism.
> >
> > To be honest, the on/off toggle is quite odd. If possible,
> > I’d prefer not to switch MGLRU or active/inactive
> > dynamically. Once it’s set up during system boot, it
> > should remain unchanged.
>
> While it is well-suited for Android environments, it is not viable for
> Kubernetes production servers, where rebooting is highly disruptive.
> This limitation is precisely why we need to introduce dynamic toggles.

Perhaps we really need to unify MGLRU with the active/inactive lists,
combining the benefits of both approaches. The dynamic toggle, as it
stands, is quite fragile.
A topic was suggested by Kairui here [1].

>
> >
> > If we want a per-workload LRU, this could be a good
> > place for eBPF to hook into folio enqueue, dequeue,
> > and scanning. There is a project related to this [1][2].
> >
> > // Policy function hooks
> > struct cache_ext_ops {
> >        s32 (*policy_init)(struct mem_cgroup *memcg);
> >        // Propose folios to evict
> >        void (*evict_folios)(struct eviction_ctx *ctx,
> >                  struct mem_cgroup *memcg);
> >        void (*folio_added)(struct folio *folio);
> >        void (*folio_accessed)(struct folio *folio);
> >        // Folio was removed: clean up metadata
> >        void (*folio_removed)(struct folio *folio);
> >        char name[CACHE_EXT_OPS_NAME_LEN];
> > };
> >
> > However, we would need a very strong and convincing
> > user case to justify it.
>
> Thanks for the info.
> We're actually already running a BPF-based reclaimer in production,
> but we don't have immediate plans to upstream or propose it just yet.

I know you are always far ahead of everyone else. I’m looking forward
to seeing your code and use cases when you are ready.

[1] https://lore.kernel.org/linux-mm/CAMgjq7BoekNjg-Ra3C8M7=8=75su38w=HD782T5E_cxyeCeH_g@mail.gmail.com/

Thanks
Barry


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-03-02  8:03         ` Barry Song
@ 2026-03-02  8:13           ` Yafang Shao
  2026-03-02  8:20             ` Barry Song
  0 siblings, 1 reply; 27+ messages in thread
From: Yafang Shao @ 2026-03-02  8:13 UTC (permalink / raw)
  To: Barry Song
  Cc: lenohou, akpm, axelrasmussen, linux-kernel, linux-mm, weixugc,
	wjl.linux, yuanchu, yuzhao

On Mon, Mar 2, 2026 at 4:04 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Mon, Mar 2, 2026 at 3:43 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Mon, Mar 2, 2026 at 2:58 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Mon, Mar 2, 2026 at 1:50 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Sun, Mar 1, 2026 at 5:28 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > On Sun, Mar 1, 2026 at 12:10 AM Leno Hou <lenohou@gmail.com> wrote:
> > > > > >
> > > > > > When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
> > > > > > condition exists between the state switching and the memory reclaim
> > > > > > path. This can lead to unexpected cgroup OOM kills, even when plenty of
> > > > > > reclaimable memory is available.
> > > > > >
> > > > > > *** Problem Description ***
> > > > > >
> > > > > > The issue arises from a "reclaim vacuum" during the transition:
> > > > > >
> > > > > > 1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
> > > > > >    false before the pages are drained from MGLRU lists back to
> > > > > >    traditional LRU lists.
> > > > > > 2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
> > > > > >    and skip the MGLRU path.
> > > > > > 3. However, these pages might not have reached the traditional LRU lists
> > > > > >    yet, or the changes are not yet visible to all CPUs due to a lack of
> > > > > >    synchronization.
> > > > > > 4. get_scan_count() subsequently finds traditional LRU lists empty,
> > > > > >    concludes there is no reclaimable memory, and triggers an OOM kill.
> > > > > >
> > > > > > A similar race can occur during enablement, where the reclaimer sees
> > > > > > the new state but the MGLRU lists haven't been populated via
> > > > > > fill_evictable() yet.
> > > > > >
> > > > > > *** Solution ***
> > > > > >
> > > > > > Introduce a 'draining' state to bridge the gap during transitions:
> > > > > >
> > > > > > - Use smp_store_release() and smp_load_acquire() to ensure the visibility
> > > > > >   of 'enabled' and 'draining' flags across CPUs.
> > > > > > - Modify shrink_lruvec() to allow a "joint reclaim" period. If an lruvec
> > > > > >   is in the 'draining' state, the reclaimer will attempt to scan MGLRU
> > > > > >   lists first, and then fall through to traditional LRU lists instead
> > > > > >   of returning early. This ensures that folios are visible to at least
> > > > > >   one reclaim path at any given time.
> > > > > >
> > > > > > *** Reproduction ***
> > > > > >
> > > > > > The issue was consistently reproduced on v6.1.157 and v6.18.3 using
> > > > > > a high-pressure memory cgroup (v1) environment.
> > > > > >
> > > > > > Reproduction steps:
> > > > > > 1. Create a 16GB memcg and populate it with 10GB file cache (5GB active)
> > > > > >    and 8GB active anonymous memory.
> > > > > > 2. Toggle MGLRU state while performing new memory allocations to force
> > > > > >    direct reclaim.
> > > > > >
> > > > > > Reproduction script:
> > > > > > ---
> > > > > > #!/bin/bash
> > > > > > # Fixed reproduction for memcg OOM during MGLRU toggle
> > > > > > set -euo pipefail
> > > > > >
> > > > > > MGLRU_FILE="/sys/kernel/mm/lru_gen/enabled"
> > > > > > CGROUP_PATH="/sys/fs/cgroup/memory/memcg_oom_test"
> > > > > >
> > > > > > # Switch MGLRU dynamically in the background
> > > > > > switch_mglru() {
> > > > > >     local orig_val=$(cat "$MGLRU_FILE")
> > > > > >     if [[ "$orig_val" != "0x0000" ]]; then
> > > > > >         echo n > "$MGLRU_FILE" &
> > > > > >     else
> > > > > >         echo y > "$MGLRU_FILE" &
> > > > > >     fi
> > > > > > }
> > > > > >
> > > > > > # Setup 16G memcg
> > > > > > mkdir -p "$CGROUP_PATH"
> > > > > > echo $((16 * 1024 * 1024 * 1024)) > "$CGROUP_PATH/memory.limit_in_bytes"
> > > > > > echo $$ > "$CGROUP_PATH/cgroup.procs"
> > > > > >
> > > > > > # 1. Build memory pressure (File + Anon)
> > > > > > dd if=/dev/urandom of=/tmp/test_file bs=1M count=10240
> > > > > > dd if=/tmp/test_file of=/dev/null bs=1M # Warm up cache
> > > > > >
> > > > > > stress-ng --vm 1 --vm-bytes 8G --vm-keep -t 600 &
> > > > > > sleep 5
> > > > > >
> > > > > > # 2. Trigger switch and concurrent allocation
> > > > > > switch_mglru
> > > > > > stress-ng --vm 1 --vm-bytes 2G --vm-populate --timeout 5s || echo "OOM Triggered"
> > > > > >
> > > > > > # Check OOM counter
> > > > > > grep oom_kill "$CGROUP_PATH/memory.oom_control"
> > > > > > ---
> > > > > >
> > > > > > Signed-off-by: Leno Hou <lenohou@gmail.com>
> > > > > >
> > > > > > ---
> > > > > > To: linux-mm@kvack.org
> > > > > > To: linux-kernel@vger.kernel.org
> > > > > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > > > > Cc: Axel Rasmussen <axelrasmussen@google.com>
> > > > > > Cc: Yuanchu Xie <yuanchu@google.com>
> > > > > > Cc: Wei Xu <weixugc@google.com>
> > > > > > Cc: Barry Song <21cnbao@gmail.com>
> > > > > > Cc: Jialing Wang <wjl.linux@gmail.com>
> > > > > > Cc: Yafang Shao <laoar.shao@gmail.com>
> > > > > > Cc: Yu Zhao <yuzhao@google.com>
> > > > > > ---
> > > > > >  include/linux/mmzone.h |  2 ++
> > > > > >  mm/vmscan.c            | 14 +++++++++++---
> > > > > >  2 files changed, 13 insertions(+), 3 deletions(-)
> > > > > >
> > > > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > > > > index 7fb7331c5725..0648ce91dbc6 100644
> > > > > > --- a/include/linux/mmzone.h
> > > > > > +++ b/include/linux/mmzone.h
> > > > > > @@ -509,6 +509,8 @@ struct lru_gen_folio {
> > > > > >         atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
> > > > > >         /* whether the multi-gen LRU is enabled */
> > > > > >         bool enabled;
> > > > > > +       /* whether the multi-gen LRU is draining to LRU */
> > > > > > +       bool draining;
> > > > > >         /* the memcg generation this lru_gen_folio belongs to */
> > > > > >         u8 gen;
> > > > > >         /* the list segment this lru_gen_folio belongs to */
> > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > > index 06071995dacc..629a00681163 100644
> > > > > > --- a/mm/vmscan.c
> > > > > > +++ b/mm/vmscan.c
> > > > > > @@ -5222,7 +5222,8 @@ static void lru_gen_change_state(bool enabled)
> > > > > >                         VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
> > > > > >                         VM_WARN_ON_ONCE(!state_is_valid(lruvec));
> > > > > >
> > > > > > -                       lruvec->lrugen.enabled = enabled;
> > > > > > +                       smp_store_release(&lruvec->lrugen.enabled, enabled);
> > > > > > +                       smp_store_release(&lruvec->lrugen.draining, true);
> > > > > >
> > > > > >                         while (!(enabled ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
> > > > > >                                 spin_unlock_irq(&lruvec->lru_lock);
> > > > > > @@ -5230,6 +5231,8 @@ static void lru_gen_change_state(bool enabled)
> > > > > >                                 spin_lock_irq(&lruvec->lru_lock);
> > > > > >                         }
> > > > > >
> > > > > > +                       smp_store_release(&lruvec->lrugen.draining, false);
> > > > > > +
> > > > > >                         spin_unlock_irq(&lruvec->lru_lock);
> > > > > >                 }
> > > > > >
> > > > > > @@ -5813,10 +5816,15 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> > > > > >         unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> > > > > >         bool proportional_reclaim;
> > > > > >         struct blk_plug plug;
> > > > > > +       bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> > > > > > +       bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
> > > > > >
> > > > > > -       if (lru_gen_enabled() && !root_reclaim(sc)) {
> > > > > > +       if (lrugen_enabled || lru_draining && !root_reclaim(sc)) {
> > > > > >                 lru_gen_shrink_lruvec(lruvec, sc);
> > > > > > -               return;
> > > > >
> > > >
> > > > Hello Barry,
> > > >
> > > > > Is it possible to simply wait for draining to finish instead of performing
> > > > > an lru_gen/lru shrink while lru_gen is being disabled or enabled?
> > > >
> > > > This might introduce unexpected latency spikes during the waiting period.
> > >
> > > I assume latency is not a concern for a very rare
> > > MGLRU on/off case. Do you require the switch to happen
> > > with zero latency?
> > > My main concern is the correctness of the code.
> > >
> > > Now the proposed patch is:
> > >
> > > +       bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> > > +       bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
> > >
> > > Then choose MGLRU or active/inactive LRU based on
> > > those values.
> > >
> > > However, nothing prevents those values from changing
> > > after they are read. Even within the shrink path,
> > > they can still change.
> >
> > If these values are changed during reclaim, the currently running
> > reclaimer will continue to operate with the old settings, while any
> > new reclaimer processes will adopt the new values. This approach
> > should prevent any immediate issues, but the primary risk of this
> > lockless method is the potential for a user to rapidly toggle the
> > MGLRU feature, particularly during an intermediate state.
> >
> > >
> > > So I think we need an rwsem or something similar here —
> > > a read lock for shrink and a write lock for on/off. The
> > > write lock should happen very rarely.
> >
> > We can introduce a lock-based mechanism in v2.
>
> Honestly, the on/off toggle is quite fragile. For instance,
>
> folio_check_references() is doing:
>
>         if (lru_gen_enabled()) {
>                 if (!referenced_ptes)
>                         return FOLIOREF_RECLAIM;
>
>                 return lru_gen_set_refs(folio) ? FOLIOREF_ACTIVATE :
> FOLIOREF_KEEP;
>         }
>
> However, `lru_gen_enabled()` does not indicate the actual LRU
> where the folio resides.
>
> `lru_gen_enabled()` is called in many places, but in this case it does
> not accurately reflect where folios are placed if a dynamic toggle is
> active. During the switching, many unexpected behaviors may occur.
>
> >
> > >
> > > >
> > > > >
> > > > > Performing a shrink in an intermediate state may still involve a lot of
> > > > > uncertainty, depending on how far the shrink has progressed and how much
> > > > > remains in each side’s LRU？
> > > >
> > > > The workingset might not be reliable in this intermediate state.
> > > > However, since switching MGLRU should not be a frequent operation in a
> > > > production environment, I believe the workingset in this intermediate
> > > > state should not be a concern. The only reason we would enable or
> > > > disable MGLRU is if we find that certain workloads benefit from
> > > > it—enabling it when it helps, and disabling it when it causes
> > > > degradation. There should be no other scenario in which we would need
> > > > to toggle MGLRU on or off.
> > > >
> > > > To identify which workloads can benefit from MGLRU, we must first
> > > > ensure that switching it on or off is safe—which is precisely why we
> > > > are proposing this patch. Once MGLRU is enabled in production, we can
> > > > continue to improve it. Perhaps in the future, we can even implement a
> > > > per-workload reclaim mechanism.
> > >
> > > To be honest, the on/off toggle is quite odd. If possible,
> > > I’d prefer not to switch MGLRU or active/inactive
> > > dynamically. Once it’s set up during system boot, it
> > > should remain unchanged.
> >
> > While it is well-suited for Android environments, it is not viable for
> > Kubernetes production servers, where rebooting is highly disruptive.
> > This limitation is precisely why we need to introduce dynamic toggles.
>
> Perhaps we really need to unify MGLRU with the active/inactive lists,
> combining the benefits of both approaches. The dynamic toggle, as it
> stands, is quite fragile.
> A topic was suggested by Kairui here [1].
>
> >
> > >
> > > If we want a per-workload LRU, this could be a good
> > > place for eBPF to hook into folio enqueue, dequeue,
> > > and scanning. There is a project related to this [1][2].
> > >
> > > // Policy function hooks
> > > struct cache_ext_ops {
> > >        s32 (*policy_init)(struct mem_cgroup *memcg);
> > >        // Propose folios to evict
> > >        void (*evict_folios)(struct eviction_ctx *ctx,
> > >                  struct mem_cgroup *memcg);
> > >        void (*folio_added)(struct folio *folio);
> > >        void (*folio_accessed)(struct folio *folio);
> > >        // Folio was removed: clean up metadata
> > >        void (*folio_removed)(struct folio *folio);
> > >        char name[CACHE_EXT_OPS_NAME_LEN];
> > > };
> > >
> > > However, we would need a very strong and convincing
> > > user case to justify it.
> >
> > Thanks for the info.
> > We're actually already running a BPF-based reclaimer in production,
> > but we don't have immediate plans to upstream or propose it just yet.
>
> I know you are always far ahead of everyone else. I’m looking forward
> to seeing your code and use cases when you are ready.

Don't say it that way, that is not cooperative.
We've only deployed a limited BPF-based memcg async reclaimer
internally, and it is currently scoped to our own workloads.

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-03-02  8:13           ` Yafang Shao
@ 2026-03-02  8:20             ` Barry Song
  0 siblings, 0 replies; 27+ messages in thread
From: Barry Song @ 2026-03-02  8:20 UTC (permalink / raw)
  To: Yafang Shao
  Cc: lenohou, akpm, axelrasmussen, linux-kernel, linux-mm, weixugc,
	wjl.linux, yuanchu, yuzhao

On Mon, Mar 2, 2026 at 4:14 PM Yafang Shao <laoar.shao@gmail.com> wrote:
[...]
> > > > If we want a per-workload LRU, this could be a good
> > > > place for eBPF to hook into folio enqueue, dequeue,
> > > > and scanning. There is a project related to this [1][2].
> > > >
> > > > // Policy function hooks
> > > > struct cache_ext_ops {
> > > >        s32 (*policy_init)(struct mem_cgroup *memcg);
> > > >        // Propose folios to evict
> > > >        void (*evict_folios)(struct eviction_ctx *ctx,
> > > >                  struct mem_cgroup *memcg);
> > > >        void (*folio_added)(struct folio *folio);
> > > >        void (*folio_accessed)(struct folio *folio);
> > > >        // Folio was removed: clean up metadata
> > > >        void (*folio_removed)(struct folio *folio);
> > > >        char name[CACHE_EXT_OPS_NAME_LEN];
> > > > };
> > > >
> > > > However, we would need a very strong and convincing
> > > > user case to justify it.
> > >
> > > Thanks for the info.
> > > We're actually already running a BPF-based reclaimer in production,
> > > but we don't have immediate plans to upstream or propose it just yet.
> >
> > I know you are always far ahead of everyone else. I’m looking forward
> > to seeing your code and use cases when you are ready.
>
> Don't say it that way, that is not cooperative.
> We've only deployed a limited BPF-based memcg async reclaimer
> internally, and it is currently scoped to our own workloads.
>

Don’t be so sensitive:-) When I say you are far ahead, that’s exactly
what I mean. I truly admire your work on making LRU programmable.
That’s all.

I understand that you have only deployed a limited BPF-based case,
so you need more time before sharing it. I am not criticizing you for
not sharing it yet. Please don’t misunderstand me.

Best Regards
Barry


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
@ 2026-03-03  6:37 Bingfang Guo
  0 siblings, 0 replies; 27+ messages in thread
From: Bingfang Guo @ 2026-03-03  6:37 UTC (permalink / raw)
  To: laoar.shao
  Cc: 21cnbao, akpm, axelrasmussen, BINGFANG GUO, lenohou,
	linux-kernel, linux-mm, ryncsn, weixugc, wjl.linux, yuanchu,
	yuzhao

Hi all. Thanks for inviting me to the discussion. I'm glad to join you and share
my ideas and findings with you.

On Mon, Mar 2, 2026 at 4:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Mon, Mar 2, 2026 at 4:00 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Mon, Mar 2, 2026 at 3:43 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Mon, Mar 2, 2026 at 2:58 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > I assume latency is not a concern for a very rare
> > > > MGLRU on/off case. Do you require the switch to happen
> > > > with zero latency?
> > > > My main concern is the correctness of the code.
> > > >
> > > > Now the proposed patch is:
> > > >
> > > > +       bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> > > > +       bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
> > > >
> > > > Then choose MGLRU or active/inactive LRU based on
> > > > those values.
> > > >
> > > > However, nothing prevents those values from changing
> > > > after they are read. Even within the shrink path,
> > > > they can still change.
> >
> > Hi all,
> >
> > > If these values are changed during reclaim, the currently running
> > > reclaimer will continue to operate with the old settings, while any
> > > new reclaimer processes will adopt the new values. This approach
> > > should prevent any immediate issues, but the primary risk of this
> > > lockless method is the potential for a user to rapidly toggle the
> > > MGLRU feature, particularly during an intermediate state.
> > >
> > > >
> > > > So I think we need an rwsem or something similar here —
> > > > a read lock for shrink and a write lock for on/off. The
> > > > write lock should happen very rarely.
> > >
> > > We can introduce a lock-based mechanism in v2.
> >
> > I hope we don't need a lock here. Currently there is only a static
> > key, this patch is already adding more branches, a lock will make
> > things more complex and the shrinking path is quite performance
> > sensitive.
> >
> > > >
> > > > To be honest, the on/off toggle is quite odd. If possible,
> > > > I’d prefer not to switch MGLRU or active/inactive
> > > > dynamically. Once it’s set up during system boot, it
> > > > should remain unchanged.
> > >
> > > While it is well-suited for Android environments, it is not viable for
> > > Kubernetes production servers, where rebooting is highly disruptive.
> > > This limitation is precisely why we need to introduce dynamic toggles.
> >
> > I agree with Barry, the switch isn't supposed to be a knob to be
> > turned on/off frequently. And I think in the long term we should just
> > identify the workloads where MGLRU doesn't work well, and fix MGLRU.
>
> The challenge we're currently facing is that we don't yet know which
> workloads would benefit from it ;)
> We do want to enable mglru on our production servers, but first we
> need to address the risk of OOM during the switch—that's exactly why
> we're proposing this patch.

Yes. I believe our long term target is to integrate the two LRU implementations.
But for now, it's important to keep this dynamic toggling feature and make it
robust and work well. So if users are willing to try the new LRU algorithm, they
are free to enable it after system boots for testing, and disable it if they run
into some trouble without worrying about OOM and other problems. Therefore, we
can have more users and potentially expose more problems related to MGLRU and
fix them.


On Mon, Mar 3, 2026 at 1:34 AM Barry Sone <21cnbao@gmail.com> wrote:
> 2. Ensure that shrinking and switching do not occur
> simultaneously by using something like an rwsem —
> shrinking can proceed in parallel under the read
> lock, while the (rare) switching path takes the
> write lock.

In my opinion, completely banning others from reclaming seems to demand more
than needed. We have many huge servers with services running in enourmous memcg.
In such case, waiting for the draining to complete may take so long (tens of
seconds for example) that the service get many timeout failures. But there's
high chance that reclaimers can still reclaim enough even if the draning is not
completed. So maybe we can have concurrent reclaming and state switch draining?



Regarding the discussion, I would like to propose a slightly different approach
that is already in use in production. The proposal mainly focuses on two
practical considerations:

1. State switching is a rare operation. So we should not penalize the normal
reclaim path or introduce more locks for this rare case.
2. We should avoid introducing long latency spikes during production state
transitions (e.g., switching on live machines).

The downstream solution is very similar to a combination of all your proposals,
but with some radical attempts to try to avoid sleeping therefore reduce lags
from waiting as much as possible. At the same time, we keep a last-chance wait
to prevent early OOMs from happening.

We use a static key to indicate that the state change is in progress. All
operations are encapsulated in that slow path so no extra overhead for the
normal path. If we are in the draining process, we first try reclaiming from
where lrugen->enabled says we are, if we still have a lot of retry times left.
With no retries left, simply wait until the we pass the race window.

--
Thanks
Bingfang

--
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 614ccf39fe3f..d7ff7a6ed088 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2652,6 +2652,43 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
 
 #ifdef CONFIG_LRU_GEN
 
+DEFINE_STATIC_KEY_FALSE(lru_gen_draining);
+
+static inline bool lru_gen_is_draining(void)
+{
+       return static_branch_unlikely(&lru_gen_draining);
+}
+
+/*
+ * Lazily wait for the draining thread to finish if it's running.
+ *
+ * Return: whether we'd like to reclaim from multi-gen LRU.
+ */
+static inline bool lru_gen_draining_wait(struct lruvec *lruvec, struct scan_control *sc)
+{
+       bool global_enabled = lru_gen_enabled();
+
+       /* Try reclaiming from the current LRU first */
+       if (sc->priority > DEF_PRIORITY / 2)
+               return READ_ONCE(lruvec->lrugen.enabled);
+
+       /* Oops, try from the other side... */
+       if (sc->priority > 1)
+               return global_enabled;
+
+       /*
+        * If we see lrugen.enabled is consistent here, when we get the lru
+        * spinlock, the migrating thread will have filled the lruvec with some
+        * pages, so we can continue without waiting.
+        */
+       while (global_enabled ^ READ_ONCE(lruvec->lrugen.enabled)) {
+               /* Not switching this one yet. Wait for a while. */
+               schedule_timeout_uninterruptible(1);
+       }
+
+       return global_enabled;
+}
+
 #ifdef CONFIG_LRU_GEN_ENABLED
 DEFINE_STATIC_KEY_ARRAY_TRUE(lru_gen_caps, NR_LRU_GEN_CAPS);
 #define get_cap(cap)   static_branch_likely(&lru_gen_caps[cap])
@@ -5171,6 +5208,8 @@ static void lru_gen_change_state(bool enabled)
        if (enabled == lru_gen_enabled())
                goto unlock;
 
+       static_branch_enable_cpuslocked(&lru_gen_draining);
+
        if (enabled)
                static_branch_enable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]);
        else
@@ -5201,6 +5240,9 @@ static void lru_gen_change_state(bool enabled)
 
                cond_resched();
        } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+       static_branch_disable_cpuslocked(&lru_gen_draining);
+
 unlock:
        mutex_unlock(&state_mutex);
        put_online_mems();
@@ -5752,6 +5794,16 @@ late_initcall(init_lru_gen);
 
 #else /* !CONFIG_LRU_GEN */
 
+static inline bool lru_gen_is_draining(void)
+{
+       return false;
+}
+
+static inline bool shrink_lruvec_draining(struct lruvec *lruvec, struct scan_control *sc)
+{
+       return false;
+}
+
 static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 {
        BUILD_BUG();
@@ -5780,7 +5832,10 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
        bool proportional_reclaim;
        struct blk_plug plug;
 
-       if (lru_gen_enabled() && !root_reclaim(sc)) {
+       if (lru_gen_is_draining() && lru_gen_draining_wait(lruvec, sc)) {
+               lru_gen_shrink_lruvec(lruvec, sc);
+               return;
+       } else if (lru_gen_enabled() && !root_reclaim(sc)) {
                lru_gen_shrink_lruvec(lruvec, sc);
                return;
        }

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2026-03-03  8:28 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-28 16:10 [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching Leno Hou
2026-02-28 18:58 ` Andrew Morton
2026-02-28 19:12 ` kernel test robot
2026-02-28 19:23 ` kernel test robot
2026-02-28 20:15 ` kernel test robot
2026-02-28 21:28 ` Barry Song
2026-02-28 22:41   ` Barry Song
2026-03-01  4:10     ` Barry Song
2026-03-02  5:50   ` Yafang Shao
2026-03-02  6:58     ` Barry Song
2026-03-02  7:43       ` Yafang Shao
2026-03-02  8:00         ` Kairui Song
2026-03-02  8:15           ` Barry Song
2026-03-02  8:25           ` Yafang Shao
2026-03-02  9:20             ` Barry Song
2026-03-02  9:47               ` Kairui Song
2026-03-02 14:35                 ` Yafang Shao
2026-03-02 17:51                   ` Yuanchu Xie
2026-03-03  1:34                     ` Barry Song
2026-03-03  1:40                       ` Axel Rasmussen
2026-03-03  2:43                         ` Yafang Shao
2026-03-03  8:27                           ` Bingfang Guo
2026-03-02 16:26           ` Michal Hocko
2026-03-02  8:03         ` Barry Song
2026-03-02  8:13           ` Yafang Shao
2026-03-02  8:20             ` Barry Song
2026-03-03  6:37 Bingfang Guo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox