linux-mm.kvack.org archive mirror
* [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
@ 2026-02-28 16:10 Leno Hou
  2026-02-28 18:58 ` Andrew Morton
                   ` (4 more replies)
  0 siblings, 5 replies; 27+ messages in thread
From: Leno Hou @ 2026-02-28 16:10 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Leno Hou, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Barry Song, Jialing Wang, Yafang Shao, Yu Zhao

When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
condition exists between the state switching and the memory reclaim
path. This can lead to unexpected cgroup OOM kills, even when plenty of
reclaimable memory is available.

*** Problem Description ***

The issue arises from a "reclaim vacuum" during the transition:

1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
   false before the pages are drained from MGLRU lists back to
   traditional LRU lists.
2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
   and skip the MGLRU path.
3. However, these pages might not have reached the traditional LRU lists
   yet, or the changes are not yet visible to all CPUs due to a lack of
   synchronization.
4. get_scan_count() subsequently finds traditional LRU lists empty,
   concludes there is no reclaimable memory, and triggers an OOM kill.

A similar race can occur during enablement, where the reclaimer sees
the new state but the MGLRU lists haven't been populated via
fill_evictable() yet.

*** Solution ***

Introduce a 'draining' state to bridge the gap during transitions:

- Use smp_store_release() and smp_load_acquire() to ensure the visibility
  of 'enabled' and 'draining' flags across CPUs.
- Modify shrink_lruvec() to allow a "joint reclaim" period. If an lruvec
  is in the 'draining' state, the reclaimer will attempt to scan MGLRU
  lists first, and then fall through to traditional LRU lists instead
  of returning early. This ensures that folios are visible to at least
  one reclaim path at any given time.

*** Reproduction ***

The issue was consistently reproduced on v6.1.157 and v6.18.3 using
a high-pressure memory cgroup (v1) environment.

Reproduction steps:
1. Create a 16GB memcg and populate it with 10GB file cache (5GB active)
   and 8GB active anonymous memory.
2. Toggle MGLRU state while performing new memory allocations to force
   direct reclaim.

Reproduction script:
---
#!/bin/bash
# Fixed reproduction for memcg OOM during MGLRU toggle
set -euo pipefail

MGLRU_FILE="/sys/kernel/mm/lru_gen/enabled"
CGROUP_PATH="/sys/fs/cgroup/memory/memcg_oom_test"

# Switch MGLRU dynamically in the background
switch_mglru() {
    local orig_val
    orig_val=$(cat "$MGLRU_FILE")
    if [[ "$orig_val" != "0x0000" ]]; then
        echo n > "$MGLRU_FILE" &
    else
        echo y > "$MGLRU_FILE" &
    fi
}

# Setup 16G memcg
mkdir -p "$CGROUP_PATH"
echo $((16 * 1024 * 1024 * 1024)) > "$CGROUP_PATH/memory.limit_in_bytes"
echo $$ > "$CGROUP_PATH/cgroup.procs"

# 1. Build memory pressure (File + Anon)
dd if=/dev/urandom of=/tmp/test_file bs=1M count=10240
dd if=/tmp/test_file of=/dev/null bs=1M # Warm up cache

stress-ng --vm 1 --vm-bytes 8G --vm-keep -t 600 &
sleep 5

# 2. Trigger switch and concurrent allocation
switch_mglru
stress-ng --vm 1 --vm-bytes 2G --vm-populate --timeout 5s || echo "OOM Triggered"

# Check OOM counter
grep oom_kill "$CGROUP_PATH/memory.oom_control"
---

Signed-off-by: Leno Hou <lenohou@gmail.com>

---
To: linux-mm@kvack.org
To: linux-kernel@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Barry Song <21cnbao@gmail.com>
Cc: Jialing Wang <wjl.linux@gmail.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
---
 include/linux/mmzone.h |  2 ++
 mm/vmscan.c            | 14 +++++++++++---
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7fb7331c5725..0648ce91dbc6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -509,6 +509,8 @@ struct lru_gen_folio {
 	atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
 	/* whether the multi-gen LRU is enabled */
 	bool enabled;
+	/* whether a state switch is moving folios to/from the MGLRU lists */
+	bool draining;
 	/* the memcg generation this lru_gen_folio belongs to */
 	u8 gen;
 	/* the list segment this lru_gen_folio belongs to */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 06071995dacc..629a00681163 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5222,7 +5222,8 @@ static void lru_gen_change_state(bool enabled)
 			VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
 			VM_WARN_ON_ONCE(!state_is_valid(lruvec));
 
-			lruvec->lrugen.enabled = enabled;
+			smp_store_release(&lruvec->lrugen.draining, true);
+			smp_store_release(&lruvec->lrugen.enabled, enabled);
 
 			while (!(enabled ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
 				spin_unlock_irq(&lruvec->lru_lock);
@@ -5230,6 +5231,8 @@ static void lru_gen_change_state(bool enabled)
 				spin_lock_irq(&lruvec->lru_lock);
 			}
 
+			smp_store_release(&lruvec->lrugen.draining, false);
+
 			spin_unlock_irq(&lruvec->lru_lock);
 		}
 
@@ -5813,10 +5816,15 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
 	bool proportional_reclaim;
 	struct blk_plug plug;
+	bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
+	bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
 
-	if (lru_gen_enabled() && !root_reclaim(sc)) {
+	if ((lrugen_enabled || lru_draining) && !root_reclaim(sc)) {
 		lru_gen_shrink_lruvec(lruvec, sc);
-		return;
+
+		if (!lru_draining)
+			return;
+
 	}
 
 	get_scan_count(lruvec, sc, nr);
-- 
2.52.0



* Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
@ 2026-03-03  6:37 Bingfang Guo
  0 siblings, 0 replies; 27+ messages in thread
From: Bingfang Guo @ 2026-03-03  6:37 UTC (permalink / raw)
  To: laoar.shao
  Cc: 21cnbao, akpm, axelrasmussen, BINGFANG GUO, lenohou,
	linux-kernel, linux-mm, ryncsn, weixugc, wjl.linux, yuanchu,
	yuzhao

Hi all. Thanks for inviting me to the discussion. I'm glad to join and share
my ideas and findings with you.

On Mon, Mar 2, 2026 at 4:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Mon, Mar 2, 2026 at 4:00 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Mon, Mar 2, 2026 at 3:43 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Mon, Mar 2, 2026 at 2:58 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > I assume latency is not a concern for a very rare
> > > > MGLRU on/off case. Do you require the switch to happen
> > > > with zero latency?
> > > > My main concern is the correctness of the code.
> > > >
> > > > Now the proposed patch is:
> > > >
> > > > +       bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> > > > +       bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
> > > >
> > > > Then choose MGLRU or active/inactive LRU based on
> > > > those values.
> > > >
> > > > However, nothing prevents those values from changing
> > > > after they are read. Even within the shrink path,
> > > > they can still change.
> >
> > Hi all,
> >
> > > If these values are changed during reclaim, the currently running
> > > reclaimer will continue to operate with the old settings, while any
> > > new reclaimer processes will adopt the new values. This approach
> > > should prevent any immediate issues, but the primary risk of this
> > > lockless method is the potential for a user to rapidly toggle the
> > > MGLRU feature, particularly during an intermediate state.
> > >
> > > >
> > > > So I think we need an rwsem or something similar here —
> > > > a read lock for shrink and a write lock for on/off. The
> > > > write lock should happen very rarely.
> > >
> > > We can introduce a lock-based mechanism in v2.
> >
> > I hope we don't need a lock here. Currently there is only a static
> > key, this patch is already adding more branches, a lock will make
> > things more complex and the shrinking path is quite performance
> > sensitive.
> >
> > > >
> > > > To be honest, the on/off toggle is quite odd. If possible,
> > > > I’d prefer not to switch MGLRU or active/inactive
> > > > dynamically. Once it’s set up during system boot, it
> > > > should remain unchanged.
> > >
> > > While it is well-suited for Android environments, it is not viable for
> > > Kubernetes production servers, where rebooting is highly disruptive.
> > > This limitation is precisely why we need to introduce dynamic toggles.
> >
> > I agree with Barry, the switch isn't supposed to be a knob to be
> > turned on/off frequently. And I think in the long term we should just
> > identify the workloads where MGLRU doesn't work well, and fix MGLRU.
>
> The challenge we're currently facing is that we don't yet know which
> workloads would benefit from it ;)
> We do want to enable mglru on our production servers, but first we
> need to address the risk of OOM during the switch—that's exactly why
> we're proposing this patch.

Yes. I believe our long term target is to integrate the two LRU implementations.
But for now, it's important to keep this dynamic toggling feature and make it
robust and work well. So if users are willing to try the new LRU algorithm, they
are free to enable it after system boots for testing, and disable it if they run
into some trouble without worrying about OOM and other problems. Therefore, we
can have more users and potentially expose more problems related to MGLRU and
fix them.


On Mon, Mar 3, 2026 at 1:34 AM Barry Song <21cnbao@gmail.com> wrote:
> 2. Ensure that shrinking and switching do not occur
> simultaneously by using something like an rwsem —
> shrinking can proceed in parallel under the read
> lock, while the (rare) switching path takes the
> write lock.

In my opinion, completely blocking other reclaimers demands more than is
needed. We have many huge servers with services running in enormous memcgs.
In such cases, waiting for the draining to complete may take so long (tens of
seconds, for example) that the service hits many timeout failures. But there
is a high chance that reclaimers can still reclaim enough even if the draining
has not completed. So maybe we can allow reclaiming to proceed concurrently
with the state-switch draining?



Regarding the discussion, I would like to propose a slightly different approach
that is already in use in production. It is driven by two practical
considerations:

1. State switching is a rare operation, so we should not penalize the normal
reclaim path or introduce extra locks for this rare case.
2. We should avoid long latency spikes during production state transitions
(e.g., switching on live machines).

The downstream solution is similar to a combination of all your proposals, but
it tries hard to avoid sleeping, so as to reduce lag from waiting as much as
possible. At the same time, it keeps a last-chance wait to prevent early OOMs.

We use a static key to indicate that a state change is in progress; everything
is encapsulated in that slow path, so there is no extra overhead on the normal
path. While the draining is in progress, we first try reclaiming from wherever
lrugen->enabled says we are, as long as plenty of retries remain. With no
retries left, we simply wait until we pass the race window.

--
Thanks
Bingfang

--
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 614ccf39fe3f..d7ff7a6ed088 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2652,6 +2652,43 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
 
 #ifdef CONFIG_LRU_GEN
 
+DEFINE_STATIC_KEY_FALSE(lru_gen_draining);
+
+static inline bool lru_gen_is_draining(void)
+{
+       return static_branch_unlikely(&lru_gen_draining);
+}
+
+/*
+ * Lazily wait for the draining thread to finish if it's running.
+ *
+ * Return: whether we'd like to reclaim from multi-gen LRU.
+ */
+static inline bool lru_gen_draining_wait(struct lruvec *lruvec, struct scan_control *sc)
+{
+       bool global_enabled = lru_gen_enabled();
+
+       /* Try reclaiming from the current LRU first */
+       if (sc->priority > DEF_PRIORITY / 2)
+               return READ_ONCE(lruvec->lrugen.enabled);
+
+       /* Oops, try from the other side... */
+       if (sc->priority > 1)
+               return global_enabled;
+
+       /*
+        * If we see lrugen.enabled is consistent here, when we get the lru
+        * spinlock, the migrating thread will have filled the lruvec with some
+        * pages, so we can continue without waiting.
+        */
+       while (global_enabled ^ READ_ONCE(lruvec->lrugen.enabled)) {
+               /* Not switching this one yet. Wait for a while. */
+               schedule_timeout_uninterruptible(1);
+       }
+
+       return global_enabled;
+}
+
 #ifdef CONFIG_LRU_GEN_ENABLED
 DEFINE_STATIC_KEY_ARRAY_TRUE(lru_gen_caps, NR_LRU_GEN_CAPS);
 #define get_cap(cap)   static_branch_likely(&lru_gen_caps[cap])
@@ -5171,6 +5208,8 @@ static void lru_gen_change_state(bool enabled)
        if (enabled == lru_gen_enabled())
                goto unlock;
 
+       static_branch_enable_cpuslocked(&lru_gen_draining);
+
        if (enabled)
                static_branch_enable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]);
        else
@@ -5201,6 +5240,9 @@ static void lru_gen_change_state(bool enabled)
 
                cond_resched();
        } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+       static_branch_disable_cpuslocked(&lru_gen_draining);
+
 unlock:
        mutex_unlock(&state_mutex);
        put_online_mems();
@@ -5752,6 +5794,16 @@ late_initcall(init_lru_gen);
 
 #else /* !CONFIG_LRU_GEN */
 
+static inline bool lru_gen_is_draining(void)
+{
+       return false;
+}
+
+static inline bool lru_gen_draining_wait(struct lruvec *lruvec, struct scan_control *sc)
+{
+       return false;
+}
+
 static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 {
        BUILD_BUG();
@@ -5780,7 +5832,10 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
        bool proportional_reclaim;
        struct blk_plug plug;
 
-       if (lru_gen_enabled() && !root_reclaim(sc)) {
+       if (lru_gen_is_draining() && lru_gen_draining_wait(lruvec, sc)) {
+               lru_gen_shrink_lruvec(lruvec, sc);
+               return;
+       } else if (lru_gen_enabled() && !root_reclaim(sc)) {
                lru_gen_shrink_lruvec(lruvec, sc);
                return;
        }


end of thread, other threads:[~2026-03-03  8:28 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-28 16:10 [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching Leno Hou
2026-02-28 18:58 ` Andrew Morton
2026-02-28 19:12 ` kernel test robot
2026-02-28 19:23 ` kernel test robot
2026-02-28 20:15 ` kernel test robot
2026-02-28 21:28 ` Barry Song
2026-02-28 22:41   ` Barry Song
2026-03-01  4:10     ` Barry Song
2026-03-02  5:50   ` Yafang Shao
2026-03-02  6:58     ` Barry Song
2026-03-02  7:43       ` Yafang Shao
2026-03-02  8:00         ` Kairui Song
2026-03-02  8:15           ` Barry Song
2026-03-02  8:25           ` Yafang Shao
2026-03-02  9:20             ` Barry Song
2026-03-02  9:47               ` Kairui Song
2026-03-02 14:35                 ` Yafang Shao
2026-03-02 17:51                   ` Yuanchu Xie
2026-03-03  1:34                     ` Barry Song
2026-03-03  1:40                       ` Axel Rasmussen
2026-03-03  2:43                         ` Yafang Shao
2026-03-03  8:27                           ` Bingfang Guo
2026-03-02 16:26           ` Michal Hocko
2026-03-02  8:03         ` Barry Song
2026-03-02  8:13           ` Yafang Shao
2026-03-02  8:20             ` Barry Song
2026-03-03  6:37 Bingfang Guo
