From: Barry Song <21cnbao@gmail.com>
To: lenohou@gmail.com
Cc: 21cnbao@gmail.com, akpm@linux-foundation.org,
axelrasmussen@google.com, laoar.shao@gmail.com,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
weixugc@google.com, wjl.linux@gmail.com, yuanchu@google.com,
yuzhao@google.com
Subject: Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
Date: Sun, 1 Mar 2026 05:28:37 +0800 [thread overview]
Message-ID: <20260228212837.59661-1-21cnbao@gmail.com> (raw)
In-Reply-To: <20260228161008.707-1-lenohou@gmail.com>
On Sun, Mar 1, 2026 at 12:10 AM Leno Hou <lenohou@gmail.com> wrote:
>
> When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
> condition exists between the state switching and the memory reclaim
> path. This can lead to unexpected cgroup OOM kills, even when plenty of
> reclaimable memory is available.
>
> *** Problem Description ***
>
> The issue arises from a "reclaim vacuum" during the transition:
>
> 1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
> false before the pages are drained from MGLRU lists back to
> traditional LRU lists.
> 2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
> and skip the MGLRU path.
> 3. However, these pages might not have reached the traditional LRU lists
> yet, or the changes are not yet visible to all CPUs due to a lack of
> synchronization.
> 4. get_scan_count() subsequently finds traditional LRU lists empty,
> concludes there is no reclaimable memory, and triggers an OOM kill.
>
> A similar race can occur during enablement, where the reclaimer sees
> the new state but the MGLRU lists haven't been populated via
> fill_evictable() yet.
>
> *** Solution ***
>
> Introduce a 'draining' state to bridge the gap during transitions:
>
> - Use smp_store_release() and smp_load_acquire() to ensure the visibility
> of 'enabled' and 'draining' flags across CPUs.
> - Modify shrink_lruvec() to allow a "joint reclaim" period. If an lruvec
> is in the 'draining' state, the reclaimer will attempt to scan MGLRU
> lists first, and then fall through to traditional LRU lists instead
> of returning early. This ensures that folios are visible to at least
> one reclaim path at any given time.
>
> *** Reproduction ***
>
> The issue was consistently reproduced on v6.1.157 and v6.18.3 using
> a high-pressure memory cgroup (v1) environment.
>
> Reproduction steps:
> 1. Create a 16GB memcg and populate it with 10GB file cache (5GB active)
> and 8GB active anonymous memory.
> 2. Toggle MGLRU state while performing new memory allocations to force
> direct reclaim.
>
> Reproduction script:
> ---
> #!/bin/bash
> # Fixed reproduction for memcg OOM during MGLRU toggle
> set -euo pipefail
>
> MGLRU_FILE="/sys/kernel/mm/lru_gen/enabled"
> CGROUP_PATH="/sys/fs/cgroup/memory/memcg_oom_test"
>
> # Switch MGLRU dynamically in the background
> switch_mglru() {
> local orig_val=$(cat "$MGLRU_FILE")
> if [[ "$orig_val" != "0x0000" ]]; then
> echo n > "$MGLRU_FILE" &
> else
> echo y > "$MGLRU_FILE" &
> fi
> }
>
> # Setup 16G memcg
> mkdir -p "$CGROUP_PATH"
> echo $((16 * 1024 * 1024 * 1024)) > "$CGROUP_PATH/memory.limit_in_bytes"
> echo $$ > "$CGROUP_PATH/cgroup.procs"
>
> # 1. Build memory pressure (File + Anon)
> dd if=/dev/urandom of=/tmp/test_file bs=1M count=10240
> dd if=/tmp/test_file of=/dev/null bs=1M # Warm up cache
>
> stress-ng --vm 1 --vm-bytes 8G --vm-keep -t 600 &
> sleep 5
>
> # 2. Trigger switch and concurrent allocation
> switch_mglru
> stress-ng --vm 1 --vm-bytes 2G --vm-populate --timeout 5s || echo "OOM Triggered"
>
> # Check OOM counter
> grep oom_kill "$CGROUP_PATH/memory.oom_control"
> ---
>
> Signed-off-by: Leno Hou <lenohou@gmail.com>
>
> ---
> To: linux-mm@kvack.org
> To: linux-kernel@vger.kernel.org
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Yuanchu Xie <yuanchu@google.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: Barry Song <21cnbao@gmail.com>
> Cc: Jialing Wang <wjl.linux@gmail.com>
> Cc: Yafang Shao <laoar.shao@gmail.com>
> Cc: Yu Zhao <yuzhao@google.com>
> ---
> include/linux/mmzone.h | 2 ++
> mm/vmscan.c | 14 +++++++++++---
> 2 files changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 7fb7331c5725..0648ce91dbc6 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -509,6 +509,8 @@ struct lru_gen_folio {
> atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
> /* whether the multi-gen LRU is enabled */
> bool enabled;
> + /* whether the multi-gen LRU is draining to LRU */
> + bool draining;
> /* the memcg generation this lru_gen_folio belongs to */
> u8 gen;
> /* the list segment this lru_gen_folio belongs to */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 06071995dacc..629a00681163 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -5222,7 +5222,8 @@ static void lru_gen_change_state(bool enabled)
> VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
> VM_WARN_ON_ONCE(!state_is_valid(lruvec));
>
> - lruvec->lrugen.enabled = enabled;
> + smp_store_release(&lruvec->lrugen.enabled, enabled);
> + smp_store_release(&lruvec->lrugen.draining, true);
>
> while (!(enabled ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
> spin_unlock_irq(&lruvec->lru_lock);
> @@ -5230,6 +5231,8 @@ static void lru_gen_change_state(bool enabled)
> spin_lock_irq(&lruvec->lru_lock);
> }
>
> + smp_store_release(&lruvec->lrugen.draining, false);
> +
> spin_unlock_irq(&lruvec->lru_lock);
> }
>
> @@ -5813,10 +5816,15 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> bool proportional_reclaim;
> struct blk_plug plug;
> + bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> + bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
>
> - if (lru_gen_enabled() && !root_reclaim(sc)) {
> + if (lrugen_enabled || lru_draining && !root_reclaim(sc)) {
> lru_gen_shrink_lruvec(lruvec, sc);
> - return;
Is it possible to simply wait for draining to finish instead of performing
an lru_gen/lru shrink while lru_gen is being disabled or enabled?
Performing a shrink in an intermediate state may still involve a lot of
uncertainty, depending on how far the shrink has progressed and how much
remains in each side’s LRU?
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3e51190a55e4..ba306e986050 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -509,6 +509,8 @@ struct lru_gen_folio {
atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
/* whether the multi-gen LRU is enabled */
bool enabled;
+ /* whether the multi-gen LRU is switching from/to active/inactive LRU */
+ bool switching;
/* the memcg generation this lru_gen_folio belongs to */
u8 gen;
/* the list segment this lru_gen_folio belongs to */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0fc9373e8251..60fc611067c7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5196,6 +5196,7 @@ static void lru_gen_change_state(bool enabled)
VM_WARN_ON_ONCE(!state_is_valid(lruvec));
lruvec->lrugen.enabled = enabled;
+ smp_store_release(&lruvec->lrugen.switching, true);
while (!(enabled ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
spin_unlock_irq(&lruvec->lru_lock);
@@ -5203,6 +5204,8 @@ static void lru_gen_change_state(bool enabled)
spin_lock_irq(&lruvec->lru_lock);
}
+ smp_store_release(&lruvec->lrugen.switching, false);
+
spin_unlock_irq(&lruvec->lru_lock);
}
@@ -5780,6 +5783,10 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
bool proportional_reclaim;
struct blk_plug plug;
+#ifdef CONFIG_LRU_GEN
+ while (smp_load_acquire(&lruvec->lrugen.switching))
+ schedule_timeout_uninterruptible(HZ/100);
+#endif
if (lru_gen_enabled() && !root_reclaim(sc)) {
lru_gen_shrink_lruvec(lruvec, sc);
return;
--
> +
> + if (!lru_draining)
> + return;
> +
> }
>
> get_scan_count(lruvec, sc, nr);
> --
> 2.52.0
>
Thanks
Barry
next prev parent reply other threads:[~2026-02-28 21:28 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-02-28 16:10 Leno Hou
2026-02-28 18:58 ` Andrew Morton
2026-02-28 19:12 ` kernel test robot
2026-02-28 19:23 ` kernel test robot
2026-02-28 20:15 ` kernel test robot
2026-02-28 21:28 ` Barry Song [this message]
2026-02-28 22:41 ` Barry Song
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260228212837.59661-1-21cnbao@gmail.com \
--to=21cnbao@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=axelrasmussen@google.com \
--cc=laoar.shao@gmail.com \
--cc=lenohou@gmail.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=weixugc@google.com \
--cc=wjl.linux@gmail.com \
--cc=yuanchu@google.com \
--cc=yuzhao@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox