linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Yafang Shao <laoar.shao@gmail.com>
To: Barry Song <21cnbao@gmail.com>
Cc: lenohou@gmail.com, akpm@linux-foundation.org,
	axelrasmussen@google.com,  linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, weixugc@google.com,  wjl.linux@gmail.com,
	yuanchu@google.com, yuzhao@google.com
Subject: Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
Date: Mon, 2 Mar 2026 13:50:17 +0800	[thread overview]
Message-ID: <CALOAHbBKr5ni=Ap2ASq2hx041WAghd+CzqbXGBSFPExBMJzcUg@mail.gmail.com> (raw)
In-Reply-To: <20260228212837.59661-1-21cnbao@gmail.com>

On Sun, Mar 1, 2026 at 5:28 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sun, Mar 1, 2026 at 12:10 AM Leno Hou <lenohou@gmail.com> wrote:
> >
> > When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
> > condition exists between the state switching and the memory reclaim
> > path. This can lead to unexpected cgroup OOM kills, even when plenty of
> > reclaimable memory is available.
> >
> > *** Problem Description ***
> >
> > The issue arises from a "reclaim vacuum" during the transition:
> >
> > 1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
> >    false before the pages are drained from MGLRU lists back to
> >    traditional LRU lists.
> > 2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
> >    and skip the MGLRU path.
> > 3. However, these pages might not have reached the traditional LRU lists
> >    yet, or the changes are not yet visible to all CPUs due to a lack of
> >    synchronization.
> > 4. get_scan_count() subsequently finds traditional LRU lists empty,
> >    concludes there is no reclaimable memory, and triggers an OOM kill.
> >
> > A similar race can occur during enablement, where the reclaimer sees
> > the new state but the MGLRU lists haven't been populated via
> > fill_evictable() yet.
> >
> > *** Solution ***
> >
> > Introduce a 'draining' state to bridge the gap during transitions:
> >
> > - Use smp_store_release() and smp_load_acquire() to ensure the visibility
> >   of 'enabled' and 'draining' flags across CPUs.
> > - Modify shrink_lruvec() to allow a "joint reclaim" period. If an lruvec
> >   is in the 'draining' state, the reclaimer will attempt to scan MGLRU
> >   lists first, and then fall through to traditional LRU lists instead
> >   of returning early. This ensures that folios are visible to at least
> >   one reclaim path at any given time.
> >
> > *** Reproduction ***
> >
> > The issue was consistently reproduced on v6.1.157 and v6.18.3 using
> > a high-pressure memory cgroup (v1) environment.
> >
> > Reproduction steps:
> > 1. Create a 16GB memcg and populate it with 10GB file cache (5GB active)
> >    and 8GB active anonymous memory.
> > 2. Toggle MGLRU state while performing new memory allocations to force
> >    direct reclaim.
> >
> > Reproduction script:
> > ---
> > #!/bin/bash
> > # Fixed reproduction for memcg OOM during MGLRU toggle
> > set -euo pipefail
> >
> > MGLRU_FILE="/sys/kernel/mm/lru_gen/enabled"
> > CGROUP_PATH="/sys/fs/cgroup/memory/memcg_oom_test"
> >
> > # Switch MGLRU dynamically in the background
> > switch_mglru() {
> >     local orig_val=$(cat "$MGLRU_FILE")
> >     if [[ "$orig_val" != "0x0000" ]]; then
> >         echo n > "$MGLRU_FILE" &
> >     else
> >         echo y > "$MGLRU_FILE" &
> >     fi
> > }
> >
> > # Setup 16G memcg
> > mkdir -p "$CGROUP_PATH"
> > echo $((16 * 1024 * 1024 * 1024)) > "$CGROUP_PATH/memory.limit_in_bytes"
> > echo $$ > "$CGROUP_PATH/cgroup.procs"
> >
> > # 1. Build memory pressure (File + Anon)
> > dd if=/dev/urandom of=/tmp/test_file bs=1M count=10240
> > dd if=/tmp/test_file of=/dev/null bs=1M # Warm up cache
> >
> > stress-ng --vm 1 --vm-bytes 8G --vm-keep -t 600 &
> > sleep 5
> >
> > # 2. Trigger switch and concurrent allocation
> > switch_mglru
> > stress-ng --vm 1 --vm-bytes 2G --vm-populate --timeout 5s || echo "OOM Triggered"
> >
> > # Check OOM counter
> > grep oom_kill "$CGROUP_PATH/memory.oom_control"
> > ---
> >
> > Signed-off-by: Leno Hou <lenohou@gmail.com>
> >
> > ---
> > To: linux-mm@kvack.org
> > To: linux-kernel@vger.kernel.org
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Axel Rasmussen <axelrasmussen@google.com>
> > Cc: Yuanchu Xie <yuanchu@google.com>
> > Cc: Wei Xu <weixugc@google.com>
> > Cc: Barry Song <21cnbao@gmail.com>
> > Cc: Jialing Wang <wjl.linux@gmail.com>
> > Cc: Yafang Shao <laoar.shao@gmail.com>
> > Cc: Yu Zhao <yuzhao@google.com>
> > ---
> >  include/linux/mmzone.h |  2 ++
> >  mm/vmscan.c            | 14 +++++++++++---
> >  2 files changed, 13 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 7fb7331c5725..0648ce91dbc6 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -509,6 +509,8 @@ struct lru_gen_folio {
> >         atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
> >         /* whether the multi-gen LRU is enabled */
> >         bool enabled;
> > +       /* whether the multi-gen LRU is draining to LRU */
> > +       bool draining;
> >         /* the memcg generation this lru_gen_folio belongs to */
> >         u8 gen;
> >         /* the list segment this lru_gen_folio belongs to */
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 06071995dacc..629a00681163 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -5222,7 +5222,8 @@ static void lru_gen_change_state(bool enabled)
> >                         VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
> >                         VM_WARN_ON_ONCE(!state_is_valid(lruvec));
> >
> > -                       lruvec->lrugen.enabled = enabled;
> > +                       smp_store_release(&lruvec->lrugen.enabled, enabled);
> > +                       smp_store_release(&lruvec->lrugen.draining, true);
> >
> >                         while (!(enabled ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
> >                                 spin_unlock_irq(&lruvec->lru_lock);
> > @@ -5230,6 +5231,8 @@ static void lru_gen_change_state(bool enabled)
> >                                 spin_lock_irq(&lruvec->lru_lock);
> >                         }
> >
> > +                       smp_store_release(&lruvec->lrugen.draining, false);
> > +
> >                         spin_unlock_irq(&lruvec->lru_lock);
> >                 }
> >
> > @@ -5813,10 +5816,15 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> >         unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> >         bool proportional_reclaim;
> >         struct blk_plug plug;
> > +       bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> > +       bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
> >
> > -       if (lru_gen_enabled() && !root_reclaim(sc)) {
> > +       if (lrugen_enabled || lru_draining && !root_reclaim(sc)) {
> >                 lru_gen_shrink_lruvec(lruvec, sc);
> > -               return;
>

Hello Barry,

> Is it possible to simply wait for draining to finish instead of performing
> an lru_gen/lru shrink while lru_gen is being disabled or enabled?

This might introduce unexpected latency spikes during the waiting period.

>
> Performing a shrink in an intermediate state may still involve a lot of
> uncertainty, depending on how far the shrink has progressed and how much
> remains in each side’s LRU?

The workingset might not be reliable in this intermediate state.
However, since switching MGLRU should not be a frequent operation in a
production environment, I believe the workingset in this intermediate
state should not be a concern. The only reason we would enable or
disable MGLRU is if we find that certain workloads benefit from
it—enabling it when it helps, and disabling it when it causes
degradation. There should be no other scenario in which we would need
to toggle MGLRU on or off.

To identify which workloads can benefit from MGLRU, we must first
ensure that switching it on or off is safe—which is precisely why we
are proposing this patch. Once MGLRU is enabled in production, we can
continue to improve it. Perhaps in the future, we can even implement a
per-workload reclaim mechanism.


--
Regards
Yafang


  parent reply	other threads:[~2026-03-02  5:50 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-28 16:10 Leno Hou
2026-02-28 18:58 ` Andrew Morton
2026-02-28 19:12 ` kernel test robot
2026-02-28 19:23 ` kernel test robot
2026-02-28 20:15 ` kernel test robot
2026-02-28 21:28 ` Barry Song
2026-02-28 22:41   ` Barry Song
2026-03-01  4:10     ` Barry Song
2026-03-02  5:50   ` Yafang Shao [this message]
2026-03-02  6:58     ` Barry Song
2026-03-02  7:43       ` Yafang Shao
2026-03-02  8:00         ` Kairui Song
2026-03-02  8:15           ` Barry Song
2026-03-02  8:25           ` Yafang Shao
2026-03-02  9:20             ` Barry Song
2026-03-02  9:47               ` Kairui Song
2026-03-02  8:03         ` Barry Song
2026-03-02  8:13           ` Yafang Shao
2026-03-02  8:20             ` Barry Song

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CALOAHbBKr5ni=Ap2ASq2hx041WAghd+CzqbXGBSFPExBMJzcUg@mail.gmail.com' \
    --to=laoar.shao@gmail.com \
    --cc=21cnbao@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=axelrasmussen@google.com \
    --cc=lenohou@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=weixugc@google.com \
    --cc=wjl.linux@gmail.com \
    --cc=yuanchu@google.com \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox