linux-mm.kvack.org archive mirror
From: Yafang Shao <laoar.shao@gmail.com>
To: Barry Song <21cnbao@gmail.com>
Cc: lenohou@gmail.com, akpm@linux-foundation.org,
	axelrasmussen@google.com,  linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, weixugc@google.com,  wjl.linux@gmail.com,
	yuanchu@google.com, yuzhao@google.com
Subject: Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
Date: Mon, 2 Mar 2026 16:13:25 +0800	[thread overview]
Message-ID: <CALOAHbCtYoBLbbzX0DU2xuTeDehW__LiJfd16ZrUBG+TijfH_w@mail.gmail.com> (raw)
In-Reply-To: <CAGsJ_4zAY+1feahcVHhJ0NyKHxFZZHJpey1Qd=FAnSnHhh3Yyg@mail.gmail.com>

On Mon, Mar 2, 2026 at 4:04 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Mon, Mar 2, 2026 at 3:43 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Mon, Mar 2, 2026 at 2:58 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Mon, Mar 2, 2026 at 1:50 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Sun, Mar 1, 2026 at 5:28 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > On Sun, Mar 1, 2026 at 12:10 AM Leno Hou <lenohou@gmail.com> wrote:
> > > > > >
> > > > > > When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
> > > > > > condition exists between the state switching and the memory reclaim
> > > > > > path. This can lead to unexpected cgroup OOM kills, even when plenty of
> > > > > > reclaimable memory is available.
> > > > > >
> > > > > > *** Problem Description ***
> > > > > >
> > > > > > The issue arises from a "reclaim vacuum" during the transition:
> > > > > >
> > > > > > 1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
> > > > > >    false before the pages are drained from MGLRU lists back to
> > > > > >    traditional LRU lists.
> > > > > > 2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
> > > > > >    and skip the MGLRU path.
> > > > > > 3. However, these pages might not have reached the traditional LRU lists
> > > > > >    yet, or the changes are not yet visible to all CPUs due to a lack of
> > > > > >    synchronization.
> > > > > > 4. get_scan_count() subsequently finds traditional LRU lists empty,
> > > > > >    concludes there is no reclaimable memory, and triggers an OOM kill.
> > > > > >
> > > > > > A similar race can occur during enablement, where the reclaimer sees
> > > > > > the new state but the MGLRU lists haven't been populated via
> > > > > > fill_evictable() yet.
> > > > > >
> > > > > > *** Solution ***
> > > > > >
> > > > > > Introduce a 'draining' state to bridge the gap during transitions:
> > > > > >
> > > > > > - Use smp_store_release() and smp_load_acquire() to ensure the visibility
> > > > > >   of 'enabled' and 'draining' flags across CPUs.
> > > > > > - Modify shrink_lruvec() to allow a "joint reclaim" period. If an lruvec
> > > > > >   is in the 'draining' state, the reclaimer will attempt to scan MGLRU
> > > > > >   lists first, and then fall through to traditional LRU lists instead
> > > > > >   of returning early. This ensures that folios are visible to at least
> > > > > >   one reclaim path at any given time.
> > > > > >
> > > > > > *** Reproduction ***
> > > > > >
> > > > > > The issue was consistently reproduced on v6.1.157 and v6.18.3 using
> > > > > > a high-pressure memory cgroup (v1) environment.
> > > > > >
> > > > > > Reproduction steps:
> > > > > > 1. Create a 16GB memcg and populate it with 10GB file cache (5GB active)
> > > > > >    and 8GB active anonymous memory.
> > > > > > 2. Toggle MGLRU state while performing new memory allocations to force
> > > > > >    direct reclaim.
> > > > > >
> > > > > > Reproduction script:
> > > > > > ---
> > > > > > #!/bin/bash
> > > > > > # Fixed reproduction for memcg OOM during MGLRU toggle
> > > > > > set -euo pipefail
> > > > > >
> > > > > > MGLRU_FILE="/sys/kernel/mm/lru_gen/enabled"
> > > > > > CGROUP_PATH="/sys/fs/cgroup/memory/memcg_oom_test"
> > > > > >
> > > > > > # Switch MGLRU dynamically in the background
> > > > > > switch_mglru() {
> > > > > >     local orig_val=$(cat "$MGLRU_FILE")
> > > > > >     if [[ "$orig_val" != "0x0000" ]]; then
> > > > > >         echo n > "$MGLRU_FILE" &
> > > > > >     else
> > > > > >         echo y > "$MGLRU_FILE" &
> > > > > >     fi
> > > > > > }
> > > > > >
> > > > > > # Setup 16G memcg
> > > > > > mkdir -p "$CGROUP_PATH"
> > > > > > echo $((16 * 1024 * 1024 * 1024)) > "$CGROUP_PATH/memory.limit_in_bytes"
> > > > > > echo $$ > "$CGROUP_PATH/cgroup.procs"
> > > > > >
> > > > > > # 1. Build memory pressure (File + Anon)
> > > > > > dd if=/dev/urandom of=/tmp/test_file bs=1M count=10240
> > > > > > dd if=/tmp/test_file of=/dev/null bs=1M # Warm up cache
> > > > > >
> > > > > > stress-ng --vm 1 --vm-bytes 8G --vm-keep -t 600 &
> > > > > > sleep 5
> > > > > >
> > > > > > # 2. Trigger switch and concurrent allocation
> > > > > > switch_mglru
> > > > > > stress-ng --vm 1 --vm-bytes 2G --vm-populate --timeout 5s || echo "OOM Triggered"
> > > > > >
> > > > > > # Check OOM counter
> > > > > > grep oom_kill "$CGROUP_PATH/memory.oom_control"
> > > > > > ---
> > > > > >
> > > > > > Signed-off-by: Leno Hou <lenohou@gmail.com>
> > > > > >
> > > > > > ---
> > > > > > To: linux-mm@kvack.org
> > > > > > To: linux-kernel@vger.kernel.org
> > > > > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > > > > Cc: Axel Rasmussen <axelrasmussen@google.com>
> > > > > > Cc: Yuanchu Xie <yuanchu@google.com>
> > > > > > Cc: Wei Xu <weixugc@google.com>
> > > > > > Cc: Barry Song <21cnbao@gmail.com>
> > > > > > Cc: Jialing Wang <wjl.linux@gmail.com>
> > > > > > Cc: Yafang Shao <laoar.shao@gmail.com>
> > > > > > Cc: Yu Zhao <yuzhao@google.com>
> > > > > > ---
> > > > > >  include/linux/mmzone.h |  2 ++
> > > > > >  mm/vmscan.c            | 14 +++++++++++---
> > > > > >  2 files changed, 13 insertions(+), 3 deletions(-)
> > > > > >
> > > > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > > > > index 7fb7331c5725..0648ce91dbc6 100644
> > > > > > --- a/include/linux/mmzone.h
> > > > > > +++ b/include/linux/mmzone.h
> > > > > > @@ -509,6 +509,8 @@ struct lru_gen_folio {
> > > > > >         atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
> > > > > >         /* whether the multi-gen LRU is enabled */
> > > > > >         bool enabled;
> > > > > > +       /* whether the multi-gen LRU is draining to LRU */
> > > > > > +       bool draining;
> > > > > >         /* the memcg generation this lru_gen_folio belongs to */
> > > > > >         u8 gen;
> > > > > >         /* the list segment this lru_gen_folio belongs to */
> > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > > index 06071995dacc..629a00681163 100644
> > > > > > --- a/mm/vmscan.c
> > > > > > +++ b/mm/vmscan.c
> > > > > > @@ -5222,7 +5222,8 @@ static void lru_gen_change_state(bool enabled)
> > > > > >                         VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
> > > > > >                         VM_WARN_ON_ONCE(!state_is_valid(lruvec));
> > > > > >
> > > > > > -                       lruvec->lrugen.enabled = enabled;
> > > > > > +                       smp_store_release(&lruvec->lrugen.enabled, enabled);
> > > > > > +                       smp_store_release(&lruvec->lrugen.draining, true);
> > > > > >
> > > > > >                         while (!(enabled ? fill_evictable(lruvec) : drain_evictable(lruvec))) {
> > > > > >                                 spin_unlock_irq(&lruvec->lru_lock);
> > > > > > @@ -5230,6 +5231,8 @@ static void lru_gen_change_state(bool enabled)
> > > > > >                                 spin_lock_irq(&lruvec->lru_lock);
> > > > > >                         }
> > > > > >
> > > > > > +                       smp_store_release(&lruvec->lrugen.draining, false);
> > > > > > +
> > > > > >                         spin_unlock_irq(&lruvec->lru_lock);
> > > > > >                 }
> > > > > >
> > > > > > @@ -5813,10 +5816,15 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> > > > > >         unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> > > > > >         bool proportional_reclaim;
> > > > > >         struct blk_plug plug;
> > > > > > +       bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> > > > > > +       bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
> > > > > >
> > > > > > -       if (lru_gen_enabled() && !root_reclaim(sc)) {
> > > > > > +       if ((lrugen_enabled || lru_draining) && !root_reclaim(sc)) {
> > > > > >                 lru_gen_shrink_lruvec(lruvec, sc);
> > > > > > -               return;
> > > > >
> > > >
> > > > Hello Barry,
> > > >
> > > > > Is it possible to simply wait for draining to finish instead of performing
> > > > > an lru_gen/lru shrink while lru_gen is being disabled or enabled?
> > > >
> > > > This might introduce unexpected latency spikes during the waiting period.
> > >
> > > I assume latency is not a concern for a very rare
> > > MGLRU on/off case. Do you require the switch to happen
> > > with zero latency?
> > > My main concern is the correctness of the code.
> > >
> > > Now the proposed patch is:
> > >
> > > +       bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> > > +       bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
> > >
> > > Then choose MGLRU or active/inactive LRU based on
> > > those values.
> > >
> > > However, nothing prevents those values from changing
> > > after they are read. Even within the shrink path,
> > > they can still change.
> >
> > If these values change during reclaim, the currently running
> > reclaimer will continue operating with the old settings, while any
> > new reclaimers will pick up the new values. That should prevent any
> > immediate issues, but the primary risk of this lockless approach is
> > that a user could rapidly toggle the MGLRU feature while a previous
> > transition is still in its intermediate state.
> >
> > >
> > > So I think we need an rwsem or something similar here —
> > > a read lock for shrink and a write lock for on/off. The
> > > write lock should happen very rarely.
> >
> > We can introduce a lock-based mechanism in v2.
>
> Honestly, the on/off toggle is quite fragile. For instance,
>
> folio_check_references() is doing:
>
>         if (lru_gen_enabled()) {
>                 if (!referenced_ptes)
>                         return FOLIOREF_RECLAIM;
>
>                 return lru_gen_set_refs(folio) ? FOLIOREF_ACTIVATE :
> FOLIOREF_KEEP;
>         }
>
> However, `lru_gen_enabled()` does not indicate the actual LRU
> where the folio resides.
>
> `lru_gen_enabled()` is called in many places, but in this case it does
> not accurately reflect where folios are placed if a dynamic toggle is
> active. During the switching, many unexpected behaviors may occur.
>
> >
> > >
> > > >
> > > > >
> > > > > Performing a shrink in an intermediate state may still involve a lot of
> > > > > uncertainty, depending on how far the drain has progressed and how much
> > > > > remains on each side's LRU.
> > > >
> > > > The workingset might not be reliable in this intermediate state.
> > > > However, since switching MGLRU should not be a frequent operation in a
> > > > production environment, I believe the workingset in this intermediate
> > > > state should not be a concern. The only reason we would enable or
> > > > disable MGLRU is if we find that certain workloads benefit from
> > > > it—enabling it when it helps, and disabling it when it causes
> > > > degradation. There should be no other scenario in which we would need
> > > > to toggle MGLRU on or off.
> > > >
> > > > To identify which workloads can benefit from MGLRU, we must first
> > > > ensure that switching it on or off is safe—which is precisely why we
> > > > are proposing this patch. Once MGLRU is enabled in production, we can
> > > > continue to improve it. Perhaps in the future, we can even implement a
> > > > per-workload reclaim mechanism.
> > >
> > > To be honest, the on/off toggle is quite odd. If possible,
> > > I’d prefer not to switch MGLRU or active/inactive
> > > dynamically. Once it’s set up during system boot, it
> > > should remain unchanged.
> >
> > While it is well-suited for Android environments, it is not viable for
> > Kubernetes production servers, where rebooting is highly disruptive.
> > This limitation is precisely why we need to introduce dynamic toggles.
>
> Perhaps we really need to unify MGLRU with the active/inactive lists,
> combining the benefits of both approaches. The dynamic toggle, as it
> stands, is quite fragile.
> A topic was suggested by Kairui here [1].
>
> >
> > >
> > > If we want a per-workload LRU, this could be a good
> > > place for eBPF to hook into folio enqueue, dequeue,
> > > and scanning. There is a project related to this [1][2].
> > >
> > > // Policy function hooks
> > > struct cache_ext_ops {
> > >        s32 (*policy_init)(struct mem_cgroup *memcg);
> > >        // Propose folios to evict
> > >        void (*evict_folios)(struct eviction_ctx *ctx,
> > >                  struct mem_cgroup *memcg);
> > >        void (*folio_added)(struct folio *folio);
> > >        void (*folio_accessed)(struct folio *folio);
> > >        // Folio was removed: clean up metadata
> > >        void (*folio_removed)(struct folio *folio);
> > >        char name[CACHE_EXT_OPS_NAME_LEN];
> > > };
> > >
> > > However, we would need a very strong and convincing
> > > user case to justify it.
> >
> > Thanks for the info.
> > We're actually already running a BPF-based reclaimer in production,
> > but we don't have immediate plans to upstream or propose it just yet.
>
> I know you are always far ahead of everyone else. I’m looking forward
> to seeing your code and use cases when you are ready.

Please don't put it that way; it doesn't come across as cooperative.
We've only deployed a limited BPF-based memcg async reclaimer
internally, and it is currently scoped to our own workloads.

-- 
Regards
Yafang


Thread overview: 19+ messages
2026-02-28 16:10 Leno Hou
2026-02-28 18:58 ` Andrew Morton
2026-02-28 19:12 ` kernel test robot
2026-02-28 19:23 ` kernel test robot
2026-02-28 20:15 ` kernel test robot
2026-02-28 21:28 ` Barry Song
2026-02-28 22:41   ` Barry Song
2026-03-01  4:10     ` Barry Song
2026-03-02  5:50   ` Yafang Shao
2026-03-02  6:58     ` Barry Song
2026-03-02  7:43       ` Yafang Shao
2026-03-02  8:00         ` Kairui Song
2026-03-02  8:15           ` Barry Song
2026-03-02  8:25           ` Yafang Shao
2026-03-02  9:20             ` Barry Song
2026-03-02  9:47               ` Kairui Song
2026-03-02  8:03         ` Barry Song
2026-03-02  8:13           ` Yafang Shao [this message]
2026-03-02  8:20             ` Barry Song
