linux-mm.kvack.org archive mirror
From: Bingfang Guo <bfguo@icloud.com>
To: laoar.shao@gmail.com
Cc: 21cnbao@gmail.com, akpm@linux-foundation.org,
	axelrasmussen@google.com, BINGFANG GUO <bingfangguo@tencent.com>,
	lenohou@gmail.com, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, ryncsn@gmail.com, weixugc@google.com,
	wjl.linux@gmail.com, yuanchu@google.com, yuzhao@google.com
Subject: Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
Date: Tue, 3 Mar 2026 14:37:27 +0800	[thread overview]
Message-ID: <76ADAAC7-7616-4D84-9EF5-32DE7B350B1B@icloud.com> (raw)

Hi all. Thanks for inviting me to the discussion. I'm glad to join and share my
ideas and findings with you.

On Mon, Mar 2, 2026 at 4:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Mon, Mar 2, 2026 at 4:00 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Mon, Mar 2, 2026 at 3:43 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Mon, Mar 2, 2026 at 2:58 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > I assume latency is not a concern for a very rare
> > > > MGLRU on/off case. Do you require the switch to happen
> > > > with zero latency?
> > > > My main concern is the correctness of the code.
> > > >
> > > > Now the proposed patch is:
> > > >
> > > > +       bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> > > > +       bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
> > > >
> > > > Then choose MGLRU or active/inactive LRU based on
> > > > those values.
> > > >
> > > > However, nothing prevents those values from changing
> > > > after they are read. Even within the shrink path,
> > > > they can still change.
> >
> > Hi all,
> >
> > > If these values are changed during reclaim, the currently running
> > > reclaimer will continue to operate with the old settings, while any
> > > new reclaimer processes will adopt the new values. This approach
> > > should prevent any immediate issues, but the primary risk of this
> > > lockless method is the potential for a user to rapidly toggle the
> > > MGLRU feature, particularly during an intermediate state.
> > >
> > > >
> > > > So I think we need an rwsem or something similar here —
> > > > a read lock for shrink and a write lock for on/off. The
> > > > write lock should happen very rarely.
> > >
> > > We can introduce a lock-based mechanism in v2.
> >
> > I hope we don't need a lock here. Currently there is only a static
> > key, this patch is already adding more branches, a lock will make
> > things more complex and the shrinking path is quite performance
> > sensitive.
> >
> > > >
> > > > To be honest, the on/off toggle is quite odd. If possible,
> > > > I’d prefer not to switch MGLRU or active/inactive
> > > > dynamically. Once it’s set up during system boot, it
> > > > should remain unchanged.
> > >
> > > While it is well-suited for Android environments, it is not viable for
> > > Kubernetes production servers, where rebooting is highly disruptive.
> > > This limitation is precisely why we need to introduce dynamic toggles.
> >
> > I agree with Barry, the switch isn't supposed to be a knob to be
> > turned on/off frequently. And I think in the long term we should just
> > identify the workloads where MGLRU doesn't work well, and fix MGLRU.
>
> The challenge we're currently facing is that we don't yet know which
> workloads would benefit from it ;)
> We do want to enable mglru on our production servers, but first we
> need to address the risk of OOM during the switch—that's exactly why
> we're proposing this patch.

Yes. I believe our long-term goal is to integrate the two LRU implementations.
But for now, it's important to keep this dynamic toggling feature and make it
robust. That way, users who are willing to try the new LRU algorithm are free to
enable it after the system boots for testing, and to disable it if they run into
trouble, without worrying about OOM or other problems. We can then attract more
users, expose more MGLRU-related problems, and fix them.


On Mon, Mar 3, 2026 at 1:34 AM Barry Song <21cnbao@gmail.com> wrote:
> 2. Ensure that shrinking and switching do not occur
> simultaneously by using something like an rwsem —
> shrinking can proceed in parallel under the read
> lock, while the (rare) switching path takes the
> write lock.

In my opinion, completely blocking other reclaimers demands more than is needed.
We have many huge servers with services running in enormous memcgs. In such
cases, waiting for the draining to complete may take so long (tens of seconds,
for example) that the services hit many timeout failures. But there's a high
chance that reclaimers can still reclaim enough memory even if the draining has
not completed. So maybe we can allow reclaim to run concurrently with the
state-switch draining?



Regarding the discussion, I would like to propose a slightly different approach
that is already in use in production. The proposal mainly focuses on two
practical considerations:

1. State switching is a rare operation. So we should not penalize the normal
reclaim path or introduce more locks for this rare case.
2. We should avoid introducing long latency spikes during production state
transitions (e.g., switching on live machines).

The downstream solution combines elements of all your proposals, but makes a
more aggressive attempt to avoid sleeping and therefore reduce waiting-induced
lag as much as possible. At the same time, we keep a last-chance wait to
prevent premature OOMs.

We use a static key to indicate that a state change is in progress. All
operations are encapsulated in that slow path, so there is no extra overhead on
the normal path. If we are in the draining process, we first try reclaiming
from where lrugen->enabled says we are, as long as plenty of retries are left.
With no retries left, we simply wait until we pass the race window.

--
Thanks
Bingfang

--
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 614ccf39fe3f..d7ff7a6ed088 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2652,6 +2652,43 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
 
 #ifdef CONFIG_LRU_GEN
 
+DEFINE_STATIC_KEY_FALSE(lru_gen_draining);
+
+static inline bool lru_gen_is_draining(void)
+{
+       return static_branch_unlikely(&lru_gen_draining);
+}
+
+/*
+ * Lazily wait for the draining thread to finish if it's running.
+ *
+ * Return: whether we'd like to reclaim from multi-gen LRU.
+ */
+static inline bool lru_gen_draining_wait(struct lruvec *lruvec, struct scan_control *sc)
+{
+       bool global_enabled = lru_gen_enabled();
+
+       /* Try reclaiming from the current LRU first */
+       if (sc->priority > DEF_PRIORITY / 2)
+               return READ_ONCE(lruvec->lrugen.enabled);
+
+       /* Oops, try from the other side... */
+       if (sc->priority > 1)
+               return global_enabled;
+
+       /*
+        * Once lrugen.enabled here is consistent with the global state, the
+        * migrating thread will have filled the lruvec with some pages by the
+        * time we take the lru spinlock, so we can continue without waiting.
+        */
+       while (global_enabled ^ READ_ONCE(lruvec->lrugen.enabled)) {
+               /* Not switching this one yet. Wait for a while. */
+               schedule_timeout_uninterruptible(1);
+       }
+
+       return global_enabled;
+}
+
 #ifdef CONFIG_LRU_GEN_ENABLED
 DEFINE_STATIC_KEY_ARRAY_TRUE(lru_gen_caps, NR_LRU_GEN_CAPS);
 #define get_cap(cap)   static_branch_likely(&lru_gen_caps[cap])
@@ -5171,6 +5208,8 @@ static void lru_gen_change_state(bool enabled)
        if (enabled == lru_gen_enabled())
                goto unlock;
 
+       static_branch_enable_cpuslocked(&lru_gen_draining);
+
        if (enabled)
                static_branch_enable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]);
        else
@@ -5201,6 +5240,9 @@ static void lru_gen_change_state(bool enabled)
 
                cond_resched();
        } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+       static_branch_disable_cpuslocked(&lru_gen_draining);
+
 unlock:
        mutex_unlock(&state_mutex);
        put_online_mems();
@@ -5752,6 +5794,16 @@ late_initcall(init_lru_gen);
 
 #else /* !CONFIG_LRU_GEN */
 
+static inline bool lru_gen_is_draining(void)
+{
+       return false;
+}
+
+static inline bool lru_gen_draining_wait(struct lruvec *lruvec, struct scan_control *sc)
+{
+       return false;
+}
+
 static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 {
        BUILD_BUG();
@@ -5780,7 +5832,10 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
        bool proportional_reclaim;
        struct blk_plug plug;
 
-       if (lru_gen_enabled() && !root_reclaim(sc)) {
+       if (lru_gen_is_draining() && lru_gen_draining_wait(lruvec, sc)) {
+               lru_gen_shrink_lruvec(lruvec, sc);
+               return;
+       } else if (lru_gen_enabled() && !root_reclaim(sc)) {
                lru_gen_shrink_lruvec(lruvec, sc);
                return;
        }

             reply	other threads:[~2026-03-03  6:37 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-03  6:37 Bingfang Guo [this message]
  -- strict thread matches above, loose matches on Subject: below --
2026-02-28 16:10 Leno Hou
2026-02-28 18:58 ` Andrew Morton
2026-02-28 19:12 ` kernel test robot
2026-02-28 19:23 ` kernel test robot
2026-02-28 20:15 ` kernel test robot
2026-02-28 21:28 ` Barry Song
2026-02-28 22:41   ` Barry Song
2026-03-01  4:10     ` Barry Song
2026-03-02  5:50   ` Yafang Shao
2026-03-02  6:58     ` Barry Song
2026-03-02  7:43       ` Yafang Shao
2026-03-02  8:00         ` Kairui Song
2026-03-02  8:15           ` Barry Song
2026-03-02  8:25           ` Yafang Shao
2026-03-02  9:20             ` Barry Song
2026-03-02  9:47               ` Kairui Song
2026-03-02 14:35                 ` Yafang Shao
2026-03-02 17:51                   ` Yuanchu Xie
2026-03-03  1:34                     ` Barry Song
2026-03-03  1:40                       ` Axel Rasmussen
2026-03-03  2:43                         ` Yafang Shao
2026-03-03  8:27                           ` Bingfang Guo
2026-03-02 16:26           ` Michal Hocko
2026-03-02  8:03         ` Barry Song
2026-03-02  8:13           ` Yafang Shao
2026-03-02  8:20             ` Barry Song
