From: David Rientjes <rientjes@google.com>
To: Kairui Song <ryncsn@gmail.com>, Michal Hocko <mhocko@suse.com>,
Shakeel Butt <shakeel.butt@linux.dev>
Cc: lsf-pc@lists.linux-foundation.org,
Axel Rasmussen <axelrasmussen@google.com>,
Yuanchu Xie <yuanchu@google.com>, Wei Xu <weixugc@google.com>,
linux-mm <linux-mm@kvack.org>
Subject: Re: [LSF/MM/BPF TOPIC] Improving MGLRU
Date: Thu, 26 Feb 2026 23:11:11 -0800 (PST)
Message-ID: <87c70313-64ce-fb3f-4b0a-8db0526b6c54@google.com>
In-Reply-To: <CAMgjq7BoekNjg-Ra3C8M7=8=75su38w=HD782T5E_cxyeCeH_g@mail.gmail.com>
On Fri, 20 Feb 2026, Kairui Song wrote:
> Hi All,
>
> Apologies I forgot to add the proper tag in the previous email so
> resending this.
>
> MGLRU has been introduced in the mainline for years, but we still have two LRUs
> today. There are many reasons MGLRU is still not the only LRU implementation in
> the kernel.
>
> And I've been looking at a few major issues here:
>
> 1. Page flag usage: MGLRU uses many more flags (3+ more) than Active/Inactive
> LRU.
> 2. Regressions: MGLRU might cause regression, even though in many workloads it
> outperforms Active/Inactive by a lot.
> 3. Metrics: MGLRU makes some metrics work differently, for example: PSI,
> /proc/meminfo.
> 4. Some reclaim behavior is less controllable.
>
I think this would be a very useful topic to discuss, and I really like
how it was framed: what needs to be addressed so that MGLRU can be on a
path to becoming the default implementation, letting us eliminate one of
the two separate implementations. Yes, MGLRU can form the basis of
several possible extensions, like working set reporting, but its
existence in the kernel shouldn't be justified by future shiny features
alone. Priority number one should be ensuring that these issues, and
others, are properly addressed, with the goal of a single unified
implementation in the kernel that does not regress for end users.
One topic we could add here is oom handling with MGLRU, so I'm adding
Michal and Shakeel. MGLRU provides working set protection to avoid
thrashing, configured through min_ttl_ms in sysfs. That can end up being
very useful, and would probably be even more useful with a per-memcg
version, but it doesn't work well on NUMA. That's because we get a new
oom kill context, triggered from kswapd threads when aging is done rather
than by direct allocators like we're used to:
	/*
	 * The main goal is to OOM kill if every generation from all memcgs is
	 * younger than min_ttl. However, another possibility is all memcgs are
	 * either too small or below min.
	 */
	if (!reclaimable && mutex_trylock(&oom_lock)) {
		struct oom_control oc = {
			.gfp_mask = sc->gfp_mask,
		};

		out_of_memory(&oc);

		mutex_unlock(&oom_lock);
	}
That obviously just calls into the oom killer without any context about
*which* node we're trying to free memory on. The worst case is that we
oom kill process after process without ever freeing any memory on
kswapd's node.
So I doubt that anybody is using this to actively defend against thrashing
today, at least on NUMA systems.
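For reference, the knob in question lives under the lru_gen sysfs
directory; a minimal example of enabling the protection (the 1000 ms
value is purely illustrative, not a recommendation):

```shell
# Protect the working set of the last 1000 ms; when reclaim cannot
# honor that, MGLRU falls through to the OOM killer path quoted above.
echo 1000 > /sys/kernel/mm/lru_gen/min_ttl_ms
cat /sys/kernel/mm/lru_gen/min_ttl_ms
```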
One way to address this would be to consider resident memory on the nodes
included in oc->nodemask when making oom kill decisions, initializing a
nodemask here with pgdat->node_id set and passing it in. But that should
be part of a larger discussion about how we handle targeted oom killing
on specific NUMA nodes, which would also be applicable to cpusets,
mempolicies, etc.
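Concretely, the plumbing could look something like the sketch below. The
oom_control fields are real, but the overall wiring is hypothetical and
untested, and it only helps if victim selection learns to use the
nodemask:

```c
	/*
	 * Hypothetical sketch, not a tested patch: scope the
	 * kswapd-triggered OOM kill to the node kswapd is reclaiming.
	 */
	if (!reclaimable && mutex_trylock(&oom_lock)) {
		nodemask_t nodes = NODE_MASK_NONE;
		struct oom_control oc = {
			.gfp_mask = sc->gfp_mask,
			.nodemask = &nodes,
		};

		node_set(pgdat->node_id, &nodes);

		/*
		 * For this to matter, the badness heuristic would also
		 * need to weight each task's pages resident on
		 * oc->nodemask, which is the open question.
		 */
		out_of_memory(&oc);

		mutex_unlock(&oom_lock);
	}
```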
For cpusets, for example, we only look at a thread's eligibility to
allocate on a node, not the amount of memory we anticipate freeing from
that node on oom kill. We could trivially do the same thing here for
MGLRU, but it would kinda suck to go around oom killing processes that
only have a single page on your target node. (But, hey, better than the
status quo today!)
So we should talk about node-targeted oom killing and how it would make
sense, so that we can wire it up here if min_ttl_ms is to be usable with
MGLRU, at least on NUMA systems. In oom contexts it's a tricky problem to
get at the per-thread information you want to use to determine
eligibility, and perhaps an even trickier problem, once you have that
information, to decide what heuristic to use to compare processes with
lots of memory on the system vs lots of memory on the node.

Has this been considered before? For kswapd-induced oom killing like this
to work, it would have to be solved.
Thread overview: 11+ messages
2026-02-19 17:25 Kairui Song
2026-02-20 18:24 ` Johannes Weiner
2026-02-21 6:03 ` Kairui Song
2026-02-26 1:55 ` Kalesh Singh
2026-02-26 3:06 ` Kairui Song
2026-02-26 10:10 ` wangzicheng
2026-02-26 15:54 ` Matthew Wilcox
2026-02-27 4:31 ` [LSF/MM/BPF] " Barry Song
2026-02-27 3:30 ` Barry Song
2026-02-27 7:11 ` David Rientjes [this message]
2026-02-27 10:29 ` [LSF/MM/BPF TOPIC] " Vernon Yang