linux-mm.kvack.org archive mirror
From: David Rientjes <rientjes@google.com>
To: Kairui Song <ryncsn@gmail.com>, Michal Hocko <mhocko@suse.com>,
	 Shakeel Butt <shakeel.butt@linux.dev>
Cc: lsf-pc@lists.linux-foundation.org,
	 Axel Rasmussen <axelrasmussen@google.com>,
	 Yuanchu Xie <yuanchu@google.com>, Wei Xu <weixugc@google.com>,
	 linux-mm <linux-mm@kvack.org>
Subject: Re: [LSF/MM/BPF TOPIC] Improving MGLRU
Date: Thu, 26 Feb 2026 23:11:11 -0800 (PST)
Message-ID: <87c70313-64ce-fb3f-4b0a-8db0526b6c54@google.com>
In-Reply-To: <CAMgjq7BoekNjg-Ra3C8M7=8=75su38w=HD782T5E_cxyeCeH_g@mail.gmail.com>

On Fri, 20 Feb 2026, Kairui Song wrote:

> Hi All,
> 
> Apologies I forgot to add the proper tag in the previous email so
> resending this.
> 
> MGLRU has been introduced in the mainline for years, but we still have two LRUs
> today. There are many reasons MGLRU is still not the only LRU implementation in
> the kernel.
> 
> And I've been looking at a few major issues here:
> 
> 1. Page flag usage: MGLRU uses many more flags (3+ more) than Active/Inactive
> LRU.
> 2. Regressions: MGLRU might cause regression, even though in many workloads it
> outperforms Active/Inactive by a lot.
> 3. Metrics: MGLRU makes some metrics work differently, for example: PSI,
> /proc/meminfo.
> 4. Some reclaim behavior is less controllable.
> 

I think this would be a very useful topic to discuss and I really like how 
this was framed in the context of what needs to be addressed so that MGLRU 
can be on a path to becoming the default implementation and we can 
eliminate two separate implementations.  Yes, MGLRU can form the basis of 
several extensions that are possible, like working set reporting, but its 
existence in the kernel shouldn't be based on future shiny features alone; 
I think priority number one should be ensuring that these issues, as well 
as others, are properly addressed with the goal of having a single unified 
implementation in the kernel that does not regress for end users.

One topic we can add here is oom handling with MGLRU, so adding in Michal 
and Shakeel.  MGLRU has working set protection to avoid thrashing, which 
is enabled by configuring min_ttl_ms in sysfs.  That can end up being very 
useful, and 
would probably be even more useful if there was a per-memcg version of it, 
but it doesn't work well for NUMA.  That's because we get a new oom kill 
context that is triggered from kswapd threads when aging is done, not by 
direct allocators like we're used to:

	/*
	 * The main goal is to OOM kill if every generation from all memcgs is
	 * younger than min_ttl. However, another possibility is all memcgs are
	 * either too small or below min.
	 */
	if (!reclaimable && mutex_trylock(&oom_lock)) {
		struct oom_control oc = {
			.gfp_mask = sc->gfp_mask,
		};

		out_of_memory(&oc);

		mutex_unlock(&oom_lock);
	}

That obviously just calls into the oom killer without any context about 
*which* node we're trying to free memory on.  The worst case scenario is 
that we oom kill every process on a single node without ever freeing 
memory for kswapd's node.

So I doubt that anybody is using this to actively defend against thrashing 
today, at least on NUMA systems.
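
For anyone who wants to experiment with this, the knob is documented as 
/sys/kernel/mm/lru_gen/min_ttl_ms.  Below is a minimal user-space sketch of 
setting it from C.  The helper name set_min_ttl_ms() and its path parameter 
are made up for illustration (the parameter only exists so the helper can 
be tried against a regular file); writing the real knob needs root and a 
kernel built with MGLRU.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Documented MGLRU knob (Documentation/admin-guide/mm/multigen_lru.rst). */
#define LRU_GEN_MIN_TTL_PATH "/sys/kernel/mm/lru_gen/min_ttl_ms"

/*
 * Hypothetical helper: write a TTL in milliseconds to the file at `path`.
 * Returns 0 on success or a negative errno value on failure.
 * In practice, pass LRU_GEN_MIN_TTL_PATH.
 */
static int set_min_ttl_ms(const char *path, unsigned int ttl_ms)
{
	FILE *f;
	int ret = 0;

	assert(path != NULL);

	f = fopen(path, "w");
	if (!f)
		return -errno;

	if (fprintf(f, "%u\n", ttl_ms) < 0)
		ret = -EIO;

	/* The buffered write is flushed here, so a rejected value shows up now. */
	if (fclose(f) != 0 && ret == 0)
		ret = -errno;

	return ret;
}
```

Calling set_min_ttl_ms(LRU_GEN_MIN_TTL_PATH, 1000) would ask MGLRU to 
protect roughly the last second of working set, which is exactly the 
setting that can end up in the kswapd oom kill path quoted above.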

One way to address this would be to consider resident memory on the nodes 
included in oc->nodemask when making oom kill decisions, and then to 
initialize an empty nodemask here, set pgdat->node_id in it, and pass it 
in.  But it should be part of a larger discussion about how we handle 
targeted oom killing on specific NUMA nodes that would be applicable for 
cpusets, mempolicies, etc.
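
To make the idea concrete, here is a rough user-space model of what 
"consider resident memory on the nodes in the nodemask" could mean.  This 
is not kernel code and not a proposed patch: struct task_model, 
nodemask_model_t, and every function name below are hypothetical 
stand-ins, and it assumes per-task, per-node resident page counts are 
available, which is itself part of the problem.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 8	/* illustrative limit, not a kernel constant */

/* Hypothetical stand-in for per-task, per-node RSS accounting. */
struct task_model {
	int pid;
	unsigned long rss_pages[MAX_NODES];	/* resident pages per node */
};

/* One bit per node; stands in for the kernel's nodemask_t. */
typedef unsigned int nodemask_model_t;

/* Start from an empty mask and set only kswapd's node. */
static nodemask_model_t nodemask_of_node(int node)
{
	assert(node >= 0 && node < MAX_NODES);
	return 1u << node;
}

/* Pages that killing the task could free on the nodes in `mask`. */
static unsigned long rss_on_nodes(const struct task_model *t,
				  nodemask_model_t mask)
{
	unsigned long pages = 0;

	for (int n = 0; n < MAX_NODES; n++)
		if (mask & (1u << n))
			pages += t->rss_pages[n];
	return pages;
}

/*
 * Pick the task with the most resident memory on the masked nodes.
 * Returns NULL if no task has any memory there.
 */
static const struct task_model *select_victim(const struct task_model *tasks,
					      size_t nr, nodemask_model_t mask)
{
	const struct task_model *victim = NULL;
	unsigned long best = 0;

	for (size_t i = 0; i < nr; i++) {
		unsigned long pages = rss_on_nodes(&tasks[i], mask);

		if (pages > best) {
			best = pages;
			victim = &tasks[i];
		}
	}
	return victim;
}
```

In this model, a task with 500000 pages on node 0 and one page on node 1 
is never chosen for a node 1 shortage as long as some other task has more 
than one page resident there.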

For cpusets, for example, we only look at the eligibility of a thread to 
allocate on a node, not the amount of anticipated freeing from that node 
on oom kill.  We could trivially do the same thing here for MGLRU, but it 
would kinda suck to go around oom killing processes that only have a 
single page on your target node.  (But, hey, better than the status quo 
today here!)

So we should talk about node-targeted oom killing and how that would make 
sense so that we can wire it up here if min_ttl_ms is to be used for 
MGLRU, at least for NUMA systems.  It's a tricky problem in oom contexts 
to get at the per-thread information that you want to consider when 
determining eligibility, but perhaps an even trickier problem, once you 
have that information, is which heuristic you use to compare processes 
with lots of memory on the system vs lots of memory on the node.
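
As a toy illustration of why the heuristic is hard, consider a score that 
blends system-wide and on-node usage with a single weight.  Everything 
here is an assumption for the sake of discussion: struct task_usage, the 
percentage weight, and the linear blend are invented, and nothing like 
this exists in the oom killer today.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-task summary; the kernel does not expose it this way. */
struct task_usage {
	int pid;
	unsigned long total_pages;	/* resident pages system-wide */
	unsigned long node_pages;	/* resident pages on the target node */
};

/* node_weight_pct: 0 = system-wide usage only, 100 = target node only. */
static unsigned long long blended_score(const struct task_usage *t,
					unsigned int node_weight_pct)
{
	assert(node_weight_pct <= 100);
	return (unsigned long long)t->node_pages * node_weight_pct +
	       (unsigned long long)t->total_pages * (100 - node_weight_pct);
}

/*
 * Pick the highest-scoring task among those with at least one page on
 * the target node.  Returns NULL if there is no such task.
 */
static const struct task_usage *pick_by_blend(const struct task_usage *tasks,
					      size_t nr,
					      unsigned int node_weight_pct)
{
	const struct task_usage *victim = NULL;
	unsigned long long best = 0;

	for (size_t i = 0; i < nr; i++) {
		unsigned long long score;

		/* Killing this task cannot free anything on the node. */
		if (tasks[i].node_pages == 0)
			continue;

		score = blended_score(&tasks[i], node_weight_pct);
		if (!victim || score > best) {
			best = score;
			victim = &tasks[i];
		}
	}
	return victim;
}
```

With made-up numbers, say one task with 1000000 pages in total but a 
single page on the node and another with 5000 pages in total of which 
4000 are on the node, even a 50/50 weighting still picks the first task.  
Only a weight close to "node only" picks the second, so a simple linear 
blend does not obviously give the behavior we want.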

Has this been considered before?  For kswapd induced oom killing like this 
to work, it would have to be solved.

