linux-mm.kvack.org archive mirror
From: Barry Song <21cnbao@gmail.com>
To: Gregory Price <gourry@gourry.net>
Cc: willy@infradead.org, axelrasmussen@google.com,
	linux-mm@kvack.org,  lsf-pc@lists.linux-foundation.org,
	ryncsn@gmail.com, weixugc@google.com,  yuanchu@google.com
Subject: Re: [LSF/MM/BPF] Improving MGLRU
Date: Thu, 5 Mar 2026 14:27:27 +0800	[thread overview]
Message-ID: <CAGsJ_4xsN2Kfa_f_WaZ-h9Ex7dHk6okyZxnW6oEcXW=kLXwLXw@mail.gmail.com> (raw)
In-Reply-To: <aaXM7xNSJaJBsety@gourry-fedora-PF4VCD3F>

On Tue, Mar 3, 2026 at 1:46 AM Gregory Price <gourry@gourry.net> wrote:
>
> On Fri, Feb 27, 2026 at 12:31:39PM +0800, Barry Song wrote:
> > >> MGLRU has been introduced in the mainline for years, but we still have two LRUs
> > >> today. There are many reasons MGLRU is still not the only LRU implementation in
> > >> the kernel.
> >
> > > To my mind, the biggest problem with MGLRU is that Google dumped it on us
> > > and ran away.  Commit 44958000bada claimed that it was now maintained and
> > > added three people as maintainers.  In the six months since that commit,
> > > none of those three people have any commits in mm/!  This is a shameful
> > > state of affairs.
> > >
> > > I say rip it out.
> >
> > Hi Matthew,
> > Can we keep it for now? Kairui, Zicheng, and I are working on it.
> >
> > From what I’ve seen, it performs much better than the active/inactive
> > approach after applying a few vendor hooks on Android, such as forced
> > aging and avoiding direct activation of read-ahead folios during page
> > faults, among others. To be honest, performance was worse than
> > active/inactive without those hooks, which are still not in mainline.
> >
> > It just needs more work. MGLRU has many strong design aspects, including
> > using more generations to differentiate cold from hot, the look-around
> > mechanism to reduce scanning overhead by leveraging cache locality,
> > and data structure designs that minimize lock holding.
>
> In presentations where the distribution of generations is shown for
> different workloads, I've seen many bi-modal distributions for MGLRU
> (where oldest and youngest contain the bulk of the folios).
>
> It makes the value of multiple generations questionable - especially at
> the level MGLRU emulates it right now (multiple generations PLUS multiple
> tiers within those generations).
>
> One of the issues with MGLRU is it's actually quite difficult to
> determine which feature it introduces (there are 7 or 8 major features)
> is responsible for producing any given effect on a workload.

True. MGLRU has multiple features:

1. lru_gen_look_around — exploits spatial locality by scanning
adjacent PTEs of a young PTE.

This is also beneficial for active/inactive LRU, as it helps reduce
rmap cost. The Android kernel once had a hook to enable it for the
active/inactive LRU:

https://android.googlesource.com/kernel/common.git/+/76541556a9a3540
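Roughly, the idea looks like this in userspace-toy terms (not the real
kernel code; pte_young(), LOOK_AROUND and so on are invented names just
to illustrate the batching):

```c
/*
 * Toy sketch of the look-around idea, NOT kernel code: once one "PTE"
 * in a window is found young, scan its neighbours too, so a single
 * rmap walk can age/promote a whole batch of folios.
 */
#include <stdbool.h>
#include <stddef.h>

#define LOOK_AROUND 8	/* half-window of neighbouring PTEs to scan */

/* one flag per PTE: referenced since the last scan? */
static bool pte_young(const unsigned char *young, size_t i)
{
	return young[i];
}

/*
 * Starting from a young PTE at index @hit, scan up to LOOK_AROUND
 * PTEs on each side, clear their young flags, and return how many
 * folios this single walk would promote.
 */
size_t look_around(unsigned char *young, size_t n, size_t hit)
{
	size_t lo = hit > LOOK_AROUND ? hit - LOOK_AROUND : 0;
	size_t hi = hit + LOOK_AROUND < n ? hit + LOOK_AROUND : n - 1;
	size_t promoted = 0;

	for (size_t i = lo; i <= hi; i++) {
		if (pte_young(young, i)) {
			young[i] = 0;	/* "clear the access bit" */
			promoted++;
		}
	}
	return promoted;
}
```

One rmap walk that would have promoted one folio instead promotes every
young neighbour in the window, which is where the rmap-cost saving
comes from.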

2. page table walks for aging — further exploits spatial locality.
The aging path prefers walking page tables to look for young PTEs
and promote hot folios.

I didn’t observe any improvement on ARM64 Android, but I did notice
increased mmap_lock contention. Disabling it actually reduced CPU
usage, rather than increasing it as the patch claimed. Perhaps this is
because ARM64 lacks a non-leaf young bit, making the scanning cost
quite high?

3. fallback to the other type when one type has only two generations.

isolate_folios():
                scanned = scan_folios(nr_to_scan, lruvec, sc, type, tier, list);
                /* scanned will be 0 if this type has only two gens */
                if (scanned)
                        return scanned;

                type = !type;

This seems to be a major issue with MGLRU, making swappiness largely
ineffective. People have been complaining about over-reclamation of
file pages even when they set a high swappiness to prefer reclaiming
anonymous pages.
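A toy model of that fallback (not the real code; all names here are
invented) shows why swappiness loses its bite — the preferred type is
silently abandoned as soon as it has nothing eligible to scan:

```c
/*
 * Toy sketch of the point-3 fallback, NOT kernel code: if the
 * preferred type is down to MIN_NR_GENS generations, scanning it
 * yields nothing, and MGLRU flips to the other type regardless of
 * what swappiness asked for.
 */
#define LRU_ANON 0
#define LRU_FILE 1
#define MIN_NR_GENS 2

struct lruvec_toy { int nr_gens[2]; };	/* generations per type */

/* pretend scan: nothing to take from a type at MIN_NR_GENS */
static int scan_folios_toy(const struct lruvec_toy *l, int type)
{
	return l->nr_gens[type] > MIN_NR_GENS ? 32 : 0;
}

/* returns the type that actually gets reclaimed */
int isolate_folios_toy(const struct lruvec_toy *l, int preferred)
{
	int type = preferred;

	for (int i = 0; i < 2; i++) {
		if (scan_folios_toy(l, type))
			return type;
		type = !type;	/* the fallback in question */
	}
	return -1;		/* nothing to scan at all */
}
```

So a user who set a high swappiness to prefer anon can still see file
pages reclaimed whenever anon happens to be at its generation floor.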

4. very aggressive promotion of mapped folios.

Active/inactive LRU relies on scanning and detecting young PTEs to
promote mapped folios from inactive to active, whereas MGLRU
promotes mapped folios directly to the youngest generation.

The active/inactive LRU should be able to keep read-ahead and
fault-around folios that haven’t actually been accessed on the inactive
list, but MGLRU promotes all of them indiscriminately.

This can sometimes be appropriate, but it often overshoots:

void folio_add_lru(struct folio *folio)
{
        VM_BUG_ON_FOLIO(folio_test_active(folio) &&
                        folio_test_unevictable(folio), folio);
        VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);

        /* see the comment in lru_gen_folio_seq() */
        if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
            lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
                folio_set_active(folio);

        folio_batch_add_and_move(folio, lru_add);
}

In particular, I observed that read-ahead folios triggered by faults were
being promoted, which significantly degrades MGLRU performance on
low-memory devices. I attempted to mitigate this by:

https://lore.kernel.org/linux-mm/20260225223712.3685-1-21cnbao@gmail.com/

5. min_ttl_ms - thrashing prevention

This might be a good option, but I’ve noticed that people often don’t
know how to use it or how to integrate it with the Android OOM
killer. As a result, I see users leaving it untouched. I’m not sure if
any Android users are actually using it—if there are, please let me
know.

6. gen, tier, bloom filter

These replace active/inactive and handle the scan balance between anon
and file by comparing how the two types age.
I’m not sure these are definitely better, but they do seem much more
complex than active/inactive.

7. Missing shrink_active_list() — the function to demote folios from
active to inactive.

The active/inactive LRU performs rmap walks and scans PTEs to demote
folios from active to inactive before reclamation. MGLRU, however,
seems to only ever promote: it finds young folios and moves them to
the new generation, while older folios automatically drift into the
old generation.

This seems to reduce reclamation cost significantly, as folio_referenced()
would otherwise need to perform rmap and scan PTEs in each process
to clear access bits in shrink_active_list().
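The implicit-aging trick can be sketched in a few lines of toy
userspace C (not the real data structures; folio_toy, lrugen_toy etc.
are invented): a folio's age is max_seq minus its own stamp, so aging
everything costs one increment and zero per-folio work.

```c
/*
 * Toy sketch of point 7, NOT kernel code: MGLRU has no
 * shrink_active_list().  Aging is just max_seq++; only referenced
 * folios are touched (restamped to max_seq), while everything else
 * grows old implicitly.
 */
struct folio_toy { unsigned long seq; };	/* generation stamp */

struct lrugen_toy { unsigned long max_seq; };

/* "promotion": a referenced folio joins the youngest generation */
void folio_mark_young(struct lrugen_toy *g, struct folio_toy *f)
{
	f->seq = g->max_seq;
}

/* "aging": everyone not promoted gets one generation older, for free */
void inc_max_seq(struct lrugen_toy *g)
{
	g->max_seq++;
}

unsigned long folio_age(const struct lrugen_toy *g,
			const struct folio_toy *f)
{
	return g->max_seq - f->seq;
}
```

Contrast that with shrink_active_list(), which has to visit every
active folio (and rmap-walk it) just to decide it is no longer young.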

Points 1 and 7 might explain why we have observed MGLRU showing lower
CPU usage than active/inactive.

8. swappiness concept difference.

In active/inactive LRU, even with swappiness set to 0, anonymous pages
still have a chance to be reclaimed if file pages run out.

In MGLRU, setting swappiness=0 effectively disables anon reclamation,
which can lead to cold/hot inversion of anon pages:

inc_min_seq():

        /* For anon type, skip the check if swappiness is zero (file only) */
        if (!type && !swappiness)
                goto done;

        /* For file type, skip the check if swappiness is anon only */
        if (type && (swappiness == SWAPPINESS_ANON_ONLY))
                goto done;

        /* prevent cold/hot inversion if the type is evictable */
        for (zone = 0; zone < MAX_NR_ZONES; zone++) {
                struct list_head *head = &lrugen->folios[old_gen][type][zone];

I wonder if swappiness=201 could cause the same cold/hot inversion for
file pages: when people set swappiness=201 to force shrinking anonymous
pages only, it might put file folios at risk?
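A toy model of what "skipping the check" does (again not the real code;
type_toy, carry, NR_FOLIOS are invented for illustration): on the
normal path the oldest generation's folios are carried forward before
min_seq advances; on the swappiness-skip path min_seq advances anyway
and strands them, which is exactly the inversion risk:

```c
/*
 * Toy sketch of point 8, NOT kernel code: when swappiness excludes a
 * type, inc_min_seq() advances min_seq for that type without carrying
 * the oldest generation's folios forward, so a still-hot folio left
 * behind suddenly looks like the coldest of all.
 */
#define NR_FOLIOS 4

struct type_toy {
	unsigned long min_seq;
	unsigned long seq[NR_FOLIOS];	/* per-folio generation stamps */
};

/* carry == 1: normal path; carry == 0: the swappiness-skip path */
void inc_min_seq_toy(struct type_toy *t, int carry)
{
	if (carry) {
		/* move oldest-gen folios up one gen before advancing */
		for (int i = 0; i < NR_FOLIOS; i++)
			if (t->seq[i] == t->min_seq)
				t->seq[i]++;
	}
	t->min_seq++;
}

/* folios stranded below min_seq are the cold/hot-inversion victims */
int nr_stranded(const struct type_toy *t)
{
	int n = 0;

	for (int i = 0; i < NR_FOLIOS; i++)
		if (t->seq[i] < t->min_seq)
			n++;
	return n;
}
```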

Together with point 3, this means MGLRU’s swappiness has a much less
predictable effect on the file/anon reclaim balance than
active/inactive’s, which is a significant behavioral difference
between the two.

Considering all of the above, I feel MGLRU is quite different from
active/inactive. Trying to unify them seems like merging two
completely different approaches. Still, active/inactive might have
some useful lessons to learn from MGLRU, particularly on how to
reduce reclamation cost.

>
> In a random test over the weekend where I turned everything but
> multiple generations off (no page table scan, no bloom filter, etc -
> MGLRU just defaults to a multi-gen FIFO) I found that streaming
> workloads did better this way.

I understand your point; I'd say there will always be cases where LRU
is not the most suitable algorithm.

Perhaps an eBPF-programmable LRU could also be a direction worth
exploring: we could load different eBPF programs for different
workloads. There is a project in this area:

https://dl.acm.org/doi/pdf/10.1145/3731569.3764820
https://github.com/cache-ext/cache_ext

>
> Makes sense, given that MGLRU is trying to protect working set,
> but I didn't expect it to be that dramatic.
>
> It seems at best problematic to argue "We just need more heuristics!",
> but clearly MGLRU "works, for some definition of the word works".

The in-kernel LRU probably aims to be suitable for most workloads,
but not “good” enough for all of them :-)

Thanks
Barry


