From: Suren Baghdasaryan <surenb@google.com>
To: Kairui Song <ryncsn@gmail.com>
Cc: lsf-pc@lists.linux-foundation.org,
Axel Rasmussen <axelrasmussen@google.com>,
Yuanchu Xie <yuanchu@google.com>, Wei Xu <weixugc@google.com>,
linux-mm <linux-mm@kvack.org>
Subject: Re: [LSF/MM/BPF] Improving MGLRU
Date: Tue, 24 Feb 2026 09:19:26 -0800
Message-ID: <CAJuCfpFb7opwBdm9fzSeFguCCsq7CAiBY1C4J2o07LR6J74E-g@mail.gmail.com>
In-Reply-To: <CAMgjq7AkYOtUL2HuZjBu5dJw=RTL7W2L1+zVv=SCOyHKYwc3AA@mail.gmail.com>
On Thu, Feb 19, 2026 at 9:10 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> Hi All,
>
> MGLRU has been in mainline for years, but we still have two LRUs today. There
> are many reasons MGLRU is still not the only LRU implementation in the kernel.
>
> And I've been looking at a few major issues here:
>
> 1. Page flag usage: MGLRU uses many more flags (3+ more) than Active/Inactive
> LRU.
> 2. Regressions: MGLRU might cause regressions, even though in many workloads
> it outperforms Active/Inactive by a lot.
> 3. Metrics: MGLRU makes some metrics work differently, for example: PSI,
> /proc/meminfo, smap.
> 4. Some reclaim behavior is less controllable.
>
> And other issues too.
> And I think there isn't a simple solution, but it can definitely be solved. So
> I'd like to propose a session to discuss some ideas about improving MGLRU and
> making it the only LRU in the kernel.
>
> Some parts are just ideas. So far I have a working series [2] following the
> LFU and metric unification ideas below, solving 2) and 3) above, and
> providing some very basic infrastructure for 1). I will try to send that as
> an RFC for easier review and merge once it's stable enough, before LSF/MM/BPF.
>
> So far, I have observed a 30% reduction in total folio refaults in
> some workloads, including TPC-C and YCSB. Several critical regressions
> compared to Active / Inactive are gone, PG_workingset and PG_referenced are
> gone, and things like PSI are more accurate (see below), while staying
> bitwise compatible with Active / Inactive LRU. If this goes smoothly,
> we might be able to unify and have only one LRU.
>
> The following topics and ideas are the key points:
>
> 1. Flags usage: this is solvable, and the hard part is mostly about
> implementation details. MGLRU uses (at least) 3 extra flags for the gen
> number, and we expect it to use more gen flags to support more than 4
> gens. These flags can be moved to the tail of the LRU pointers after carefully
> modifying the kernel's convention on LRU operations. That would allow us to
> use up to 6 bits for the gen number and support up to 63 gens. The lower bits
> of both pointers can be packed together for CAS on gen numbers, reducing
> flag usage by 3. Previously, Yu also suggested moving flags like PG_active to
> the LRU pointer tail, which could also be an option.
>
> struct folio {
>         /* ... */
>         union {
>                 struct list_head lru;
> +               struct lru_gen_list_head lru_gen;
>
> So whenever the folio is on a lruvec, `lru_gen_list_head` is used instead of
> `lru`, and it contains the encoded info. We might be able to move all
> LRU-related flags there.
>
> Ordinary folio lists are still just fine, since `lru` is still there when the
> folio is isolated. But places like folio split will need to check whether
> it's a lruvec folio or a folio on an ordinary list.
>
> This part is just an idea so far. But it might let us have up to 63 gens
> upstream and enable the build for every config.
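
A minimal sketch of the pointer-tail encoding described in item 1, written as
a userspace model rather than kernel code. It assumes list nodes are at least
8-byte aligned, so each pointer has 3 free low bits, and it splits a 6-bit gen
value across `next` and `prev`. The field layout and the helper names
lru_gen_encode(), lru_gen_decode(), lru_gen_next() and lru_gen_prev() are
hypothetical; the proposal only names `struct lru_gen_list_head`. The CAS
update mentioned above is not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: two pointer-sized words with the gen in the low bits. */
struct lru_gen_list_head {
	uintptr_t next;	/* next pointer | low 3 bits of the gen value */
	uintptr_t prev;	/* prev pointer | high 3 bits of the gen value */
};

#define GEN_BITS_PER_PTR	3
#define GEN_PTR_MASK		((uintptr_t)((1u << GEN_BITS_PER_PTR) - 1))
#define GEN_VALUE_MAX		63u	/* largest value that fits in 6 bits */

/* Store both pointers and the gen value; pointers must be 8-byte aligned. */
static inline void lru_gen_encode(struct lru_gen_list_head *h,
				  void *next, void *prev, unsigned int gen)
{
	assert(gen <= GEN_VALUE_MAX);
	assert(((uintptr_t)next & GEN_PTR_MASK) == 0);
	assert(((uintptr_t)prev & GEN_PTR_MASK) == 0);

	h->next = (uintptr_t)next | (gen & GEN_PTR_MASK);
	h->prev = (uintptr_t)prev | ((gen >> GEN_BITS_PER_PTR) & GEN_PTR_MASK);
}

/* Reassemble the 6-bit gen value from the two pointer tails. */
static inline unsigned int lru_gen_decode(const struct lru_gen_list_head *h)
{
	return (unsigned int)(h->next & GEN_PTR_MASK) |
	       ((unsigned int)(h->prev & GEN_PTR_MASK) << GEN_BITS_PER_PTR);
}

/* Pointer accessors must mask the tail bits before the pointer is used. */
static inline void *lru_gen_next(const struct lru_gen_list_head *h)
{
	return (void *)(h->next & ~GEN_PTR_MASK);
}

static inline void *lru_gen_prev(const struct lru_gen_list_head *h)
{
	return (void *)(h->prev & ~GEN_PTR_MASK);
}
```

In this model every list walk has to go through the masking accessors, which
is presumably the kind of change to the LRU operation conventions that the
proposal refers to.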
>
> 2. Regressions: Currently, regressions are the bigger problem for us.
> From our perspective, almost all regressions are caused by an under- or
> overprotected file cache. MGLRU's PID protection gets either too aggressive
> or too passive, or just has too long a latency. To fix that, I'd propose an
> LFU-like design and relax the PID's aggressiveness to make it much more
> proactive and effective for file folios. The idea is to always use 3 bits in
> the page flags to count the number of references (which would also replace
> PG_workingset and PG_referenced). Initial tests showed a 30% reduction in
> refaults, and many regressions are gone. A flow chart of how the MGLRU idea
> might work:
>
> ========== MGLFU Tiering ==========
> Access  3 bit  lru_gen  lru_gen  |(R - PG_referenced | W - PG_workingset)
> Count   L|W|R  refs     tier     |(L - LRU_GEN_REFS)
> 0       0|0|0  0        0        | - Readahead & Cache
> 1       0|0|1  1        0        | - LRU_REFS_REFERENCED
> ----- WORKINGSET / PROMOTE -- <--+ - <move out of min_seq>
> 2       0|1|0  2        0        | - LRU_REFS_WORKINGSET
> 3       0|1|1  3        1        | - Frequently used
> 4       1|0|0* 4        2        |
> 5       1|0|1* 5        2        |
> 6       1|1|0* 6        3        |
> 7       1|1|1* 7        3        | - LRU_REFS_MAX
> --------- PROMOTION ---------> --+ - <promote to next gen>
>
> Once a folio has an access count > LRU_REFS_WORKINGSET, it never goes lower
> than that. Folios that hit LRU_REFS_MAX will be promoted to the next gen on
> access, and the force protection of folios on eviction is removed. This
> provides a more proactive protection.
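
The chart above can be read as a small state machine. The following is a
sketch of that reading in plain C, not code from the series: the constant
values are taken from the chart rows, while the helper names
lru_refs_to_tier() and lru_refs_access() are hypothetical. It also assumes the
counter is left unchanged when a promotion is signalled, which the proposal
does not specify.

```c
#include <assert.h>
#include <stdbool.h>

/* Values as they appear in the chart's "Access Count" column. */
enum {
	LRU_REFS_REFERENCED	= 1,
	LRU_REFS_WORKINGSET	= 2,
	LRU_REFS_MAX		= 7,
};

/* Tier per the chart: refs 0-2 -> 0, 3 -> 1, 4-5 -> 2, 6-7 -> 3. */
static inline unsigned int lru_refs_to_tier(unsigned int refs)
{
	assert(refs <= LRU_REFS_MAX);
	return refs > LRU_REFS_WORKINGSET ? refs / 2 : 0;
}

/*
 * Record one access on a saturating 3-bit counter. Returns true when the
 * counter is already at LRU_REFS_MAX, i.e. when the caller would promote
 * the folio to the next gen. Assumption: the count is not reset here.
 */
static inline bool lru_refs_access(unsigned int *refs)
{
	assert(*refs <= LRU_REFS_MAX);
	if (*refs == LRU_REFS_MAX)
		return true;
	(*refs)++;
	return false;
}
```

Starting from zero, seven accesses saturate the counter and the eighth one
asks for a promotion.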
>
> And this might also give other frameworks like DAMON a nicer interface to
> interact with MGLRU, since the referenced count can promote every folio and
> count accesses in a more reasonable and unified way for MGLRU now.
>
> NOTE: Still changing this design according to test results, e.g. maybe
> we should optionally still use 4 bits, so the final solution might not
> be the same.
>
> Another potential improvement on the regression issue is implementing the
> refault distance as I once proposed [1], which can have a huge gain for some
> workloads with heavy file folio usage. Maybe we can have both.
>
> 3. Metrics: The key here is about the meaning of page flags, including
> PG_workingset and PG_referenced. These two flags are set/cleared very
> differently for MGLRU compared to Active / Inactive LRU, but many other
> components are still using them as metrics for Active / Inactive LRU. Hence,
> I would propose to introduce a different mechanism to unify and replace these
> two flags: Using the 3 bits in the page flags field reserved for LFU-like
> tracking above, to determine the folio status.
>
> Then, following the above LFU-like idea, we can use helpers like:
>
> static inline bool folio_is_referenced(const struct folio *folio)
> {
> 	return folio_lru_refs(folio) >= LRU_REFS_REFERENCED;
> }
>
> static inline bool folio_is_workingset(const struct folio *folio)
> {
> 	return folio_lru_refs(folio) >= LRU_REFS_WORKINGSET;
> }
>
> static inline bool folio_is_referenced_by_bit(struct folio *folio)
> { /* For compatibility */
> 	return !!(READ_ONCE(*folio_flags(folio, 0)) & BIT(LRU_REFS_PGOFF));
> }
>
> static inline void folio_mark_workingset_by_bit(struct folio *folio)
> { /* For compatibility */
> 	set_mask_bits(folio_flags(folio, 0), BIT(LRU_REFS_PGOFF + 1),
> 		      BIT(LRU_REFS_PGOFF + 1));
> }
>
> These helpers tell whether a folio belongs to a working set or is referenced.
> The definition of workingset will be simplified as follows: folios referenced
> more than twice for MGLRU, decoupled from MGLRU's tiering.
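
To see how the count-based and bit-based helpers relate, here is a userspace
model of the quoted helpers. It is only a sketch: `struct model_folio`, the
LRU_REFS_PGOFF value of 8 and the LRU_REFS_WIDTH/LRU_REFS_MASK macros are
assumptions made for illustration, the threshold values come from the tiering
chart, and the plain loads and stores stand in for READ_ONCE() and
set_mask_bits().

```c
#include <assert.h>
#include <stdbool.h>

#define LRU_REFS_PGOFF		8	/* hypothetical bit offset in flags */
#define LRU_REFS_WIDTH		3	/* L|W|R, R being the lowest bit */
#define LRU_REFS_MASK		(((1UL << LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
#define LRU_REFS_REFERENCED	1
#define LRU_REFS_WORKINGSET	2

static_assert(LRU_REFS_PGOFF + LRU_REFS_WIDTH <= sizeof(unsigned long) * 8,
	      "refs field must fit in the flags word");

/* Stand-in for struct folio: only the flags word matters here. */
struct model_folio {
	unsigned long flags;
};

static inline unsigned int folio_lru_refs(const struct model_folio *f)
{
	return (unsigned int)((f->flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF);
}

static inline bool folio_is_referenced(const struct model_folio *f)
{
	return folio_lru_refs(f) >= LRU_REFS_REFERENCED;
}

static inline bool folio_is_workingset(const struct model_folio *f)
{
	return folio_lru_refs(f) >= LRU_REFS_WORKINGSET;
}

/* Compatibility view: tests only the lowest (R) bit of the field. */
static inline bool folio_is_referenced_by_bit(const struct model_folio *f)
{
	return !!(f->flags & (1UL << LRU_REFS_PGOFF));
}

/* Compatibility view: sets only the W bit (non-atomic in this model). */
static inline void folio_mark_workingset_by_bit(struct model_folio *f)
{
	f->flags |= 1UL << (LRU_REFS_PGOFF + 1);
}
```

In this model a count of 2 (0|1|0) is referenced by count while its R bit is
clear, so the two "referenced" views can disagree. Whether callers of the
compatibility helpers can observe that case depends on details the proposal
does not spell out.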
>
> 4. MGLRU's swappiness is kind of useless in some situations compared to
> Active / Inactive LRU, since it force-protects the youngest two gens, so
> quite often we can only reclaim one type of folio. To work around that, the
> user usually runs force aging before reclaim. So, can we just remove the
> force protection of the youngest two gens?
>
> 5. Async aging and aging optimization are also required to make the above ideas
> work better.
>
> 6. Other issues and discussion on whether the above improvements will help
> solve them or make them worse. e.g.
>
> For the eBPF extension, using eBPF to determine which gen a folio should
> land in given the shadow, once we have more than 4 gens, might be very
> helpful and enough for many workload customizations.
>
> Can we just ignore the shadow for anon folios? MGLRU basically activates
> anon folios unconditionally. Especially if combined with the LFU-like
> idea above, we might only want to track the 3 bit count and get rid of
> the extra bit usage in the shadow. The eviction performance might be even
> better, and other components like swap table [3] will have more bits to use
> for better performance and more features.
>
> The goal is:
>
> - Reduce MGLRU's page flag usage to be identical or less compared to Active /
> Inactive LRU.
> - Eliminate regressions.
> - Unify or improve the metrics.
> - Provide more extensibility.
There might be some overlap with this topic proposal:
https://lore.kernel.org/all/cb0c0a0bfc7247cf85858eecf0db6eca@honor.com/
but either way I'm interested in participating, especially on the
topics of regressions and reclaim behavior as it's very relevant for
Android.
>
> Link: https://lwn.net/Articles/945266/ [1]
> Link: https://github.com/ryncsn/linux/tree/improving-mglru [2]
> Link: https://lore.kernel.org/linux-mm/20260218-swap-table-p3-v3-5-f4e34be021a7@tencent.com/ [3]
>