Re: [LSF/MM/BPF] Improving MGLRU

From: Suren Baghdasaryan <surenb@google.com>
To: Kairui Song <ryncsn@gmail.com>
Cc: lsf-pc@lists.linux-foundation.org,
	 Axel Rasmussen <axelrasmussen@google.com>,
	Yuanchu Xie <yuanchu@google.com>,  Wei Xu <weixugc@google.com>,
	linux-mm <linux-mm@kvack.org>
Subject: Re: [LSF/MM/BPF] Improving MGLRU
Date: Tue, 24 Feb 2026 09:19:26 -0800
Message-ID: <CAJuCfpFb7opwBdm9fzSeFguCCsq7CAiBY1C4J2o07LR6J74E-g@mail.gmail.com>
In-Reply-To: <CAMgjq7AkYOtUL2HuZjBu5dJw=RTL7W2L1+zVv=SCOyHKYwc3AA@mail.gmail.com>

On Thu, Feb 19, 2026 at 9:10 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> Hi All,
>
> MGLRU has been in mainline for years, but we still have two LRUs
> today. There are many reasons MGLRU is still not the only LRU implementation in
> the kernel.
>
> And I've been looking at a few major issues here:
>
> 1. Page flag usage: MGLRU uses many more flags (3+ more) than Active/Inactive
> LRU.
> 2. Regressions: MGLRU might cause regressions, even though in many workloads it
> outperforms Active/Inactive by a lot.
> 3. Metrics: MGLRU makes some metrics work differently, for example: PSI,
> /proc/meminfo, smaps.
> 4. Some reclaim behavior is less controllable.
>
> There are other issues too.
> I don't think there is a simple solution, but these can definitely be solved. I
> would like to propose a session to discuss a few ideas for improving MGLRU and
> making it the only LRU, so that we can finally have just one LRU in the kernel.
>
> Some parts are just ideas. So far I have a working series [2] that follows the
> LFU and metric unification ideas below; it solves 2) and 3) above and
> provides some very basic infrastructure for 1). I will try to send it as an
> RFC, for easier review and merging, once it is stable enough, before LSF/MM/BPF.
>
> So far, I have observed a 30% reduction in total folio refaults in
> some workloads, including Tpcc and YCSB. Several critical regressions
> compared to Active / Inactive are gone, PG_workingset and PG_referenced are
> gone, metrics like PSI are more accurate (see below), and everything still
> stays bitwise compatible with Active / Inactive LRU. If this goes smoothly,
> we might be able to unify and have only one LRU.
>
> The following topics and ideas are the key points:
>
> 1. Flags usage: this is solvable, and the hard part is mostly about
>    implementation details. MGLRU uses (at least) 3 extra flags for the gen
>    number, and we expect it to use more gen flags to support more than 4
>    gens. These flags can be moved to the tail of the LRU pointers after carefully
>    modifying the kernel's convention on LRU operations. That would allow us to
>    use up to 6 bits for the gen number and support up to 63 gens. The lower bits
>    of both pointers can be packed together for CAS on gen numbers, reducing
>    flag usage by 3. Previously, Yu also suggested moving flags like PG_active to
>    the LRU pointer tail, which could also be an option.
>
>    struct folio {
>        /* ... */
>        union {
>                struct list_head lru;
>    +           struct lru_gen_list_head lru_gen;
>
>    So whenever the folio is on a lruvec, `lru_gen_list_head`, which contains the
>    encoded info, is used instead of `lru`. We might be able to move all
>    LRU-related flags there.
>
>    Ordinary folio lists are still just fine, since when the folio is isolated,
>    `lru` is still there. But places like folio split will need to check whether
>    it is a lruvec folio or a folio on an ordinary list.
>
>    This part is still just an idea, but it might let us have up to 63 gens
>    upstream and enable the build for every config.
>
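
To make the pointer-tail idea above concrete, here is a minimal user-space
sketch. It is not the layout used by the series in [2]: the field names, the
`lgl_*` helpers and the 8-byte alignment are assumptions made for
illustration, and the updates are plain stores rather than the CAS the
proposal mentions. With 8-byte-aligned list nodes, each pointer has 3 spare
low bits, so the two pointers together can hold a 6-bit gen number.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: 3 spare low bits per pointer, 6 bits in total. */
#define GEN_BITS_PER_PTR 3u
#define GEN_PTR_MASK     ((uintptr_t)((1u << GEN_BITS_PER_PTR) - 1))

struct lru_gen_list_head {
	_Alignas(8) uintptr_t next; /* pointer | low 3 bits of gen */
	uintptr_t prev;             /* pointer | high 3 bits of gen */
};

/* Strip the encoded gen bits to recover the real neighbour pointer. */
static inline struct lru_gen_list_head *lgl_ptr(uintptr_t word)
{
	return (struct lru_gen_list_head *)(word & ~GEN_PTR_MASK);
}

/* Reassemble the 6-bit gen number from the tails of both pointers. */
static inline unsigned int lgl_gen(const struct lru_gen_list_head *h)
{
	return (unsigned int)(h->next & GEN_PTR_MASK) |
	       ((unsigned int)(h->prev & GEN_PTR_MASK) << GEN_BITS_PER_PTR);
}

/* Non-atomic update; a real implementation would need CAS or a lock. */
static inline void lgl_set_gen(struct lru_gen_list_head *h, unsigned int gen)
{
	assert(gen < 64);
	h->next = (h->next & ~GEN_PTR_MASK) | (gen & GEN_PTR_MASK);
	h->prev = (h->prev & ~GEN_PTR_MASK) |
		  ((gen >> GEN_BITS_PER_PTR) & GEN_PTR_MASK);
}

/* Empty list node pointing at itself, with gen 0. */
static inline void lgl_init(struct lru_gen_list_head *h)
{
	h->next = (uintptr_t)h;
	h->prev = (uintptr_t)h;
}
```

Every list walk has to go through something like lgl_ptr() in this model,
which is why the proposal talks about changing the kernel's convention on
LRU operations.
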
> 2. Regressions: Currently, regressions are the bigger problem for us.
>    From our perspective, almost all regressions are caused by an under- or
>    overprotected file cache. MGLRU's PID protection is either too aggressive,
>    too passive, or simply has too long a latency. To fix that, I'd propose an
>    LFU-like design and relax the PID's aggressiveness to make it much more
>    proactive and effective for file folios. The idea is to always use 3 bits in
>    the page flags to count the number of references (which would also replace
>    PG_workingset and PG_referenced). Initial tests showed a 30% reduction in
>    refaults, and many regressions are gone. A flow chart of how the MGLRU idea
>    might work:
>
>    ========== MGLFU Tiering ==========
>    Access  3 bit    lru_gen  lru_gen |(R - PG_referenced | W - PG_workingset)
>    Count   L|W|R    refs     tier    |(L - LRU_GEN_REFS)
>    0       0|0|0    0        0       | - Readahead & Cache
>    1       0|0|1    1        0       | - LRU_REFS_REFERENCED
>    ----- WORKINGSET / PROMOTE --- <--+ - <move out of min_seq>
>    2       0|1|0    2        0       | - LRU_REFS_WORKINGSET
>    3       0|1|1    3        1       | - Frequently used
>    4       1|0|0*   4        2       |
>    5       1|0|1*   5        2       |
>    6       1|1|0*   6        3       |
>    7       1|1|1*   7        3       | - LRU_REFS_MAX
>    ---------- PROMOTION ----------> --+ - <promote to next gen>
>
>    Once a folio has an access count > LRU_REFS_WORKINGSET, it never goes lower
>    than that. Folios that hit LRU_REFS_MAX will be promoted to the next gen on
>    access, and the forced protection of folios on eviction is removed. This
>    provides more proactive protection.
>
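
The chart above can be read as a small lookup, sketched here in user-space C.
The constant values and the refs-to-tier mapping are transcribed from the
chart; the function names are hypothetical. One assumption is worth calling
out: the sketch treats "promoted to next gen on access" as triggering on an
access that finds the counter already saturated, and it leaves the counter
untouched afterwards, since the proposal does not say what happens to it.

```c
#include <assert.h>
#include <stdbool.h>

/* Values transcribed from the "MGLFU Tiering" chart. */
enum {
	LRU_REFS_REFERENCED = 1,
	LRU_REFS_WORKINGSET = 2,
	LRU_REFS_MAX        = 7,
};

/* Index: 3-bit access count. Value: tier column of the chart. */
static const unsigned char refs_to_tier[LRU_REFS_MAX + 1] = {
	0, 0, 0, 1, 2, 2, 3, 3,
};

/* Hypothetical helper: map the access count to its tier. */
static inline unsigned int lru_tier_from_refs_sketch(unsigned int refs)
{
	assert(refs <= LRU_REFS_MAX);
	return refs_to_tier[refs];
}

/*
 * Hypothetical helper: record one access with a saturating counter.
 * Returns true when the counter was already at LRU_REFS_MAX, which this
 * sketch assumes is the "promote to next gen" case.
 */
static inline bool lru_refs_access_sketch(unsigned int *refs)
{
	if (*refs >= LRU_REFS_MAX)
		return true;
	(*refs)++;
	return false;
}
```

Starting from 0, seven accesses saturate the counter and the eighth is the
first one reported as a promotion candidate in this model.
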
>    And this might also give other frameworks like DAMON a nicer interface to
>    interact with MGLRU, since the referenced count can promote every folio and
>    count accesses in a more reasonable and unified way for MGLRU now.
>
>    NOTE: Still changing this design according to test results, e.g. maybe
>    we should optionally still use 4 bits, so the final solution might not
>    be the same.
>
>    Another potential improvement on the regression issue is implementing the
>    refault distance as I once proposed [1], which can have a huge gain for some
>    workloads with heavy file folio usage. Maybe we can have both.
>
> 3. Metrics: The key here is about the meaning of page flags, including
>    PG_workingset and PG_referenced. These two flags are set/cleared very
>    differently for MGLRU compared to Active / Inactive LRU, but many other
>    components are still using them as metrics for Active / Inactive LRU. Hence,
>    I would propose to introduce a different mechanism to unify and replace these
>    two flags: Using the 3 bits in the page flags field reserved for LFU-like
>    tracking above, to determine the folio status.
>
>    Then, following the above LFU-like idea, we can use helpers like:
>
>    static inline bool folio_is_referenced(const struct folio *folio)
>    {
>     return folio_lru_refs(folio) >= LRU_REFS_REFERENCED;
>    }
>
>    static inline bool folio_is_workingset(const struct folio *folio)
>    {
>     return folio_lru_refs(folio) >= LRU_REFS_WORKINGSET;
>    }
>
>    static inline bool folio_is_referenced_by_bit(struct folio *folio)
>    {    /* For compatibility */
>     return !!(READ_ONCE(*folio_flags(folio, 0)) & BIT(LRU_REFS_PGOFF));
>    }
>
>    static inline void folio_mark_workingset_by_bit(struct folio *folio)
>    {    /* For compatibility */
>     set_mask_bits(folio_flags(folio, 0), BIT(LRU_REFS_PGOFF + 1),
>                   BIT(LRU_REFS_PGOFF + 1));
>    }
>
>    These tell whether a folio belongs to a working set or is referenced. The
>    definition of workingset will be simplified as follows: the set of folios
>    referenced more than twice under MGLRU, decoupled from MGLRU's tiering.
>
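
For readers who want to see how the counter-based and bit-based helpers
relate, here is a user-space model of the 3-bit field. The struct, the
`_sketch` names and the bit offset of 8 are all assumptions (the real offset
depends on the page-flags layout), and the stores are not atomic. The values
1 and 2 for LRU_REFS_REFERENCED and LRU_REFS_WORKINGSET are taken from the
tiering chart.

```c
#include <assert.h>
#include <stdbool.h>

#define LRU_REFS_PGOFF_SKETCH 8u /* hypothetical bit offset */
#define LRU_REFS_WIDTH_SKETCH 3u
#define LRU_REFS_MASK_SKETCH \
	(((1UL << LRU_REFS_WIDTH_SKETCH) - 1) << LRU_REFS_PGOFF_SKETCH)

#define LRU_REFS_REFERENCED 1u /* from the tiering chart */
#define LRU_REFS_WORKINGSET 2u /* from the tiering chart */

/* Stand-in for struct folio; only the flags word is modelled. */
struct folio_sketch {
	unsigned long flags;
};

/* Extract the 3-bit access count from the flags word. */
static inline unsigned int refs_get_sketch(const struct folio_sketch *f)
{
	return (unsigned int)((f->flags & LRU_REFS_MASK_SKETCH) >>
			      LRU_REFS_PGOFF_SKETCH);
}

/* Store a new access count, leaving all other flag bits untouched. */
static inline void refs_set_sketch(struct folio_sketch *f, unsigned int refs)
{
	assert(refs < (1u << LRU_REFS_WIDTH_SKETCH));
	f->flags = (f->flags & ~LRU_REFS_MASK_SKETCH) |
		   ((unsigned long)refs << LRU_REFS_PGOFF_SKETCH);
}

static inline bool is_referenced_sketch(const struct folio_sketch *f)
{
	return refs_get_sketch(f) >= LRU_REFS_REFERENCED;
}

static inline bool is_workingset_sketch(const struct folio_sketch *f)
{
	return refs_get_sketch(f) >= LRU_REFS_WORKINGSET;
}

/* Compatibility view: test only the lowest counter bit (the old R bit). */
static inline bool is_referenced_by_bit_sketch(const struct folio_sketch *f)
{
	return (f->flags & (1UL << LRU_REFS_PGOFF_SKETCH)) != 0;
}

/* Compatibility view: set only the middle counter bit (the old W bit). */
static inline void mark_workingset_by_bit_sketch(struct folio_sketch *f)
{
	f->flags |= 1UL << (LRU_REFS_PGOFF_SKETCH + 1);
}
```

In this model, marking workingset "by bit" on a fresh folio yields a count of
2, so the counter-based workingset and referenced checks both pass, while the
bit-based referenced check does not. That difference between the two views
seems worth keeping in mind for existing users of these flags.
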
> 4. MGLRU's swappiness is kind of useless in some situations compared to
>    Active / Inactive LRU, since it force-protects the youngest two gens, so
>    quite often we can only reclaim one type of folio. To work around that, the
>    user usually runs force aging before reclaim. So, can we just remove the
>    force protection of the youngest two gens?
>
> 5. Async aging and aging optimization are also required to make the above ideas
>    work better.
>
> 6. Other issues and discussion on whether the above improvements will help
>    solve them or make them worse. e.g.
>
>    For eBPF extension, using eBPF to determine which gen a folio should
>    land in, given the shadow and once we have more than 4 gens, might be very
>    helpful and enough for many workload customizations.
>
>    Can we just ignore the shadow for anon folios? MGLRU basically activates
>    anon folios unconditionally. Especially if combined with the LFU-like
>    idea above, we might only want to track the 3-bit count and get rid of
>    the extra bit usage in the shadow. The eviction performance might be even
>    better, and other components like swap table [3] will have more bits to use
>    for better performance and more features.
>
> The goal is:
>
> - Reduce MGLRU's page flag usage to be identical to or less than that of
>   Active / Inactive LRU.
> - Eliminate regressions.
> - Unify or improve the metrics.
> - Provide more extensibility.

There might be some overlap with this topic proposal:
https://lore.kernel.org/all/cb0c0a0bfc7247cf85858eecf0db6eca@honor.com/
but either way I'm interested in participating, especially on the
topics of regressions and reclaim behavior as it's very relevant for
Android.

>
> Link: https://lwn.net/Articles/945266/ [1]
> Link: https://github.com/ryncsn/linux/tree/improving-mglru [2]
> Link: https://lore.kernel.org/linux-mm/20260218-swap-table-p3-v3-5-f4e34be021a7@tencent.com/ [3]
>
