[LSF/MM/BPF TOPIC] Improving MGLRU

linux-mm.kvack.org archive mirror

* [LSF/MM/BPF TOPIC] Improving MGLRU
@ 2026-02-19 17:25 Kairui Song
  2026-02-20 18:24 ` Johannes Weiner
                   ` (5 more replies)
  0 siblings, 6 replies; 26+ messages in thread
From: Kairui Song @ 2026-02-19 17:25 UTC (permalink / raw)
  To: lsf-pc, Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-mm

Hi All,

Apologies, I forgot to add the proper tag in the previous email, so I am
resending this.

MGLRU has been in mainline for years, but we still have two LRU implementations
today. There are many reasons why MGLRU is still not the only one in the kernel.

I've been looking at a few major issues:

1. Page flag usage: MGLRU uses many more flags (3+ more) than Active/Inactive
LRU.
2. Regressions: MGLRU can cause regressions in some workloads, even though in
many others it outperforms Active/Inactive by a lot.
3. Metrics: MGLRU makes some metrics work differently, for example: PSI,
/proc/meminfo.
4. Some reclaim behavior is less controllable.

There are other issues too. I don't think there is a simple solution, but these
problems can definitely be solved. I would like to propose a session to discuss
a few ideas for improving MGLRU, so that perhaps we can finally have only one
LRU in the kernel.

Some parts are just ideas. So far I have a working series [2] that follows the
LFU and metric unification ideas below, solving 2) and 3) above and providing
some very basic infrastructure for 1). I will try to send it as an RFC, for
easier review and merging, once it's stable enough and before LSF/MM/BPF.

So far, I have observed a 30% reduction in total folio refaults in some
workloads, including Tpcc and YCSB, and several critical regressions compared
to Active / Inactive are gone. PG_workingset and PG_referenced are gone as
well, yet things like PSI are more accurate (see below), and the result still
stays bitwise compatible with Active / Inactive LRU. If this goes smoothly, we
might be able to unify the two and have only one LRU.

The following topics and ideas are the key points:

1. Flags usage: this is solvable, and the hard part is mostly implementation
   details. MGLRU uses (at least) 3 extra flags for the gen number, and we
   expect it to need more gen flags to support more than 4 gens. These flags
   can be moved to the tail of the LRU pointers after carefully modifying the
   kernel's conventions for LRU operations. That would allow us to use up to 6
   bits for the gen number and support up to 63 gens. The low bits of both
   pointers can be packed together for CAS on gen numbers. This reduces flag
   usage by 3. Previously, Yu also suggested moving flags like PG_active to
   the LRU pointer tail, which could also be an option.

   struct folio {
       /* ... */
       union {
               struct list_head lru;
   +           struct lru_gen_list_head lru_gen;

   So whenever the folio is on a lruvec, `lru_gen_list_head`, which contains
   the encoded info, is used instead of `lru`. We might be able to move all
   LRU-related flags there.

   Ordinary folio lists are still fine, since `lru` is still there when the
   folio is isolated. But places like folio split will need to check whether
   the folio is on a lruvec or on an ordinary list.

   This part is just an idea so far, but it might let us have up to 63 gens
   upstream and enable the build for every config.
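
   The pointer-tail idea can be sketched in userspace C. This is only a
   model under stated assumptions, not the layout from the series: the names
   `model_lru_gen_list_head`, `model_set()`, `model_gen()` and the split of
   3 tag bits per pointer are hypothetical, and nodes are assumed to be
   8-byte aligned so that the low 3 bits of each pointer are free. The CAS
   on the packed gen number is not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical split: 3 tag bits per pointer, 6 bits of gen in total. */
#define MODEL_TAG_BITS 3u
#define MODEL_TAG_MASK ((uintptr_t)((1u << MODEL_TAG_BITS) - 1))

/* Stand-in for the proposed lru_gen_list_head; 8-byte alignment is assumed. */
struct model_lru_gen_list_head {
	_Alignas(8) uintptr_t next;	/* pointer | low 3 bits of gen */
	uintptr_t prev;			/* pointer | high 3 bits of gen */
};

/* Store both neighbours and a 6-bit gen number in the two words. */
static void model_set(struct model_lru_gen_list_head *h,
		      void *next, void *prev, unsigned int gen)
{
	assert(((uintptr_t)next & MODEL_TAG_MASK) == 0);
	assert(((uintptr_t)prev & MODEL_TAG_MASK) == 0);
	assert(gen < (1u << (2 * MODEL_TAG_BITS)));

	h->next = (uintptr_t)next | (gen & MODEL_TAG_MASK);
	h->prev = (uintptr_t)prev | ((gen >> MODEL_TAG_BITS) & MODEL_TAG_MASK);
}

/* Strip the tag bits to recover the real pointers. */
static void *model_next(const struct model_lru_gen_list_head *h)
{
	return (void *)(h->next & ~MODEL_TAG_MASK);
}

static void *model_prev(const struct model_lru_gen_list_head *h)
{
	return (void *)(h->prev & ~MODEL_TAG_MASK);
}

/* Reassemble the gen number from the low bits of both pointers. */
static unsigned int model_gen(const struct model_lru_gen_list_head *h)
{
	return (unsigned int)(((h->prev & MODEL_TAG_MASK) << MODEL_TAG_BITS) |
			      (h->next & MODEL_TAG_MASK));
}
```

   Any code that walks such a list has to mask the tag bits before
   dereferencing, which is why the kernel's conventions for LRU operations
   would need to change.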

2. Regressions: Currently, regressions are the bigger problem for us.
   From our perspective, almost all regressions are caused by an under- or
   overprotected file cache. MGLRU's PID protection is either too aggressive,
   too passive, or simply has too long a latency. To fix that, I'd propose an
   LFU-like design and relaxing the PID's aggressiveness, to make protection
   much more proactive and effective for file folios. The idea is to always
   use 3 bits in the page flags to count how many times a folio has been
   referenced (which would also replace PG_workingset and PG_referenced).
   Initial tests showed a 30% reduction in refaults, and many regressions are
   gone. A flow chart of how the idea might work:

   ========== MGLFU Tiering ==========
   Access  3 bit    lru_gen  lru_gen |(R - PG_referenced | W - PG_workingset)
   Count   L|W|R    refs     tier    |(L - LRU_GEN_REFS)
   0       0|0|0    0        0       | - Readahead & Cache
   1       0|0|1    1        0       | - LRU_REFS_REFERENCED
   ----- WORKINGSET / PROMOTE --- <--+ - <move out of min_seq>
   2       0|1|0    2        0       | - LRU_REFS_WORKINGSET
   3       0|1|1    3        1       | - Frequently used
   4       1|0|0*   4        2       |
   5       1|0|1*   5        2       |
   6       1|1|0*   6        3       |
   7       1|1|1*   7        3       | - LRU_REFS_MAX
   ---------- PROMOTION ----------> --+ - <promote to next gen>

   Once a folio's access count exceeds LRU_REFS_WORKINGSET, it never drops
   below that. Folios that hit LRU_REFS_MAX are promoted to the next gen on
   access, and the forced protection of folios on eviction is removed. This
   provides more proactive protection.
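
   The counting and promotion rule can be modeled in a few lines of
   userspace C. This is a sketch under assumptions, not code from the
   series: `struct model_folio`, `model_tier()` and `model_access()` are
   hypothetical names, the tier lookup is transcribed from the table above,
   and leaving the counter untouched after a promotion is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Thresholds as named in the table above. */
enum {
	LRU_REFS_REFERENCED = 1,
	LRU_REFS_WORKINGSET = 2,
	LRU_REFS_MAX = 7,
};

/* Tier for each access count 0..7, transcribed from the table. */
static const unsigned char model_tier_of_refs[LRU_REFS_MAX + 1] = {
	0, 0, 0, 1, 2, 2, 3, 3,
};

/* Hypothetical stand-in for the per-folio state that matters here. */
struct model_folio {
	unsigned int refs;	/* 3-bit saturating access count */
	unsigned long gen;	/* generation the folio sits in */
};

static unsigned int model_tier(const struct model_folio *f)
{
	assert(f->refs <= LRU_REFS_MAX);
	return model_tier_of_refs[f->refs];
}

/*
 * Record one access. The counter saturates at LRU_REFS_MAX; an access
 * to an already saturated folio promotes it to the next gen instead.
 * Returns true when a promotion happened. Keeping the counter at its
 * maximum after promotion is an assumption of this sketch.
 */
static bool model_access(struct model_folio *f)
{
	if (f->refs < LRU_REFS_MAX) {
		f->refs++;
		return false;
	}
	f->gen++;
	return true;
}
```

   With this model a folio needs seven accesses to saturate the counter,
   and the eighth access is the first one that moves it to a younger gen.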

   And this might also give other frameworks like DAMON a nicer interface to
   interact with MGLRU, since the referenced count can promote every folio and
   count accesses in a more reasonable and unified way for MGLRU now.

   NOTE: Still changing this design according to test results, e.g. maybe
   we should optionally still use 4 bits, so the final solution might not
   be the same.

   Another potential improvement on the regression issue is implementing the
   refault distance as I once proposed [1], which can have a huge gain for some
   workloads with heavy file folio usage. Maybe we can have both.

3. Metrics: The key here is the meaning of page flags, including
   PG_workingset and PG_referenced. These two flags are set and cleared very
   differently for MGLRU than for Active / Inactive LRU, but many other
   components still treat them as Active / Inactive LRU metrics. Hence, I
   would propose a different mechanism to unify and replace these two flags:
   use the 3 bits in the page flags field reserved for the LFU-like tracking
   above to determine the folio status.

   Then, following the LFU-like idea above, helpers like these can tell whether
   a folio is referenced or belongs to the working set:

   static inline bool folio_is_referenced(const struct folio *folio)
   {
           return folio_lru_refs(folio) >= LRU_REFS_REFERENCED;
   }

   static inline bool folio_is_workingset(const struct folio *folio)
   {
           return folio_lru_refs(folio) >= LRU_REFS_WORKINGSET;
   }

   /* For compatibility */
   static inline bool folio_is_referenced_by_bit(struct folio *folio)
   {
           return !!(READ_ONCE(*folio_flags(folio, 0)) & BIT(LRU_REFS_PGOFF));
   }

   /* For compatibility */
   static inline void folio_mark_workingset_by_bit(struct folio *folio)
   {
           set_mask_bits(folio_flags(folio, 0), BIT(LRU_REFS_PGOFF + 1),
                         BIT(LRU_REFS_PGOFF + 1));
   }

   The definition of workingset will be simplified as follows: the set of
   folios referenced more than twice for MGLRU, decoupled from MGLRU's
   tiering.
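
   To see how the count-based and bit-based helpers relate, here is a small
   userspace model that works on a plain flags word. It is a sketch only:
   the `model_*` names are hypothetical, and the field offset
   `MODEL_REFS_PGOFF` is an arbitrary value chosen for illustration, not the
   real LRU_REFS_PGOFF.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_REFS_PGOFF 8u	/* assumed offset, for illustration only */
#define MODEL_REFS_WIDTH 3u
#define MODEL_REFS_MASK \
	(((1ul << MODEL_REFS_WIDTH) - 1) << MODEL_REFS_PGOFF)

enum { MODEL_REFS_REFERENCED = 1, MODEL_REFS_WORKINGSET = 2 };

/* Extract the 3-bit access count from the flags word. */
static unsigned int model_refs(unsigned long flags)
{
	return (unsigned int)((flags & MODEL_REFS_MASK) >> MODEL_REFS_PGOFF);
}

/* Return a copy of flags with the access count replaced. */
static unsigned long model_set_refs(unsigned long flags, unsigned int refs)
{
	assert(refs < (1u << MODEL_REFS_WIDTH));
	return (flags & ~MODEL_REFS_MASK) |
	       ((unsigned long)refs << MODEL_REFS_PGOFF);
}

/* Count-based checks, mirroring the helpers quoted above. */
static bool model_is_referenced(unsigned long flags)
{
	return model_refs(flags) >= MODEL_REFS_REFERENCED;
}

static bool model_is_workingset(unsigned long flags)
{
	return model_refs(flags) >= MODEL_REFS_WORKINGSET;
}

/* Bit-based compatibility checks: R is bit 0 and W is bit 1 of the field. */
static bool model_is_referenced_by_bit(unsigned long flags)
{
	return !!(flags & (1ul << MODEL_REFS_PGOFF));
}

static unsigned long model_mark_workingset_by_bit(unsigned long flags)
{
	return flags | (1ul << (MODEL_REFS_PGOFF + 1));
}
```

   Note that the two views can disagree: with a count of 2 (L|W|R = 0|1|0)
   the count-based check reports the folio as referenced while the R bit is
   clear, and setting the W bit on a folio with a count of 1 raises the
   count to 3.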

4. MGLRU's swappiness is kind of useless in some situations compared to
   Active / Inactive LRU, since MGLRU force-protects the youngest two gens, so
   quite often we can only reclaim one type of folio. To work around that,
   users usually run forced aging before reclaim. So, can we just remove the
   forced protection of the youngest two gens?

5. Async aging and aging optimization are also required to make the above ideas
   work better.

6. Other issues, and discussion of whether the above improvements will help
   solve them or make them worse. For example:

   For eBPF extension: once we have more than 4 gens, using eBPF to determine
   which gen a folio should land in, given its shadow, might be very helpful
   and sufficient for many workload customizations.

   Can we just ignore the shadow for anon folios? MGLRU basically activates
   anon folios unconditionally. Especially if combined with the LFU-like idea
   above, we might only want to track the 3-bit count and get rid of the extra
   bit usage in the shadow. The eviction performance might be even better, and
   other components like the swap table [3] would have more bits to use for
   better performance and more features.

The goals are:

- Reduce MGLRU's page flag usage to be identical to or less than that of
  Active / Inactive LRU.
- Eliminate regressions.
- Unify or improve the metrics.
- Provide more extensibility.

Link: https://lwn.net/Articles/945266/ [1]
Link: https://github.com/ryncsn/linux/tree/improving-mglru [2]
Link: https://lore.kernel.org/linux-mm/20260218-swap-table-p3-v3-5-f4e34be021a7@tencent.com/ [3]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Improving MGLRU
  2026-02-19 17:25 [LSF/MM/BPF TOPIC] Improving MGLRU Kairui Song
@ 2026-02-20 18:24 ` Johannes Weiner
  2026-02-21  6:03   ` Kairui Song
  2026-02-26  1:55 ` Kalesh Singh
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Johannes Weiner @ 2026-02-20 18:24 UTC (permalink / raw)
  To: Kairui Song; +Cc: lsf-pc, Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-mm

On Fri, Feb 20, 2026 at 01:25:33AM +0800, Kairui Song wrote:
> Hi All,
> 
> Apologies I forgot to add the proper tag in the previous email so
> resending this.
> 
> MGLRU has been introduced in the mainline for years, but we still have two LRUs
> today. There are many reasons MGLRU is still not the only LRU implementation in
> the kernel.
> 
> And I've been looking at a few major issues here:
> 
> 1. Page flag usage: MGLRU uses many more flags (3+ more) than Active/Inactive
> LRU.
> 2. Regressions: MGLRU might cause regression, even though in many workloads it
> outperforms Active/Inactive by a lot.
> 3. Metrics: MGLRU makes some metrics work differently, for example: PSI,
> /proc/meminfo.
> 4. Some reclaim behavior is less controllable.

I would be very interested in discussing this topic as well.

> 2. Regressions: Currently regression is a more major problem for us.
>    From our perspective, almost all regressions are caused by an under- or
>    overprotected file cache. MGLRU's PID protection either gets too aggressive
>    or too passive or just have a too long latency. To fix that, I'd propose a
>    LFU-like design and relax the PID's aggressiveness to make it much more
>    proactive and effective for file folios. The idea is always use 3 bits in
>    the page flags to count the referenced time (which would also replace
>    PG_workingset and PG_referenced). Initial tests showed a 30% reduction of
>    refaults, and many regressions are gone. A flow chart of how the MGLRU idea
>    might work:

Are you referring to refaults on the page cache side, or swapins?

Last time we evaluated MGLRU on Meta workloads, we noticed that it
tends to do better with zswap, but worse with disk swap. It seemed to
just prefer reclaiming anon, period.

For the balancing between anon and file to work well in all
situations, it needs to have a notion of backend speed and factor in
the respective cost of misses on each side.

> 4. MGLRU's swappiness is kind of useless in some situations compared to
>    Active / Inactive LRU, since its force protects the youngest two gen, so
>    quite often we can only reclaim one type of folios. To workaround that, the
>    user usually runs force aging before reclaim. So, can we just remove the
>    force protection of the youngest two gens?

[...]

> 6. Other issues and discussion on whether the above improvements will help
>    solve them or make them worse. e.g.

[...]

>    Can we just ignore the shadow for anon folios? MGLRU basically activates
>    anon folios unconditionally, especially if we combined with the LFU like
>    idea above we might only want to track the 3 bit count, and get rid of
>    the extra bit usage in the shadow. The eviction performance might be even
>    better, and other components like swap table [3] will have more bits to use
>    for better performance and more features.

On the face of it, both of these sound problematic to me. Why are
anon pages special-cased?

The cost of reclaiming a page is:

    reuse frequency * cost of a miss

The *type* of the page is not all that meaningful for workload
performance. The wait time is qualitatively the same.

If you assume every refaulting anon is hot, it'll fall apart when the
anon set is huge and has little locality.
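
The cost formula above can be written down as a tiny C model. This is only
an illustration: `struct model_page_class` and the helper names are
hypothetical, and any numbers fed into it are made up for the example, not
measurements.

```c
#include <assert.h>

/* Illustrative description of a class of pages competing for memory. */
struct model_page_class {
	double reuses_per_sec;	/* how often an evicted page is needed again */
	double miss_cost_us;	/* time to bring it back from its backend */
};

/* cost of reclaiming = reuse frequency * cost of a miss */
static double model_reclaim_cost(const struct model_page_class *c)
{
	assert(c->reuses_per_sec >= 0.0 && c->miss_cost_us >= 0.0);
	return c->reuses_per_sec * c->miss_cost_us;
}

/* Nonzero when evicting from class a is expected to hurt less than b. */
static int model_cheaper_to_reclaim(const struct model_page_class *a,
				    const struct model_page_class *b)
{
	return model_reclaim_cost(a) < model_reclaim_cost(b);
}
```

With, say, anon pages reused 10 times per second behind a 20 us backend and
file pages reused 2 times per second behind a 5000 us backend, the anon
class costs 200 and the file class 10000, so anon is the cheaper one to
reclaim even though it is hotter. The page type never enters the
calculation, only reuse frequency and backend speed do.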



* Re: [LSF/MM/BPF TOPIC] Improving MGLRU
  2026-02-20 18:24 ` Johannes Weiner
@ 2026-02-21  6:03   ` Kairui Song
  0 siblings, 0 replies; 26+ messages in thread
From: Kairui Song @ 2026-02-21  6:03 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: lsf-pc, Chen Ridong, Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-mm

On Sat, Feb 21, 2026 at 2:24 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Feb 20, 2026 at 01:25:33AM +0800, Kairui Song wrote:
> > Hi All,
> >
> > Apologies I forgot to add the proper tag in the previous email so
> > resending this.
> >
> > MGLRU has been introduced in the mainline for years, but we still have two LRUs
> > today. There are many reasons MGLRU is still not the only LRU implementation in
> > the kernel.
> >
> > And I've been looking at a few major issues here:
> >
> > 1. Page flag usage: MGLRU uses many more flags (3+ more) than Active/Inactive
> > LRU.
> > 2. Regressions: MGLRU might cause regression, even though in many workloads it
> > outperforms Active/Inactive by a lot.
> > 3. Metrics: MGLRU makes some metrics work differently, for example: PSI,
> > /proc/meminfo.
> > 4. Some reclaim behavior is less controllable.
>
> I would be very interested in discussing this topic as well.

Thanks, glad to hear that!

>
> > 2. Regressions: Currently regression is a more major problem for us.
> >    From our perspective, almost all regressions are caused by an under- or
> >    overprotected file cache. MGLRU's PID protection either gets too aggressive
> >    or too passive or just have a too long latency. To fix that, I'd propose a
> >    LFU-like design and relax the PID's aggressiveness to make it much more
> >    proactive and effective for file folios. The idea is always use 3 bits in
> >    the page flags to count the referenced time (which would also replace
> >    PG_workingset and PG_referenced). Initial tests showed a 30% reduction of
> >    refaults, and many regressions are gone. A flow chart of how the MGLRU idea
> >    might work:
>
> Are you referring to refaults on the page cache side, or swapins?
>
> Last time we evaluated MGLRU on Meta workloads, we noticed that it
> tends to do better with zswap, but worse with disk swap. It seemed to
> just prefer reclaiming anon, period.
>
> For the balancing between anon and file to work well in all
> situations, it needs to have a notion of backend speed and factor in
> the respective cost of misses on each side.

A bit more than that. When there is no swap, MGLRU still performs
worse in some workloads like MongoDB. From what I've noticed, that's
because the PID protection is a bit too passive, while the forced
protection in sort_folio sometimes seems too aggressive.
Active/Inactive LRU will just move a folio to the head if it's accessed
twice while in RAM, but MGLRU won't do so. As a result, hotter file
folios are evicted just like colder ones until the PID gets
triggered, or a folio still gets protected even if it hasn't been used
for a while. And by the time the PID finally gets triggered, the workload
might have changed. This is fixable using the approach I mentioned, though,
and after that MGLRU seems to be better than Active/Inactive in all our
known cases. Whether that is a good fix is worth discussing.

I also notice Ridong has a series to apply a "heat" based reclaim,
which also looks interesting.

> >    Can we just ignore the shadow for anon folios? MGLRU basically activates
> >    anon folios unconditionally, especially if we combined with the LFU like
> >    idea above we might only want to track the 3 bit count, and get rid of
> >    the extra bit usage in the shadow. The eviction performance might be even
> >    better, and other components like swap table [3] will have more bits to use
> >    for better performance and more features.
>
> On the face of it, both of these sounds problematic to me. Why are
> anon pages special cased?
>
> The cost of reclaiming a page is:
>
>     reuse frequency * cost of a miss
>
> The *type* of the page is not all that meaningful for workload
> performance. The wait time is qualitatively the same.
>
> If you assume every refaulting anon is hot, it'll fall apart when the
> anon set is huge and has little locality.

Sorry I didn't make it clear. MGLRU currently already ignores
the shadow distance for re-activation. And yeah, basically all anon
folios are activated on fault, which turns out to be quite nice? No MGLRU
users have considered that a problem, and in fact the performance looks
good.

Of course we can restore the old behavior and test the folio
against some distance (gen distance or eviction distance). Or we can push
further and keep only the reference bits (not ignoring the shadow
completely, just keeping only the reference bits, if LFU + PID still
works well without the distance), and gain more performance and more bits
to use.

BTW I tried to restore the refault distance behavior for both anon and
file folios some time ago:
https://lwn.net/Articles/945266/

For file folios it indeed looked better, while anon folios seemed unchanged.
But later tests showed that it doesn't apply to all cases, and I think
something better can be used, as suggested in this topic.


* Re: [LSF/MM/BPF TOPIC] Improving MGLRU
  2026-02-19 17:25 [LSF/MM/BPF TOPIC] Improving MGLRU Kairui Song
  2026-02-20 18:24 ` Johannes Weiner
@ 2026-02-26  1:55 ` Kalesh Singh
  2026-02-26  3:06   ` Kairui Song
  2026-02-26 15:54 ` Matthew Wilcox
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Kalesh Singh @ 2026-02-26  1:55 UTC (permalink / raw)
  To: Kairui Song
  Cc: lsf-pc, Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-mm,
	android-mm, Suren Baghdasaryan, T.J. Mercier

On Thu, Feb 19, 2026 at 9:26 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> Hi All,
>
> Apologies I forgot to add the proper tag in the previous email so
> resending this.
>
> MGLRU has been introduced in the mainline for years, but we still have two LRUs
> today. There are many reasons MGLRU is still not the only LRU implementation in
> the kernel.
>
> And I've been looking at a few major issues here:
>
> 1. Page flag usage: MGLRU uses many more flags (3+ more) than Active/Inactive
> LRU.
> 2. Regressions: MGLRU might cause regression, even though in many workloads it
> outperforms Active/Inactive by a lot.
> 3. Metrics: MGLRU makes some metrics work differently, for example: PSI,
> /proc/meminfo.
> 4. Some reclaim behavior is less controllable.
>
> And other issues too.
> And I think there isn't a simple solution, but it can definitely be solved. I
> would like to propose a session to discuss a few ideas on how to solve this, and
> perhaps we can finally only have one LRU in the kernel. So I'd like topropose a
> session to discuss some ideas about improving MGLRU and making it the only LRU.
>
> Some parts are just ideas, so far I have a working series [2] following the
> LFU and metric unification idea below, solving 2) and 3) above, and
> providing some very basic infrastructures for 1). Would try to send that as
> RFC for easier review and merge once it's stable enough soon, before LSF/MM/BPF.
>
> So far, I already observed a 30% reduction of refault of total folios in
> some workloads, including Tpcc and YCSB, and several critical regressions
> compared to Active / Inactive are gone, PG_workingset and PG_referenced are
> gone, yet things like PSI are more accurate (see below), and still stay
> bitwise compatible with Active / Inactive LRU. If it went smoothly,
> we might be able to unify and have only one LRU.
>
> Following topic and ideas are the key points:
>
> 1. Flags usage: which is solvable, and the hard part is mostly about
>    implementation details: MGLRU uses (at least) 3 extra flags for the gen
>    number, and we are expecting it to use more gen flags to support more than 4
>    gen. These flags can be moved to the tail of the LRU pointer after carefully
>    modifying the kernel's convention on LRU operations. That would allow us to
>    use up to 6 bits for the gen number and support up to 63 gens. The lower bit
>    of both pointers can be packed together for CAS on gen numbers. Reducing
>    flag usage by 3. Previously, Yu also suggested moving flags like PG_active to
>    the LRU pointer tail, which could also be a way.
>
>    struct folio {
>        /* ... */
>        union {
>                struct list_head lru;
>    +           struct lru_gen_list_head lru_gen;
>
>    So whenever the folio is on lruvec, `lru_gen_list_head` is used instead of
>    `lru`, which contains encoded info. We might be able to move all LRU-related
>    flags there.
>
>    Ordinary folio lists are still just fine, since when the folio is isolated,
>    `lru` is still there. But places like folio split, will need to check if
>    that's a lruvec folio, or folio on an ordinary list.
>
>    This part is just an idea yet. But might make us able to have up to 63 gens
>    in upstream and enable build for every config.
>
> 2. Regressions: Currently regression is a more major problem for us.
>    From our perspective, almost all regressions are caused by an under- or
>    overprotected file cache. MGLRU's PID protection either gets too aggressive
>    or too passive or just have a too long latency. To fix that, I'd propose a
>    LFU-like design and relax the PID's aggressiveness to make it much more
>    proactive and effective for file folios. The idea is always use 3 bits in
>    the page flags to count the referenced time (which would also replace
>    PG_workingset and PG_referenced). Initial tests showed a 30% reduction of
>    refaults, and many regressions are gone. A flow chart of how the MGLRU idea
>    might work:
>
>    ========== MGLFU Tiering ==========
>    Access  3 bit    lru_gen  lru_gen |(R - PG_referenced | W - PG_workingset)
>    Count   L|W|R    refs     tier    |(L - LRU_GEN_REFS)
>    0       0|0|0    0        0       | - Readahead & Cache
>    1       0|0|1    1        0       | - LRU_REFS_REFERENCED
>    ----- WORKINGSET / PROMOTE --- <--+ - <move out of min_seq>
>    2       0|1|0    2        0       | - LRU_REFS_WORKINGSET
>    3       0|1|1    3        1       | - Frequently used
>    4       1|0|0*   4        2       |
>    5       1|0|1*   5        2       |
>    6       1|1|0*   6        3       |
>    7       1|1|1*   7        3       | - LRU_REFS_MAX
>    ---------- PROMOTION ----------> --+ - <promote to next gen>
>
>    Once a folio has an access count > LRU_REFS_WORKINGSET, it never goes lower
>    than that. Folios that hit LRU_REFS_MAX will be promoted to next gen on
>    access, and remove the force protection of folios on eviction. This provides
>    a more proactive protection.
>
>    And this might also give other frameworks like DAMON a nicer interface to
>    interact with MGLRU, since the referenced count can promote every folio and
>    count accesses in a more reasonable and unified way for MGLRU now.
>
>    NOTE: Still changing this design according to test results, e.g. maybe
>    we should optionally still use 4 bits, so the final solution might not
>    be the same.
>
>    Another potential improvement on the regression issue is implementing the
>    refault distance as I once proposed [1], which can have a huge gain for some
>    workloads with heavy file folio usage. Maybe we can have both.
>
> 3. Metrics: The key here is about the meaning of page flags, including
>    PG_workingset and PG_referenced. These two flags are set/cleared very
>    differently for MGLRU compared to Active / Inactive LRU, but many other
>    components are still using them as metrics for Active / Inactive LRU. Hence,
>    I would propose to introduce a different mechanism to unify and replace these
>    two flags: Using the 3 bits in the page flags field reserved for LFU-like
>    tracking above, to determine the folio status.
>
>    Then following the above LFU-like idea, and using helpers like:
>
>    static inline bool folio_is_referenced(const struct folio *folio)
>    {
>     return folio_lru_refs(folio) >= LRU_REFS_REFERENCED;
>    }
>
>    static inline bool folio_is_workingset(const struct folio *folio)
>    {
>     return folio_lru_refs(folio) >= LRU_REFS_WORKINGSET;
>    }
>
>    static inline bool folio_is_referenced_by_bit(struct folio *folio)
>    {    /* For compatibility */
>     return !!(READ_ONCE(*folio_flags(folio, 0)) & BIT(LRU_REFS_PGOFF));
>    }
>
>    static inline void folio_mark_workingset_by_bit(struct folio *folio)
>    {    /* For compatibility */
>     set_mask_bits(folio_flags(folio, 0), BIT(LRU_REFS_PGOFF + 1),
>                   BIT(LRU_REFS_PGOFF + 1));
>    }
>
>    To tell if a folio belongs to a working set or is referenced. The definition
>    of workingset will be simplified as follows: a set referenced more than twice
>    for MGLRU, and decoupled from MGLRU's tiering.
>
> 4. MGLRU's swappiness is kind of useless in some situations compared to
>    Active / Inactive LRU, since its force protects the youngest two gen, so
>    quite often we can only reclaim one type of folios. To workaround that, the
>    user usually runs force aging before reclaim. So, can we just remove the
>    force protection of the youngest two gens?
>
> 5. Async aging and aging optimization are also required to make the above ideas
>    work better.
>
> 6. Other issues and discussion on whether the above improvements will help
>    solve them or make them worse. e.g.
>
>    For eBPF extension, using eBPF to determine which gen a folio should be
>    landed given the shadow and after we have more than 4 gens, might be very
>    helpful and enough for many workload customizations.
>
>    Can we just ignore the shadow for anon folios? MGLRU basically activates
>    anon folios unconditionally, especially if we combined with the LFU like
>    idea above we might only want to track the 3 bit count, and get rid of
>    the extra bit usage in the shadow. The eviction performance might be even
>    better, and other components like swap table [3] will have more bits to use
>    for better performance and more features.
>
> The goal is:
>
> - Reduce MGLRU's page flag usage to be identical or less compared to Active /
>   Inactive LRU.
> - Eliminate regressions.
> - Unify or improve the metrics.
> - Provides more extensibility.

Hi Kairui,

I would be very interested in joining this discussion at LSF/MM.

We use MGLRU on Android. While the reduced CPU usage leads to power
improvements for mobile devices, we've run into a few notable issues
as well.

Off the top of my head:

1. Direct Reclaim Latency: We've observed that direct reclaim tail
latencies can sometimes be significantly higher with MGLRU.

2. PSI and OOM Response: Tying directly into your point about metrics,
the PSI memory pressure generated by MGLRU is consistently 30% to 40%
lower than with the Active/Inactive LRU on Android workloads. Because
user-space OOM daemons like lmkd rely heavily on these metrics, this
makes them slower to react to actual memory pressure.

3. Misleading Conventional LRU Metrics: We've noticed patterns in
standard memory tracking where nr_active and nr_inactive show sharp
vertical cliffs and rises. Since MGLRU derives these metrics by
mapping the two youngest generations to "active" and the two oldest to
"inactive," every time a new generation is created (incrementing the
seq counter), the second youngest generation (before the increment) is
suddenly recategorized as inactive (after the increment). Because the
newly created generation is empty, this manifests as a massive,
instantaneous drop in active pages and a corresponding spike in
inactive pages.
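
The cliff in point 3 can be reproduced with a small userspace model. This is
a sketch of the mapping described above, not kernel code: `struct
model_lruvec` and the helpers are hypothetical names, and the page counts in
the usage note are made up.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_GENS 8

/* Per-generation page counts; index 0 is the oldest generation. */
struct model_lruvec {
	unsigned long nr_pages[MODEL_MAX_GENS];
	size_t nr_gens;
};

/* First index that counts as "active": the two youngest generations. */
static size_t model_active_start(const struct model_lruvec *v)
{
	return v->nr_gens > 2 ? v->nr_gens - 2 : 0;
}

static unsigned long model_nr_active(const struct model_lruvec *v)
{
	unsigned long sum = 0;

	for (size_t i = model_active_start(v); i < v->nr_gens; i++)
		sum += v->nr_pages[i];
	return sum;
}

/* Everything older than the two youngest generations is "inactive". */
static unsigned long model_nr_inactive(const struct model_lruvec *v)
{
	unsigned long sum = 0;

	for (size_t i = 0; i < model_active_start(v); i++)
		sum += v->nr_pages[i];
	return sum;
}

/* Create a new, empty youngest generation (the seq counter increment). */
static void model_new_gen(struct model_lruvec *v)
{
	assert(v->nr_gens < MODEL_MAX_GENS);
	v->nr_pages[v->nr_gens++] = 0;
}
```

Starting from three generations holding 100, 300 and 500 pages (oldest
first), the model reports 800 active and 100 inactive pages. After
`model_new_gen()` the same pages are reported as 500 active and 400
inactive, although no page was accessed or reclaimed in between.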

I'd love to participate and discuss how we might tackle these
regressions and metrics.

Thanks,
Kalesh

>
> Link: https://lwn.net/Articles/945266/ [1]
> Link: https://github.com/ryncsn/linux/tree/improving-mglru [2]
> Link: https://lore.kernel.org/linux-mm/20260218-swap-table-p3-v3-5-f4e34be021a7@tencent.com/ [3]
>



* Re: [LSF/MM/BPF TOPIC] Improving MGLRU
  2026-02-26  1:55 ` Kalesh Singh
@ 2026-02-26  3:06   ` Kairui Song
  2026-02-26 10:10     ` wangzicheng
  0 siblings, 1 reply; 26+ messages in thread
From: Kairui Song @ 2026-02-26  3:06 UTC (permalink / raw)
  To: Kalesh Singh, wangzicheng
  Cc: lsf-pc, Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-mm,
	android-mm, Suren Baghdasaryan, T.J. Mercier, Barry Song

On Thu, Feb 26, 2026 at 9:55 AM Kalesh Singh <kaleshsingh@google.com> wrote:
>
> On Thu, Feb 19, 2026 at 9:26 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > Hi All,
> >
> > Apologies I forgot to add the proper tag in the previous email so
> > resending this.
> >
> > MGLRU has been introduced in the mainline for years, but we still have two LRUs
> > today. There are many reasons MGLRU is still not the only LRU implementation in
> > the kernel.
> Hi Kairui,
>
> I would be very interested in joining this discussion at LSF/MM.
>
> We use MGLRU on Android. While the reduced CPU usage leads to power
> improvements for mobile devices, we've run into a few notable issues
> as well.

Hi Kalesh,

Glad to discuss this with you.

>
> Off the top of my head:
>
> 1. Direct Reclaim Latency: We've observed that direct reclaim tail
> latencies can sometimes be significantly higher with MGLRU.
>
> 2. PSI and OOM Response: Tying directly into your point about metrics,
> the PSI memory pressure generated by MGLRU is consistently 30% to 40%
> lower than the Active/Inactive LRU on Android workloads. Because
> user-space OOM daemons like lmkd rely heavily on these metrics, this
> causes them to be less quick to react to actual memory pressure.

Yes, this is one of the main issues for us too. Per our observation,
one cause is that MGLRU's usage of flags like PG_workingset is
different from active / inactive LRU. Flags like PG_workingset are
also bound to tiering now, so changing that requires some
redesign of how tiering works too. This is one of the motivations
behind the LFU-like tiering design I mentioned. That should make
things like PSI or readahead stable again.

> 3. Misleading Conventional LRU Metrics: We've noticed patterns in
> standard memory tracking where nr_active and nr_inactive show sharp
> vertical cliffs and rises. Since MGLRU derives these metrics by
> mapping the two youngest generations to "active" and the two oldest to
> "inactive," every time a new generation is created (incrementing the
> seq counter), the second youngest generation (before the increment) is
> suddenly recategorized as inactive (after the increment). Because the
> newly created generation is empty, this manifests as a massive,
> instantaneous drop in active pages and a corresponding spike in
> inactive pages.

That's also a major problem for things like K8s. The cliffs and rises
confuse the cloud scheduler. Our solution is also based on the new
tiering design: counting the number of folios in different tiers
instead of gens will greatly improve the usability of nr_active /
nr_inactive. Whether this is a good design can be discussed.

>
> I'd love to participate and discuss how we might tackle these
> regressions and metrics.

Looking forward to that!

I also noticed Zicheng has another proposal. I've previously discussed
some ideas with him too, and hopefully we will make some progress
on this.



* RE: [LSF/MM/BPF TOPIC] Improving MGLRU
  2026-02-26  3:06   ` Kairui Song
@ 2026-02-26 10:10     ` wangzicheng
  0 siblings, 0 replies; 26+ messages in thread
From: wangzicheng @ 2026-02-26 10:10 UTC (permalink / raw)
  To: Kairui Song, Kalesh Singh
  Cc: lsf-pc, Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-mm,
	android-mm, Suren Baghdasaryan, T.J. Mercier, Barry Song,
	wangtao, gao xu, wangxin 00023513



> -----Original Message-----
> From: Kairui Song <ryncsn@gmail.com>
> Sent: Thursday, February 26, 2026 11:07 AM
> To: Kalesh Singh <kaleshsingh@google.com>; wangzicheng
> <wangzicheng@honor.com>
> Cc: lsf-pc@lists.linux-foundation.org; Axel Rasmussen
> <axelrasmussen@google.com>; Yuanchu Xie <yuanchu@google.com>; Wei
> Xu <weixugc@google.com>; linux-mm <linux-mm@kvack.org>; android-mm
> <android-mm@google.com>; Suren Baghdasaryan <surenb@google.com>;
> T.J. Mercier <tjmercier@google.com>; Barry Song <21cnbao@gmail.com>
> Subject: Re: [LSF/MM/BPF TOPIC] Improving MGLRU
> 
> On Thu, Feb 26, 2026 at 9:55 AM Kalesh Singh <kaleshsingh@google.com>
> wrote:
> >
> > On Thu, Feb 19, 2026 at 9:26 AM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > Hi All,
> > >
> > > Apologies I forgot to add the proper tag in the previous email so
> > > resending this.
> > >
> > > MGLRU has been introduced in the mainline for years, but we still have two LRUs
> > > today. There are many reasons MGLRU is still not the only LRU implementation in
> > > the kernel.
> > Hi Kairui,
> >
> > I would be very interested in joining this discussion at LSF/MM.
> >
> > We use MGLRU on Android. While the reduced CPU usage leads to power
> > improvements for mobile devices, we've run into a few notable issues
> > as well.
> 
> Hi Kelash,
> 
> Glad to discuss this with you.
> 
> >
> > Off the top of my head:
> >
> > 1. Direct Reclaim Latency: We've observed that direct reclaim tail
> > latencies can sometimes be significantly higher with MGLRU.
> >
> > 2. PSI and OOM Response: Tying directly into your point about metrics,
> > the PSI memory pressure generated by MGLRU is consistently 30% to 40%
> > lower than the Active/Inactive LRU on Android workloads. Because
> > user-space OOM daemons like lmkd rely heavily on these metrics, this
> > causes them to be less quick to react to actual memory pressure.
> 
> Yes, this is one of the main issues for us too. Per our observation
> one cause for that is MGLRU's usage of flags like PG_workingset is
> different from active / inactive LRU, and flags like the PG_workingset
> flags are bound with tiering now, so changing that requires some
> redesign of how tiering works too. Which is one of the motivations
> behind the LFU like tiering design I mentioned. That should make
> things like PSI or readahead stable again.
> 
> > 3. Misleading Conventional LRU Metrics: We've noticed patterns in
> > standard memory tracking where nr_active and nr_inactive show sharp
> > vertical cliffs and rises. Since MGLRU derives these metrics by
> > mapping the two youngest generations to "active" and the two oldest to
> > "inactive," every time a new generation is created (incrementing the
> > seq counter), the second youngest generation (before the increment) is
> > suddenly recategorized as inactive (after the increment). Because the
> > newly created generation is empty, this manifests as a massive,
> > instantaneous drop in active pages and a corresponding spike in
> > inactive pages.
> 
> That's also a major problem for things like K8s. The cliffs and rises
> confuses the cloud scheduler. Our solution is also based on that new
> tiering design, and counting the number of folios in different tiers
> instead of gens will greatly improve the usability of nr_active /
> nr_inactive. Whether this is a good design can be discussed.
> 
> >
> > I'd love to participate and discuss how we might tackle these
> > regressions and metrics.
> 
> Looking forward to that!
> 
> I also noticed Zicheng has another proposal, I've discussed with him
> too previously about some ideas, hopefully we will make some progress
> on this.

Hi Kairui, hi Kalesh,

Yes, we’re interested in this work.

We see file pages being under-protected in smartphone workloads, and an LFU-like
approach sounds promising to better promote and protect hot file pages.
Kairui has shared the patches; we’ll backport them to our tree and report back
once we have results from our workloads.

Best,
Zicheng



* Re: [LSF/MM/BPF TOPIC] Improving MGLRU
  2026-02-19 17:25 [LSF/MM/BPF TOPIC] Improving MGLRU Kairui Song
  2026-02-20 18:24 ` Johannes Weiner
  2026-02-26  1:55 ` Kalesh Singh
@ 2026-02-26 15:54 ` Matthew Wilcox
  2026-02-27  4:31   ` [LSF/MM/BPF] " Barry Song
                     ` (2 more replies)
  2026-02-27  3:30 ` [LSF/MM/BPF] " Barry Song
                   ` (2 subsequent siblings)
  5 siblings, 3 replies; 26+ messages in thread
From: Matthew Wilcox @ 2026-02-26 15:54 UTC (permalink / raw)
  To: Kairui Song; +Cc: lsf-pc, Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-mm

On Fri, Feb 20, 2026 at 01:25:33AM +0800, Kairui Song wrote:
> MGLRU has been introduced in the mainline for years, but we still have two LRUs
> today. There are many reasons MGLRU is still not the only LRU implementation in
> the kernel.

To my mind, the biggest problem with MGLRU is that Google dumped it on us
and ran away.  Commit 44958000bada claimed that it was now maintained and
added three people as maintainers.  In the six months since that commit,
none of those three people have any commits in mm/!  This is a shameful
state of affairs.

I say rip it out.



* Re: [LSF/MM/BPF] Improving MGLRU
  2026-02-19 17:25 [LSF/MM/BPF TOPIC] Improving MGLRU Kairui Song
                   ` (2 preceding siblings ...)
  2026-02-26 15:54 ` Matthew Wilcox
@ 2026-02-27  3:30 ` Barry Song
  2026-03-02 11:10   ` Kairui Song
  2026-02-27  7:11 ` [LSF/MM/BPF TOPIC] " David Rientjes
  2026-02-27 10:29 ` Vernon Yang
  5 siblings, 1 reply; 26+ messages in thread
From: Barry Song @ 2026-02-27  3:30 UTC (permalink / raw)
  To: ryncsn; +Cc: axelrasmussen, linux-mm, lsf-pc, weixugc, yuanchu

> 4. MGLRU's swappiness is kind of useless in some situations compared to
>   Active / Inactive LRU, since its force protects the youngest two gen, so
>   quite often we can only reclaim one type of folios. To workaround that, the
>   user usually runs force aging before reclaim. So, can we just remove the
>   force protection of the youngest two gens?

I guess not—MGLRU needs at least two generations to function,
similar to active and inactive lists, meaning it requires two lists.

Yu Zhao mentioned this in commit ec1c86b25f4b:
"This protocol, AKA second chance, requires a minimum of two
generations, hence MIN_NR_GENS."

But I do feel the issue is that anon and file folios currently share
the same generations. This may make anon and file be reclaimed more
fairly, but isn’t swappiness meant to allow some imbalance? Sharing
generations causes them to keep catching up with each other. We might
consider providing separate generations for them.

Thanks
Barry



* Re: [LSF/MM/BPF] Improving MGLRU
  2026-02-26 15:54 ` Matthew Wilcox
@ 2026-02-27  4:31   ` Barry Song
  2026-03-02 17:46     ` Gregory Price
  2026-02-27 17:55   ` [LSF/MM/BPF TOPIC] " Shakeel Butt
  2026-03-03  1:30   ` Axel Rasmussen
  2 siblings, 1 reply; 26+ messages in thread
From: Barry Song @ 2026-02-27  4:31 UTC (permalink / raw)
  To: willy; +Cc: axelrasmussen, linux-mm, lsf-pc, ryncsn, weixugc, yuanchu

>> MGLRU has been introduced in the mainline for years, but we still have two LRUs
>> today. There are many reasons MGLRU is still not the only LRU implementation in
>> the kernel.

> To my mind, the biggest problem with MGLRU is that Google dumped it on us
> and ran away.  Commit 44958000bada claimed that it was now maintained and
> added three people as maintainers.  In the six months since that commit,
> none of those three people have any commits in mm/!  This is a shameful
> state of affairs.
> 
> I say rip it out.

Hi Matthew,  
Can we keep it for now? Kairui, Zicheng, and I are working on it.

From what I’ve seen, it performs much better than the active/inactive  
approach after applying a few vendor hooks on Android, such as forced  
aging and avoiding direct activation of read-ahead folios during page  
faults, among others. To be honest, performance was worse than  
active/inactive without those hooks, which are still not in mainline.  

It just needs more work. MGLRU has many strong design aspects, including  
using more generations to differentiate cold from hot, the look-around  
mechanism to reduce scanning overhead by leveraging cache locality,  
and data structure designs that minimize lock holding.

Best regards
Barry


* Re: [LSF/MM/BPF TOPIC] Improving MGLRU
  2026-02-19 17:25 [LSF/MM/BPF TOPIC] Improving MGLRU Kairui Song
                   ` (3 preceding siblings ...)
  2026-02-27  3:30 ` [LSF/MM/BPF] " Barry Song
@ 2026-02-27  7:11 ` David Rientjes
  2026-02-27 10:29 ` Vernon Yang
  5 siblings, 0 replies; 26+ messages in thread
From: David Rientjes @ 2026-02-27  7:11 UTC (permalink / raw)
  To: Kairui Song, Michal Hocko, Shakeel Butt
  Cc: lsf-pc, Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-mm

On Fri, 20 Feb 2026, Kairui Song wrote:

> Hi All,
> 
> Apologies I forgot to add the proper tag in the previous email so
> resending this.
> 
> MGLRU has been introduced in the mainline for years, but we still have two LRUs
> today. There are many reasons MGLRU is still not the only LRU implementation in
> the kernel.
> 
> And I've been looking at a few major issues here:
> 
> 1. Page flag usage: MGLRU uses many more flags (3+ more) than Active/Inactive
> LRU.
> 2. Regressions: MGLRU might cause regression, even though in many workloads it
> outperforms Active/Inactive by a lot.
> 3. Metrics: MGLRU makes some metrics work differently, for example: PSI,
> /proc/meminfo.
> 4. Some reclaim behavior is less controllable.
> 

I think this would be a very useful topic to discuss and I really like how 
this was framed in the context of what needs to be addressed so that MGLRU 
can be on a path to becoming the default implementation and we can 
eliminate two separate implementations.  Yes, MGLRU can form the basis of 
several extensions that are possible, like working set reporting, but its 
existence in the kernel shouldn't be based on future shiny features alone; 
I think priority number one should be ensuring that these issues, as well 
as others, are properly addressed with the goal of having a single unified 
implementation in the kernel that does not regress for end users.

One topic we can add here is oom handling with MGLRU, so adding in Michal 
and Shakeel.  MGLRU has working set protection to avoid thrashing by 
configuring min_ttl_ms in sysfs.  That can end up being very useful, and 
would probably be even more useful if there was a per-memcg version of it, 
but it doesn't work well for NUMA.  That's because we get a new oom kill 
context that is triggered from kswapd threads when aging is done, not by 
direct allocators like we're used to:

	/*
	 * The main goal is to OOM kill if every generation from all memcgs is
	 * younger than min_ttl. However, another possibility is all memcgs are
	 * either too small or below min.
	 */
	if (!reclaimable && mutex_trylock(&oom_lock)) {
		struct oom_control oc = {
			.gfp_mask = sc->gfp_mask,
		};

		out_of_memory(&oc);

		mutex_unlock(&oom_lock);
	}

That obviously just calls into the oom killer without any context about 
*which* node we're trying to free memory on.  The worst case scenario is 
that we oom kill every process on a single node without ever freeing 
memory for kswapd's node.

So I doubt that anybody is using this to actively defend against thrashing 
today, at least on NUMA systems.

One way to address this would be to consider resident memory on the nodes
included in oc->nodemask when making oom kill decisions, and then to
initialize an empty nodemask here, set pgdat->node_id in it, and pass it in.
But it should be part of a larger discussion about how we handle targeted
oom killing on specific NUMA nodes that would be applicable for cpusets,
mempolicies, etc.

For cpusets, for example, we only look at the eligibility of a thread to 
allocate on a node, not the amount of anticipated freeing from that node 
on oom kill.  We could trivially do the same thing here for MGLRU, but it 
would kinda suck to go around oom killing processes that only have a 
single page on your target node.  (But, hey, better than the status quo 
today here!)

So we should talk about node-targeted oom killing and how that would make
sense so that we can wire it up here if min_ttl_ms is to be used for
MGLRU, at least for NUMA systems.  It's a tricky problem in oom contexts
to be able to get at the information, per thread, that you want to
consider to determine eligibility.  Perhaps an even trickier problem, once
you have that information, is which heuristic to use to compare processes
with lots of memory on the system vs lots of memory on the node.

Has this been considered before?  For kswapd induced oom killing like this 
to work, it would have to be solved.
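
As a rough illustration of the trade-off above, here is a toy userspace
model of a node-aware victim choice: it ranks tasks by resident pages on
the nodes in a mask rather than by system-wide usage. Everything in it
(struct toy_task, toy_select_victim(), the plain bitmask standing in for
a nodemask) is hypothetical and only meant to frame the discussion; it is
not how the kernel's oom killer is implemented.

```c
#include <stddef.h>

#define TOY_MAX_NODES 4

/* Hypothetical per-task accounting: resident pages on each NUMA node. */
struct toy_task {
	int pid;
	unsigned long rss[TOY_MAX_NODES];
};

/*
 * Return the index of the task with the most resident pages on the
 * nodes selected by nodemask (bit N set means node N is a target), or
 * -1 if no task has any memory there. Tasks with nothing on the target
 * nodes are never chosen, since killing them would free nothing useful.
 */
int toy_select_victim(const struct toy_task *tasks, size_t n,
		      unsigned int nodemask)
{
	unsigned long best_pages = 0;
	int best = -1;
	size_t i;
	int node;

	for (i = 0; i < n; i++) {
		unsigned long pages = 0;

		for (node = 0; node < TOY_MAX_NODES; node++)
			if (nodemask & (1u << node))
				pages += tasks[i].rss[node];

		if (pages > best_pages) {
			best_pages = pages;
			best = (int)i;
		}
	}
	return best;
}
```

Given one task with 1000 pages on node 0 and 1 page on node 1, and
another with 10 pages on node 0 and 500 on node 1, a mask of only node 1
picks the second task, while a mask of both nodes picks the first. The
open question of how to weigh the two views against each other is
exactly the heuristic problem raised above.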


* Re: [LSF/MM/BPF TOPIC] Improving MGLRU
  2026-02-19 17:25 [LSF/MM/BPF TOPIC] Improving MGLRU Kairui Song
                   ` (4 preceding siblings ...)
  2026-02-27  7:11 ` [LSF/MM/BPF TOPIC] " David Rientjes
@ 2026-02-27 10:29 ` Vernon Yang
  2026-03-02 12:17   ` Kairui Song
  5 siblings, 1 reply; 26+ messages in thread
From: Vernon Yang @ 2026-02-27 10:29 UTC (permalink / raw)
  To: Kairui Song; +Cc: lsf-pc, Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-mm

On Fri, Feb 20, 2026 at 01:25:33AM +0800, Kairui Song wrote:
> Hi All,
>
> Apologies I forgot to add the proper tag in the previous email so
> resending this.
>
> MGLRU has been introduced in the mainline for years, but we still have two LRUs
> today. There are many reasons MGLRU is still not the only LRU implementation in
> the kernel.
>
> And I've been looking at a few major issues here:
>
> 1. Page flag usage: MGLRU uses many more flags (3+ more) than Active/Inactive
> LRU.
> 2. Regressions: MGLRU might cause regression, even though in many workloads it
> outperforms Active/Inactive by a lot.
> 3. Metrics: MGLRU makes some metrics work differently, for example: PSI,
> /proc/meminfo.
> 4. Some reclaim behavior is less controllable.
>
> And other issues too.
> And I think there isn't a simple solution, but it can definitely be solved. I
> would like to propose a session to discuss a few ideas on how to solve this, and
> perhaps we can finally only have one LRU in the kernel. So I'd like to propose a
> session to discuss some ideas about improving MGLRU and making it the only LRU.
>
> Some parts are just ideas, so far I have a working series [2] following the
> LFU and metric unification idea below, solving 2) and 3) above, and
> providing some very basic infrastructures for 1). Would try to send that as
> RFC for easier review and merge once it's stable enough soon, before LSF/MM/BPF.
>
> So far, I already observed a 30% reduction of refault of total folios in
> some workloads, including Tpcc and YCSB, and several critical regressions
> compared to Active / Inactive are gone, PG_workingset and PG_referenced are
> gone, yet things like PSI are more accurate (see below), and still stay
> bitwise compatible with Active / Inactive LRU. If it went smoothly,
> we might be able to unify and have only one LRU.
>
> Following topic and ideas are the key points:
>
> 1. Flags usage: which is solvable, and the hard part is mostly about
>    implementation details: MGLRU uses (at least) 3 extra flags for the gen
>    number, and we are expecting it to use more gen flags to support more than 4
>    gen. These flags can be moved to the tail of the LRU pointer after carefully
>    modifying the kernel's convention on LRU operations. That would allow us to
>    use up to 6 bits for the gen number and support up to 63 gens. The lower bit
>    of both pointers can be packed together for CAS on gen numbers. Reducing
>    flag usage by 3. Previously, Yu also suggested moving flags like PG_active to
>    the LRU pointer tail, which could also be a way.
>
>    struct folio {
>        /* ... */
>        union {
>                struct list_head lru;
>    +           struct lru_gen_list_head lru_gen;
>
>    So whenever the folio is on lruvec, `lru_gen_list_head` is used instead of
>    `lru`, which contains encoded info. We might be able to move all LRU-related
>    flags there.
>
>    Ordinary folio lists are still just fine, since when the folio is isolated,
>    `lru` is still there. But places like folio split, will need to check if
>    that's a lruvec folio, or folio on an ordinary list.
>
>    This part is just an idea yet. But might make us able to have up to 63 gens
>    in upstream and enable build for every config.
>
> 2. Regressions: Currently regression is a more major problem for us.
>    From our perspective, almost all regressions are caused by an under- or
>    overprotected file cache. MGLRU's PID protection either gets too aggressive
>    or too passive or just have a too long latency. To fix that, I'd propose a
>    LFU-like design and relax the PID's aggressiveness to make it much more
>    proactive and effective for file folios. The idea is always use 3 bits in
>    the page flags to count the referenced time (which would also replace
>    PG_workingset and PG_referenced). Initial tests showed a 30% reduction of
>    refaults, and many regressions are gone. A flow chart of how the MGLRU idea
>    might work:
>
>    ========== MGLFU Tiering ==========
>    Access  3 bit    lru_gen  lru_gen |(R - PG_referenced | W - PG_workingset)
>    Count   L|W|R    refs     tier    |(L - LRU_GEN_REFS)
>    0       0|0|0    0        0       | - Readahead & Cache
>    1       0|0|1    1        0       | - LRU_REFS_REFERENCED
>    ----- WORKINGSET / PROMOTE --- <--+ - <move out of min_seq>
>    2       0|1|0    2        0       | - LRU_REFS_WORKINGSET
>    3       0|1|1    3        1       | - Frequently used
>    4       1|0|0*   4        2       |
>    5       1|0|1*   5        2       |
>    6       1|1|0*   6        3       |
>    7       1|1|1*   7        3       | - LRU_REFS_MAX
>    ---------- PROMOTION ----------> --+ - <promote to next gen>
>
>    Once a folio has an access count > LRU_REFS_WORKINGSET, it never goes lower
>    than that. Folios that hit LRU_REFS_MAX will be promoted to next gen on
>    access, and remove the force protection of folios on eviction. This provides
>    a more proactive protection.
>
>    And this might also give other frameworks like DAMON a nicer interface to
>    interact with MGLRU, since the referenced count can promote every folio and
>    count accesses in a more reasonable and unified way for MGLRU now.
>
>    NOTE: Still changing this design according to test results, e.g. maybe
>    we should optionally still use 4 bits, so the final solution might not
>    be the same.
>
>    Another potential improvement on the regression issue is implementing the
>    refault distance as I once proposed [1], which can have a huge gain for some
>    workloads with heavy file folio usage. Maybe we can have both.
>
> 3. Metrics: The key here is about the meaning of page flags, including
>    PG_workingset and PG_referenced. These two flags are set/cleared very
>    differently for MGLRU compared to Active / Inactive LRU, but many other
>    components are still using them as metrics for Active / Inactive LRU. Hence,
>    I would propose to introduce a different mechanism to unify and replace these
>    two flags: Using the 3 bits in the page flags field reserved for LFU-like
>    tracking above, to determine the folio status.
>
>    Then following the above LFU-like idea, and using helpers like:
>
>    static inline bool folio_is_referenced(const struct folio *folio)
>    {
>     return folio_lru_refs(folio) >= LRU_REFS_REFERENCED;
>    }
>
>    static inline bool folio_is_workingset(const struct folio *folio)
>    {
>     return folio_lru_refs(folio) >= LRU_REFS_WORKINGSET;
>    }
>
>    static inline bool folio_is_referenced_by_bit(struct folio *folio)
>    {    /* For compatibility */
>     return !!(READ_ONCE(*folio_flags(folio, 0)) & BIT(LRU_REFS_PGOFF));
>    }
>
>    static inline void folio_mark_workingset_by_bit(struct folio *folio)
>    {    /* For compatibility */
>     set_mask_bits(folio_flags(folio, 0), BIT(LRU_REFS_PGOFF + 1),
>                   BIT(LRU_REFS_PGOFF + 1));
>    }
>
>    To tell if a folio belongs to a working set or is referenced. The definition
>    of workingset will be simplified as follows: a set referenced more than twice
>    for MGLRU, and decoupled from MGLRU's tiering.
>
> 4. MGLRU's swappiness is kind of useless in some situations compared to
>    Active / Inactive LRU, since its force protects the youngest two gen, so
>    quite often we can only reclaim one type of folios. To workaround that, the
>    user usually runs force aging before reclaim. So, can we just remove the
>    force protection of the youngest two gens?
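
For readers following point 2 of the quoted proposal, the "MGLFU
Tiering" chart can be modeled in a few lines of userspace C. This is
only a sketch of the chart as quoted: the toy_* names are hypothetical,
the tier table is copied from the chart rather than derived from a
formula, and leaving the count untouched on promotion is an assumption,
since the proposal says the design is still changing.

```c
#include <stdbool.h>

#define TOY_REFS_REFERENCED 1
#define TOY_REFS_WORKINGSET 2
#define TOY_REFS_MAX        7

/* Tier for each access count 0..7, copied from the quoted chart. */
static const unsigned char toy_tier_of_refs[TOY_REFS_MAX + 1] = {
	0, 0, 0, 1, 2, 2, 3, 3
};

struct toy_folio {
	unsigned int refs; /* 3-bit access count, 0..TOY_REFS_MAX */
	unsigned int gen;  /* generation number */
};

/*
 * Record one access. While the counter is below its maximum it is
 * simply incremented; once saturated, an access promotes the folio to
 * the next generation instead. Returns true on promotion.
 * Assumption: the counter is left unchanged when promoting.
 */
bool toy_folio_accessed(struct toy_folio *f)
{
	if (f->refs >= TOY_REFS_MAX) {
		f->gen++;
		return true;
	}
	f->refs++;
	return false;
}

unsigned int toy_folio_tier(const struct toy_folio *f)
{
	return toy_tier_of_refs[f->refs];
}

/* Counterparts of the quoted folio_is_referenced()/folio_is_workingset(). */
bool toy_folio_is_referenced(const struct toy_folio *f)
{
	return f->refs >= TOY_REFS_REFERENCED;
}

bool toy_folio_is_workingset(const struct toy_folio *f)
{
	return f->refs >= TOY_REFS_WORKINGSET;
}
```

A fresh folio becomes "referenced" on its first access and "workingset"
on its second, climbs through tiers 1 to 3 over the next five accesses,
and is promoted to the next generation on the eighth.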

Hi Kairui,

I would be very interested in discussing this topic as well.

In Linux desktop distributions, when the system rapidly enters a low
memory state, it is almost impossible to enter S4; the success rate
is only 10%. When analyzing this issue, the cause was identified as
the inability to reclaim memory. Further investigation revealed that:

1. This phenomenon does not occur with Active/Inactive LRU; it only
   exists with MGLRU.
2. If force aging is performed before entering S4, the success rate
   exceeds 90%.

Detailed memory information is as follows.

MemFree:          269944 kB
Active:          4095536 kB
Inactive:        2831960 kB
Active(anon):    2667952 kB
Inactive(anon):   247208 kB
Active(file):    1427584 kB
Inactive(file):  2584752 kB

Since MGLRU force protects the youngest two gens, it is hard for it to
reclaim enough memory when the requested amount is larger than the
"Inactive" size, e.g. when hibernation calls shrink_all_memory(3G).

We addressed this issue by implementing a retry mechanism similar to
memory.reclaim, and the success rate of S4 increased from 10% to 100%.

If we could directly remove the force protection of the youngest two
generations, this issue would also be resolved, and the solution would
be more universally applicable.
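
The shape of such a retry loop, as a toy userspace sketch: keep asking
for the remainder until the target is met, and tolerate a bounded number
of passes that make no progress. This is not the actual patch; the
toy_* names, the callback type and the stand-in reclaimer are all
hypothetical, and a real reclaimer would presumably run aging between
passes so that more folios become eligible.

```c
#include <stddef.h>

/* One reclaim pass: try to free up to nr pages, return pages freed. */
typedef unsigned long (*toy_reclaim_fn)(unsigned long nr, void *ctx);

/*
 * Call reclaim() repeatedly until target pages are freed. A pass that
 * frees nothing consumes one retry; give up when a pass frees nothing
 * and no retries are left. Returns the total number of pages freed.
 */
unsigned long toy_reclaim_with_retry(unsigned long target, int max_retries,
				     toy_reclaim_fn reclaim, void *ctx)
{
	unsigned long total = 0;
	int retries_left = max_retries;

	while (total < target) {
		unsigned long got = reclaim(target - total, ctx);

		if (got == 0) {
			if (retries_left == 0)
				break;
			retries_left--;
			continue;
		}
		total += got;
	}
	return total;
}

/* Stand-in reclaimer: each pass can reach at most batch pages. */
struct toy_pool {
	unsigned long available; /* pages that can still be freed */
	unsigned long batch;     /* most pages one pass can free */
};

unsigned long toy_capped_reclaim(unsigned long nr, void *ctx)
{
	struct toy_pool *pool = ctx;
	unsigned long got = nr;

	if (got > pool->batch)
		got = pool->batch;
	if (got > pool->available)
		got = pool->available;
	pool->available -= got;
	return got;
}
```

With a pool of 3000 pages and a batch of 1000, a single pass can never
satisfy a 2500-page request, but the loop reaches it in three passes. If
the pool runs dry first, the loop stops after the retries are used up
and returns what it managed to free.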

--
Cheers,
Vernon

> 5. Async aging and aging optimization are also required to make the above ideas
>    work better.
>
> 6. Other issues and discussion on whether the above improvements will help
>    solve them or make them worse. e.g.
>
>    For eBPF extension, using eBPF to determine which gen a folio should be
>    landed given the shadow and after we have more than 4 gens, might be very
>    helpful and enough for many workload customizations.
>
>    Can we just ignore the shadow for anon folios? MGLRU basically activates
>    anon folios unconditionally, especially if we combined with the LFU like
>    idea above we might only want to track the 3 bit count, and get rid of
>    the extra bit usage in the shadow. The eviction performance might be even
>    better, and other components like swap table [3] will have more bits to use
>    for better performance and more features.
>
> The goal is:
>
> - Reduce MGLRU's page flag usage to be identical or less compared to Active /
>   Inactive LRU.
> - Eliminate regressions.
> - Unify or improve the metrics.
> - Provides more extensibility.
>
> Link: https://lwn.net/Articles/945266/ [1]
> Link: https://github.com/ryncsn/linux/tree/improving-mglru [2]
> Link: https://lore.kernel.org/linux-mm/20260218-swap-table-p3-v3-5-f4e34be021a7@tencent.com/ [3]
>




* Re: [LSF/MM/BPF TOPIC] Improving MGLRU
  2026-02-26 15:54 ` Matthew Wilcox
  2026-02-27  4:31   ` [LSF/MM/BPF] " Barry Song
@ 2026-02-27 17:55   ` Shakeel Butt
  2026-02-27 18:50     ` Gregory Price
  2026-03-03  1:31     ` Axel Rasmussen
  2026-03-03  1:30   ` Axel Rasmussen
  2 siblings, 2 replies; 26+ messages in thread
From: Shakeel Butt @ 2026-02-27 17:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kairui Song, lsf-pc, Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-mm

On Thu, Feb 26, 2026 at 03:54:22PM +0000, Matthew Wilcox wrote:
> On Fri, Feb 20, 2026 at 01:25:33AM +0800, Kairui Song wrote:
> > MGLRU has been introduced in the mainline for years, but we still have two LRUs
> > today. There are many reasons MGLRU is still not the only LRU implementation in
> > the kernel.
> 
> To my mind, the biggest problem with MGLRU is that Google dumped it on us
> and ran away.  Commit 44958000bada claimed that it was now maintained and
> added three people as maintainers.  In the six months since that commit,
> none of those three people have any commits in mm/!  This is a shameful
> state of affairs.
> 
> I say rip it out.

I have very similar concerns. Though rather than ripping it out, I would like
us to put effort into unifying the two reclaim mechanisms (traditional & MGLRU)
rather than into improving MGLRU.

> 



* Re: [LSF/MM/BPF TOPIC] Improving MGLRU
  2026-02-27 17:55   ` [LSF/MM/BPF TOPIC] " Shakeel Butt
@ 2026-02-27 18:50     ` Gregory Price
  2026-03-03  1:31     ` Axel Rasmussen
  1 sibling, 0 replies; 26+ messages in thread
From: Gregory Price @ 2026-02-27 18:50 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Matthew Wilcox, Kairui Song, lsf-pc, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, linux-mm

On Fri, Feb 27, 2026 at 09:55:52AM -0800, Shakeel Butt wrote:
> On Thu, Feb 26, 2026 at 03:54:22PM +0000, Matthew Wilcox wrote:
> > On Fri, Feb 20, 2026 at 01:25:33AM +0800, Kairui Song wrote:
> > > MGLRU has been introduced in the mainline for years, but we still have two LRUs
> > > today. There are many reasons MGLRU is still not the only LRU implementation in
> > > the kernel.
> > 
> > To my mind, the biggest problem with MGLRU is that Google dumped it on us
> > and ran away.  Commit 44958000bada claimed that it was now maintained and
> > added three people as maintainers.  In the six months since that commit,
> > none of those three people have any commits in mm/!  This is a shameful
> > state of affairs.
> > 
> > I say rip it out.
> 
> I have very similar concerns. Though rather than ripping it out, I would like
> we put efforts in unifying the two reclaim mechanism (traditional & MGLRU) over
> improving MGLRU.
> 

I would agree.

If we could make the baseline MGLRU 2-generation, with behavioral
parity with the current LRU, then adding the additional generations
is just a mechanical change - and doesn't hurt anyone (default = 2 gens).

But my understanding is MGLRU has behavior differences regarding its
preferences on how it ages anon vs file.

That mistake will cause significant pain in unifying them.

~Gregory



* Re: [LSF/MM/BPF] Improving MGLRU
  2026-02-27  3:30 ` [LSF/MM/BPF] " Barry Song
@ 2026-03-02 11:10   ` Kairui Song
  2026-03-03  4:06     ` Barry Song
  0 siblings, 1 reply; 26+ messages in thread
From: Kairui Song @ 2026-03-02 11:10 UTC (permalink / raw)
  To: Barry Song, David Rientjes
  Cc: axelrasmussen, linux-mm, lsf-pc, weixugc, yuanchu

On Fri, Feb 27, 2026 at 11:30 AM Barry Song <21cnbao@gmail.com> wrote:
>
> > 4. MGLRU's swappiness is kind of useless in some situations compared to
> >   Active / Inactive LRU, since its force protects the youngest two gen, so
> >   quite often we can only reclaim one type of folios. To workaround that, the
> >   user usually runs force aging before reclaim. So, can we just remove the
> >   force protection of the youngest two gens?
>
> I guess not—MGLRU needs at least two generations to function,
> similar to active and inactive lists, meaning it requires two lists.

Hi Barry,

You are right. But I think that doesn't mean we can never reclaim
the folios in the oldest gen? Or maybe we could just let the kernel
itself perform aging when one type of folio is not reclaimable.

We have an internal workaround that forces aging and waits for sync
aging to finish if one type of folio is not reclaimable (without the
wait, we still hit the MIN_NR_GENS protection again since aging is not
finished). And without the MIN_NR_GENS protection we might end up over
reclaiming without aging.

The problem with that is that the OOM killer becomes very slow to
trigger since aging is costly, so the system will hang for minutes
before OOM is triggered when it should be triggered immediately.

And for the OOM part, I saw David Rientjes also mentioned the TTL
config in MGLRU. I do think TTL is a good idea; we just need to figure
out a good way to make better use of it.

I think a feasible solution might be (just an idea): implement async
aging; decouple aging and reclaim, so that reclaim just keeps shrinking
whatever is oldest; and optionally improve thrashing and OOM handling
with TTL.


* Re: [LSF/MM/BPF TOPIC] Improving MGLRU
  2026-02-27 10:29 ` Vernon Yang
@ 2026-03-02 12:17   ` Kairui Song
  0 siblings, 0 replies; 26+ messages in thread
From: Kairui Song @ 2026-03-02 12:17 UTC (permalink / raw)
  To: Vernon Yang; +Cc: lsf-pc, Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-mm

On Fri, Feb 27, 2026 at 6:29 PM Vernon Yang <vernon2gm@gmail.com> wrote:
> Hi Kairui,
>
> I would be very interested in discussing this topic as well.
>
> In Linux desktop distributions, when the system rapidly enters low
> memory state, it is almost impossible to enter S4, the success rate
> only is 10%. When analyzing this issue, it was identified as the
> inability to reclaim memory. Further investigation revealed that:

Hi Vernon,

Thanks for the comment!

> If we could directly remove the force protection of the youngest two
> generations, this issue would also be resolved, and the solution would
> be more universally applicable.

Yeah, that's also what I have in mind. Such issues should be fixable
if we find a way to remove or optimize gen protection.



* Re: [LSF/MM/BPF] Improving MGLRU
  2026-02-27  4:31   ` [LSF/MM/BPF] " Barry Song
@ 2026-03-02 17:46     ` Gregory Price
  2026-03-05  6:27       ` Barry Song
  0 siblings, 1 reply; 26+ messages in thread
From: Gregory Price @ 2026-03-02 17:46 UTC (permalink / raw)
  To: Barry Song
  Cc: willy, axelrasmussen, linux-mm, lsf-pc, ryncsn, weixugc, yuanchu

On Fri, Feb 27, 2026 at 12:31:39PM +0800, Barry Song wrote:
> >> MGLRU has been introduced in the mainline for years, but we still have two LRUs
> >> today. There are many reasons MGLRU is still not the only LRU implementation in
> >> the kernel.
> 
> > To my mind, the biggest problem with MGLRU is that Google dumped it on us
> > and ran away.  Commit 44958000bada claimed that it was now maintained and
> > added three people as maintainers.  In the six months since that commit,
> > none of those three people have any commits in mm/!  This is a shameful
> > state of affairs.
> > 
> > I say rip it out.
> 
> Hi Matthew,  
> Can we keep it for now? Kairui, Zicheng, and I are working on it.
> 
> From what I’ve seen, it performs much better than the active/inactive  
> approach after applying a few vendor hooks on Android, such as forced  
> aging and avoiding direct activation of read-ahead folios during page  
> faults, among others. To be honest, performance was worse than  
> active/inactive without those hooks, which are still not in mainline.  
> 
> It just needs more work. MGLRU has many strong design aspects, including  
> using more generations to differentiate cold from hot, the look-around 
> mechanism to reduce scanning overhead by leveraging cache locality,  
> and data structure designs that minimize lock holding.

In presentations where the distribution of generations is shown for
different workloads, I've seen many bi-modal distributions for MGLRU
(where oldest and youngest contain the bulk of the folios).

It makes the value of multiple generations questionable - especially at
the level MGLRU emulates it right now (multiple generations PLUS multiple
tiers within those generations).

One of the issues with MGLRU is it's actually quite difficult to
determine which feature it introduces (there are 7 or 8 major features)
is responsible for producing any given effect on a workload.

In a random test over the weekend where I turned everything but
multiple generations off (no page table scan, no bloom filter, etc -
MGLRU just defaults to a multi-gen FIFO) I found that streaming
workloads did better this way.  

Makes sense, given that MGLRU is trying to protect working set,
but I didn't expect it to be that dramatic.

It seems at best problematic to argue "We just need more heuristics!",
but clearly MGLRU "works, for some definition of the word works".

~Gregory


* Re: [LSF/MM/BPF TOPIC] Improving MGLRU
  2026-02-26 15:54 ` Matthew Wilcox
  2026-02-27  4:31   ` [LSF/MM/BPF] " Barry Song
  2026-02-27 17:55   ` [LSF/MM/BPF TOPIC] " Shakeel Butt
@ 2026-03-03  1:30   ` Axel Rasmussen
  2 siblings, 0 replies; 26+ messages in thread
From: Axel Rasmussen @ 2026-03-03  1:30 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Kairui Song, lsf-pc, Yuanchu Xie, Wei Xu, linux-mm

On Thu, Feb 26, 2026 at 7:54 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Feb 20, 2026 at 01:25:33AM +0800, Kairui Song wrote:
> > MGLRU has been introduced in the mainline for years, but we still have two LRUs
> > today. There are many reasons MGLRU is still not the only LRU implementation in
> > the kernel.
>
> To my mind, the biggest problem with MGLRU is that Google dumped it on us
> and ran away.  Commit 44958000bada claimed that it was now maintained and
> added three people as maintainers.  In the six months since that commit,
> none of those three people have any commits in mm/!  This is a shameful
> state of affairs.
>
> I say rip it out.

I acknowledge this is a big problem. We have let the community down
here, and we plan to correct this starting in April, e.g. by working
together with Kairui and others to address outstanding issues.

As part of qualifying MGLRU in our own production environment we've
been debugging and developing expertise on a variety of kernel
versions up to 6.18. Priority has been given to this qualification,
and we did not give enough attention to upstream engagement. We plan to
post any fixes / improvements discovered during this process to the
mailing list.


* Re: [LSF/MM/BPF TOPIC] Improving MGLRU
  2026-02-27 17:55   ` [LSF/MM/BPF TOPIC] " Shakeel Butt
  2026-02-27 18:50     ` Gregory Price
@ 2026-03-03  1:31     ` Axel Rasmussen
  2026-03-03 13:39       ` Shakeel Butt
  1 sibling, 1 reply; 26+ messages in thread
From: Axel Rasmussen @ 2026-03-03  1:31 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Matthew Wilcox, Kairui Song, lsf-pc, Yuanchu Xie, Wei Xu, linux-mm

On Fri, Feb 27, 2026 at 9:56 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Thu, Feb 26, 2026 at 03:54:22PM +0000, Matthew Wilcox wrote:
> > On Fri, Feb 20, 2026 at 01:25:33AM +0800, Kairui Song wrote:
> > > MGLRU has been introduced in the mainline for years, but we still have two LRUs
> > > today. There are many reasons MGLRU is still not the only LRU implementation in
> > > the kernel.
> >
> > To my mind, the biggest problem with MGLRU is that Google dumped it on us
> > and ran away.  Commit 44958000bada claimed that it was now maintained and
> > added three people as maintainers.  In the six months since that commit,
> > none of those three people have any commits in mm/!  This is a shameful
> > state of affairs.
> >
> > I say rip it out.
>
> I have very similar concerns. Though rather than ripping it out, I would like
> we put efforts in unifying the two reclaim mechanism (traditional & MGLRU) over
> improving MGLRU.

Shakeel, I think this is a great idea. If you have any ideas around
low hanging fruit here, please share. I'm planning to invest much more
time here going forward, so I'd be happy to turn some ideas into
patches. :)

>
> >



* Re: [LSF/MM/BPF] Improving MGLRU
  2026-03-02 11:10   ` Kairui Song
@ 2026-03-03  4:06     ` Barry Song
  2026-03-05 17:13       ` David Stevens
  0 siblings, 1 reply; 26+ messages in thread
From: Barry Song @ 2026-03-03  4:06 UTC (permalink / raw)
  To: Kairui Song
  Cc: David Rientjes, axelrasmussen, linux-mm, lsf-pc, weixugc, yuanchu

On Mon, Mar 2, 2026 at 7:10 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Fri, Feb 27, 2026 at 11:30 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > > 4. MGLRU's swappiness is kind of useless in some situations compared to
> > >   Active / Inactive LRU, since its force protects the youngest two gen, so
> > >   quite often we can only reclaim one type of folios. To workaround that, the
> > >   user usually runs force aging before reclaim. So, can we just remove the
> > >   force protection of the youngest two gens?
> >
> > I guess not—MGLRU needs at least two generations to function,
> > similar to active and inactive lists, meaning it requires two lists.
>
> Hi Barry,
>
> You are right. But I think that doesn't mean we can never reclaim
> the folios in the oldest gen? Or maybe, just let the kernel itself

I think we could reclaim the oldest generation even when
only two generations remain. However, that would make
MGLRU more conceptually confusing. We currently map the
youngest two generations to “active” and the oldest two
to “inactive.”

If there are only two generations, they effectively both
fall into the “active” category, so reclaiming one of them
would mean reclaiming from “active,” which feels rather
counterintuitive to me.

So I’d prefer a two-step approach:
1. Age pages to form inactive generations.
2. Reclaim the “inactive” generations

rather than reclaiming active generations directly.

> perform aging when one type of folios is not reclaimable.

I would prefer to avoid having only two generations.
Ideally, new generations should be created before reaching
that point—similar to the active→inactive transition,
but driven by aging.

>
> We have an internal workaround that forces aging and waits for sync
> aging if one type of folios are not reclaimable (without the wait, we
> still hit the MIN_NR_GEN protect again since aging is not finished).
> And without the MIN_NR_GEN protection we might end up over reclaiming
> without aging.

I see your point—I did exactly the same thing in Android.
However, there’s a significant problem. If anon has two
generations and files have four, they end up sharing
generations. To age anon, we would also need to move file
folios between generations; otherwise, the hottest and
oldest generations would overlap, causing cold/hot
inversion. Furthermore, in inc_min_seq(), moving folios
means the oldest generation gets pushed into the second-
oldest generation:

new_gen = folio_inc_gen(lruvec, folio, false);
list_move_tail(&folio->lru, &lrugen->folios[new_gen][type][zone]);

This is far from ideal, as it still mixes cold and hot pages
to some extent. Could we keep anon and file generations
separate instead? I feel this is a strong requirement and
likely the first step toward making swappiness work properly.
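
To make the shared-generation problem concrete, here is a small userspace
sketch (not kernel code). It assumes one max_seq shared by both types and a
per-type min_seq, with MIN_NR_GENS = 2 and MAX_NR_GENS = 4; the toy_* names
and the struct are made up for illustration only.

```c
#define MIN_NR_GENS 2
#define MAX_NR_GENS 4

enum { TYPE_ANON, TYPE_FILE, NR_TYPES };

/* Toy stand-in for the lrugen state: one max_seq shared by both types. */
struct toy_lrugen {
	unsigned long max_seq;
	unsigned long min_seq[NR_TYPES];
};

/* Number of generations a type currently spans. */
static int toy_nr_gens(const struct toy_lrugen *g, int type)
{
	return (int)(g->max_seq - g->min_seq[type] + 1);
}

/*
 * Age once, i.e. create a new youngest generation. A type that is already
 * at MAX_NR_GENS must first give up its oldest generation, which means its
 * oldest folios get merged into the second-oldest generation.
 * Returns a bitmask of the types that were forced to merge.
 */
static unsigned int toy_age(struct toy_lrugen *g)
{
	unsigned int merged = 0;
	int type;

	for (type = 0; type < NR_TYPES; type++) {
		if (toy_nr_gens(g, type) == MAX_NR_GENS) {
			g->min_seq[type]++;
			merged |= 1u << type;
		}
	}
	g->max_seq++;
	return merged;
}
```

With anon at two generations and file at four, a single toy_age() call gives
anon a third generation but also forces the file type to merge its oldest
generation, which is the cold/hot mixing described above.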

>
> The problem with that is that the OOM killer became very slow to
> trigger since aging is costly, so the system will hang for minutes
> before OOM is triggered when it should get triggered immediately.

There’s a shrink_active_list() in active/inactive to
prevent inactivation starvation. We likely need something
similar.

A key difference between MGLRU and active/inactive is that
active/inactive performs demotion—moving pages from
active to inactive, with the ability to specify anon or file
types—whereas MGLRU performs promotion, scanning PTEs
to identify young folios for new generations without
distinguishing between anon and file. This could slow down
MGLRU aging exactly when faster memory reclamation is
needed?

Of course, we could treat mm_state as null and skip
walk_mm() for scanning PTEs, but this would make aging
purely a matter of moving folios, without any basis in
whether the PTEs are actually young?

>
> And for the OOM part I saw David Rientjes also mentioned the TTL
> config in MGLRU, I do think TTL is a good idea, we just need to figure
> out a good way to make better use of that.
>
> I think a feasible solution might be (just idea): implement async
> aging; decouple aging and reclaim, reclaim just keep shrinking
> whatever is oldest; and optionally improve thrashing and OOM with TTL.

I’m not sure we want to add a separate thread for async
aging, since kswapd is already quite complex. Could async
aging be handled mainly by kswapd instead? For direct
reclamation cases, if aging is urgent, we might just skip
walk_mm(), or alternatively call inc_max_seq() directly. On
Android, we once completely disabled walk_mm() and only
observed positive effects, which also reduced mmap_lock
contention. So I’m thinking we could consider disabling
walk_mm() by default on hardware that lacks non-leaf
(e.g., PMD) access bits.

I agree that we can leverage TTL to improve OOM handling
and reduce thrashing.

Thanks
Barry


* Re: [LSF/MM/BPF TOPIC] Improving MGLRU
  2026-03-03  1:31     ` Axel Rasmussen
@ 2026-03-03 13:39       ` Shakeel Butt
  2026-03-05  6:46         ` Chen Ridong
  0 siblings, 1 reply; 26+ messages in thread
From: Shakeel Butt @ 2026-03-03 13:39 UTC (permalink / raw)
  To: Axel Rasmussen
  Cc: Matthew Wilcox, Kairui Song, lsf-pc, Yuanchu Xie, Wei Xu,
	linux-mm, chenridong

On Mon, Mar 02, 2026 at 05:31:26PM -0800, Axel Rasmussen wrote:
> On Fri, Feb 27, 2026 at 9:56 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Thu, Feb 26, 2026 at 03:54:22PM +0000, Matthew Wilcox wrote:
> > > On Fri, Feb 20, 2026 at 01:25:33AM +0800, Kairui Song wrote:
> > > > MGLRU has been introduced in the mainline for years, but we still have two LRUs
> > > > today. There are many reasons MGLRU is still not the only LRU implementation in
> > > > the kernel.
> > >
> > > To my mind, the biggest problem with MGLRU is that Google dumped it on us
> > > and ran away.  Commit 44958000bada claimed that it was now maintained and
> > > added three people as maintainers.  In the six months since that commit,
> > > none of those three people have any commits in mm/!  This is a shameful
> > > state of affairs.
> > >
> > > I say rip it out.
> >
> > I have very similar concerns. Though rather than ripping it out, I would like
> > we put efforts in unifying the two reclaim mechanism (traditional & MGLRU) over
> > improving MGLRU.
> 
> Shakeel, I think this is a great idea. If you have any ideas around
> low hanging fruit here, please share. I'm planning to invest much more
> time here going forward, so I'd be happy to turn some ideas into
> patches. :)

I think we can start with the memcg LRU, which Chen (CCed) was working on.
Also, why not propose an LSF/MM discussion on the topic? I am sure many folks
will be interested.



* Re: [LSF/MM/BPF] Improving MGLRU
  2026-03-02 17:46     ` Gregory Price
@ 2026-03-05  6:27       ` Barry Song
  2026-03-05  7:31         ` Gregory Price
  0 siblings, 1 reply; 26+ messages in thread
From: Barry Song @ 2026-03-05  6:27 UTC (permalink / raw)
  To: Gregory Price
  Cc: willy, axelrasmussen, linux-mm, lsf-pc, ryncsn, weixugc, yuanchu

On Tue, Mar 3, 2026 at 1:46 AM Gregory Price <gourry@gourry.net> wrote:
>
> On Fri, Feb 27, 2026 at 12:31:39PM +0800, Barry Song wrote:
> > >> MGLRU has been introduced in the mainline for years, but we still have two LRUs
> > >> today. There are many reasons MGLRU is still not the only LRU implementation in
> > >> the kernel.
> >
> > > To my mind, the biggest problem with MGLRU is that Google dumped it on us
> > > and ran away.  Commit 44958000bada claimed that it was now maintained and
> > > added three people as maintainers.  In the six months since that commit,
> > > none of those three people have any commits in mm/!  This is a shameful
> > > state of affairs.
> > >
> > > I say rip it out.
> >
> > Hi Matthew,
> > Can we keep it for now? Kairui, Zicheng, and I are working on it.
> >
> > From what I’ve seen, it performs much better than the active/inactive
> > approach after applying a few vendor hooks on Android, such as forced
> > aging and avoiding direct activation of read-ahead folios during page
> > faults, among others. To be honest, performance was worse than
> > active/inactive without those hooks, which are still not in mainline.
> >
> > It just needs more work. MGLRU has many strong design aspects, including
> > using more generations to differentiate cold from hot, the look-around
> > mechanism to reduce scanning overhead by leveraging cache locality,
> > and data structure designs that minimize lock holding.
>
> In presentations where the distribution of generations is shown for
> different workloads, I've seen many bi-modal distributions for MGLRU
> (where oldest and youngest contain the bulk of the folios).
>
> It makes the value of multiple generations questionable - especially at
> the level MGLRU emulates it right now (multiple generations PLUS multiple
> tiers within those generations).
>
> One of the issues with MGLRU is it's actually quite difficult to
> determine which feature it introduces (there are 7 or 8 major features)
> is responsible for producing any given effect on a workload.

True. MGLRU has multiple features:

1. lru_gen_look_around — exploits spatial locality by scanning
adjacent PTEs of a young PTE.

This is also beneficial for active/inactive LRU, as it helps reduce
rmap cost. The Android kernel once had a hook to enable it for the
active/inactive LRU:

https://android.googlesource.com/kernel/common.git/+/76541556a9a3540

2. page table walks for aging — further exploit spatial locality.
The aging path prefers walking page tables to look for young PTEs
and promote hot pages.

I didn’t observe any improvement on ARM64 Android, but I did notice
increased mmap_lock contention. Disabling it actually reduced CPU
usage, rather than increasing it as the patch claimed. Perhaps this is
because ARM64 lacks a non-leaf young bit, making the scanning cost
quite high?

3. fallback to the other type when one type has only two generations.

isolate_folios():
                scanned = scan_folios(nr_to_scan, lruvec, sc, type, tier, list);
                /* scanned will be set to 0 if the type has only two gens */
                if (scanned)
                        return scanned;

                type = !type;

This seems to be a major issue with MGLRU, making swappiness largely
ineffective. People have been complaining about over-reclamation of
file pages even when they set a high swappiness to prefer reclaiming
anonymous pages.

4. very aggressively promote mapped folios.

Active/inactive LRU relies on scanning and detecting young PTEs to
promote mapped folios from inactive to active, whereas MGLRU
promotes mapped folios directly to the youngest generation.

Active/inactive LRU should be able to retain read-ahead and
map_around folios that haven’t actually been accessed in inactive, but
MGLRU promotes all of them indiscriminately.

This can sometimes be appropriate, but it often overshoots:

void folio_add_lru(struct folio *folio)
{
        VM_BUG_ON_FOLIO(folio_test_active(folio) &&
                        folio_test_unevictable(folio), folio);
        VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);

        /* see the comment in lru_gen_folio_seq() */
        if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
            lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
                folio_set_active(folio);

        folio_batch_add_and_move(folio, lru_add);
}

In particular, I observed that read-ahead folios triggered by faults were
being promoted, which significantly degrades MGLRU performance on
low-memory devices. I attempted to mitigate this by:

https://lore.kernel.org/linux-mm/20260225223712.3685-1-21cnbao@gmail.com/

5. min_ttl_ms - thrashing prevention

This might be a good option, but I’ve noticed that people often don’t
know how to use it or how to integrate it with the Android OOM
killer. As a result, I see users leaving it untouched. I’m not sure if
any Android users are actually using it—if there are, please let me
know.

6. gen, tier, bloom filter

This replaces active/inactive and compares file versus anon aging,
handling scan balance between anon and file.
I’m not sure they are definitely better, but they do seem much more
complex than active/inactive.

7. Missing shrink_active_list() — the function to demote folios from
active to inactive.

Active/inactive performs rmap walks and scans PTEs to demote folios from
active to inactive before reclamation. MGLRU, however, seems to always
promote: it finds young folios and moves them to the new generation,
while older folios automatically move to the old generation.

This seems to reduce reclamation cost significantly, as folio_referenced()
would otherwise need to perform rmap and scan PTEs in each process
to clear access bits in shrink_active_list().

Points 1 and 7 might explain why we have observed MGLRU showing lower
CPU usage than active/inactive.

8. swappiness concept difference.

In active/inactive LRU, even with swappiness set to 0, anonymous pages
still have a chance to be reclaimed if file pages run out.

In MGLRU, setting swappiness=0 effectively disables anon reclamation,
which can lead to cold/hot inversion of anon pages:

inc_min_seq():

        /* For anon type, skip the check if swappiness is zero (file only) */
        if (!type && !swappiness)
                goto done;

        /* prevent cold/hot inversion if the type is evictable */
        for (zone = 0; zone < MAX_NR_ZONES; zone++) {
                struct list_head *head = &lrugen->folios[old_gen][type][zone];

I wonder if setting swappiness=201 could also cause file cold/hot inversion?

        /* For file type, skip the check if swappiness is anon only */
        if (type && (swappiness == SWAPPINESS_ANON_ONLY))
                goto done;

So, when people set swappiness=201 to force shrinking anonymous pages
only, it might put file folios at risk?

Together with point 3 - MGLRU’s swappiness has a much less clear effect
on reclaiming file versus anon pages compared to active/inactive,
highlighting a significant difference between MGLRU and
active/inactive behavior.
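
To make points 3 and 8 concrete, here is a small userspace sketch (not
kernel code). The toy_* names and the per-type struct are hypothetical;
MIN_NR_GENS = 2 and SWAPPINESS_ANON_ONLY = 201 are assumed to match the
values discussed above.

```c
#include <stdbool.h>

#define MIN_NR_GENS		2
#define MAX_SWAPPINESS		200
#define SWAPPINESS_ANON_ONLY	(MAX_SWAPPINESS + 1)

enum { TYPE_ANON, TYPE_FILE, NR_TYPES };

/* Hypothetical per-type state: generation count and folios left in the oldest one. */
struct toy_type_state {
	int nr_gens;
	int oldest_folios;
};

/* Point 3: a type that is down to MIN_NR_GENS generations yields nothing. */
static int toy_scan(struct toy_type_state *t, int nr_to_scan)
{
	int scanned;

	if (t->nr_gens <= MIN_NR_GENS)
		return 0;

	scanned = nr_to_scan < t->oldest_folios ? nr_to_scan : t->oldest_folios;
	t->oldest_folios -= scanned;
	return scanned;
}

/*
 * Try the preferred type, then fall back to the other one (type = !type).
 * Returns the type that was actually scanned, or -1 if neither made progress.
 */
static int toy_isolate(struct toy_type_state *state, int preferred, int nr_to_scan)
{
	int type = preferred;
	int i;

	for (i = 0; i < NR_TYPES; i++) {
		if (toy_scan(&state[type], nr_to_scan))
			return type;
		type = !type;
	}
	return -1;
}

/* Point 8: true when the quoted checks would "goto done" for this type. */
static bool toy_skips_inversion_check(int type, int swappiness)
{
	if (type == TYPE_ANON && swappiness == 0)
		return true;
	if (type == TYPE_FILE && swappiness == SWAPPINESS_ANON_ONLY)
		return true;
	return false;
}
```

In this model, a caller that prefers anon still ends up scanning file folios
whenever anon has only two generations, regardless of the swappiness value,
and the two skip conditions are symmetric for swappiness 0 and 201.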

Considering all of the above, I feel MGLRU is quite different from
active/inactive. Trying to unify them seems like merging two
completely different approaches. Still, active/inactive might have
some useful lessons to learn from MGLRU, particularly on how to
reduce reclamation cost.

>
> In a random test over the weekend where I turned everything but
> multiple generations off (no page table scan, no bloom filter, etc -
> MGLRU just defaults to a multi-gen FIFO) I found that streaming
> workloads did better this way.

I understand your point, I'd say there will always be cases where LRU
is not the most suitable algorithm.

Perhaps an eBPF-programmable LRU could also be a direction
worth exploring. We could set different eBPF programs for
different workloads? There is a project in this area:

https://dl.acm.org/doi/pdf/10.1145/3731569.3764820
https://github.com/cache-ext/cache_ext

>
> Makes sense, given that MGLRU is trying to protect working set,
> but I didn't expect it to be that dramatic.
>
> It seems at best problematic to argue "We just need more heuristics!",
> but clearly MGLRU "works, for some definition of the word works".

The in-kernel LRU probably aims to be suitable for most workloads,
but it will not be “good” enough for all workloads :-)

Thanks
Barry


* Re: [LSF/MM/BPF TOPIC] Improving MGLRU
  2026-03-03 13:39       ` Shakeel Butt
@ 2026-03-05  6:46         ` Chen Ridong
  0 siblings, 0 replies; 26+ messages in thread
From: Chen Ridong @ 2026-03-05  6:46 UTC (permalink / raw)
  To: Shakeel Butt, Axel Rasmussen
  Cc: Matthew Wilcox, Kairui Song, lsf-pc, Yuanchu Xie, Wei Xu, linux-mm



On 2026/3/3 21:39, Shakeel Butt wrote:
> On Mon, Mar 02, 2026 at 05:31:26PM -0800, Axel Rasmussen wrote:
>> On Fri, Feb 27, 2026 at 9:56 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>>
>>> On Thu, Feb 26, 2026 at 03:54:22PM +0000, Matthew Wilcox wrote:
>>>> On Fri, Feb 20, 2026 at 01:25:33AM +0800, Kairui Song wrote:
>>>>> MGLRU has been introduced in the mainline for years, but we still have two LRUs
>>>>> today. There are many reasons MGLRU is still not the only LRU implementation in
>>>>> the kernel.
>>>>
>>>> To my mind, the biggest problem with MGLRU is that Google dumped it on us
>>>> and ran away.  Commit 44958000bada claimed that it was now maintained and
>>>> added three people as maintainers.  In the six months since that commit,
>>>> none of those three people have any commits in mm/!  This is a shameful
>>>> state of affairs.
>>>>
>>>> I say rip it out.
>>>
>>> I have very similar concerns. Though rather than ripping it out, I would like
>>> we put efforts in unifying the two reclaim mechanism (traditional & MGLRU) over
>>> improving MGLRU.
>>
>> Shakeel, I think this is a great idea. If you have any ideas around
>> low hanging fruit here, please share. I'm planning to invest much more
>> time here going forward, so I'd be happy to turn some ideas into
>> patches. :)
> 
> I think we can start with memcg LRU on which Chen (CCed) was working on. Also
> why not propose a lsfmm discussion on the topic, I am sure many folks will be
> interested.

Thank you for the cc.

I am very interested in this topic.

-- 
Best regards,
Ridong




* Re: [LSF/MM/BPF] Improving MGLRU
  2026-03-05  6:27       ` Barry Song
@ 2026-03-05  7:31         ` Gregory Price
  0 siblings, 0 replies; 26+ messages in thread
From: Gregory Price @ 2026-03-05  7:31 UTC (permalink / raw)
  To: Barry Song
  Cc: willy, axelrasmussen, linux-mm, lsf-pc, ryncsn, weixugc, yuanchu

On Thu, Mar 05, 2026 at 02:27:27PM +0800, Barry Song wrote:

... Just trimming the earlier text, promise I read everything ...

> 1. lru_gen_look_around — exploits spatial locality by scanning
> adjacent PTEs of a young PTE.
> 
> 2. page table walks for aging — further exploit spatial locality.
> The aging path prefers walking page tables to look for young PTEs
> and promote hot pages.
> 
> 3. fallback to the other type when one type has only two generations.
> 
> 4. very aggressively promote mapped folios.
> 
> 5. min_ttl_ms - thrashing prevention
> 
> 6. gen, tier, bloom filter
> 
> 7. Missing shrink_active_list() — the function to demote folios from
> active to inactive.
> 
> 8. swappiness concept difference.
> 
> Considering all of the above, I feel MGLRU is quite different from
> active/inactive. Trying to unify them seems like merging two
> completely different approaches. Still, active/inactive might have
> some useful lessons to learn from MGLRU, particularly on how to
> reduce reclamation cost.
> 

I will preface this with: I'm not arguing to rip out MGLRU, but I do
want to take account of what I spent the last week digging through.
(and no, none of this is AI-written)

=======

Your list here is more or less the same as the one I came up with - and I
poked at bolting some of these onto the original LRU in the trivial sense.

I tried re-using the code on LRU with minimal modifications just to
see how it affected some really degenerate high-pressure scenarios.

I mostly found these features did nothing, too much, or straight up
caused LRU to fall over dead where it didn't before.

PTE scans
=======
PTE scans and look-around are powerful, but possibly TOO powerful;
that's why there's the bloom filter and the PID controller to prevent
MGLRU from over-correcting and saving more folios than it should.

As you point out, you also burn many more CPU cycles this way, just
not in the critical path.  So if you have a core to burn it can be
fine, if you don't, then scanning might hurt more than help.

I do think there's merit to this approach and could be adopted into
LRU as an option.  It does however *greatly* bias towards saving Anon
over Page Cache - and so that can be undesirable.

The PID controller
=======
The existence of the PID really suggests the whole mechanism is a bit
too over-engineered.  You put PIDs in things to dampen corrective
actions to keep towards a steady goal.

Requiring a PID doesn't inspire confidence that we can reason about
how tweaking a particular behavior of MGLRU will affect the rest of
the system.  In fact, it makes it difficult to know exactly what
effect you are having since there's built-in dynamism.

e.g.:
a) LRU  : folio_mark_accessed() -> promote if already referenced
b) MGLRU: folio_mark_accessed() -> increment a counter

What behavior do we change if we increment by +2 instead of +1?
Hard to know.
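
A toy userspace sketch of the two policies, just to show the contrast. The
toy_* names, the struct, and the saturating cap are made up for
illustration; this is not the kernel implementation of either path.

```c
#include <stdbool.h>

/* Hypothetical folio state covering both policies. */
struct toy_folio {
	bool referenced;	/* classic LRU: seen once already */
	bool active;		/* classic LRU: promoted to the active list */
	unsigned int refs;	/* MGLRU-like: access counter */
};

/* (a) Classic LRU: the first access marks, the second access promotes. */
static void toy_lru_mark_accessed(struct toy_folio *f)
{
	if (!f->referenced) {
		f->referenced = true;
	} else if (!f->active) {
		f->active = true;
		f->referenced = false;
	}
}

/* (b) MGLRU-like: only bump a counter by 'step', saturating at 'cap'. */
static void toy_mglru_mark_accessed(struct toy_folio *f, unsigned int step,
				    unsigned int cap)
{
	unsigned int next = f->refs + step;

	f->refs = next > cap ? cap : next;
}
```

In (a) the effect of an access is visible right away as a list move. In (b)
changing the step from 1 to 2 only changes a number, and what that number
means depends on whatever consumes it later, which is the part that is hard
to reason about.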

thrashing protection, bloom, intra-generation tiers, etc
=======
Many of these features appear to solve problems MGLRU invents.

Simpler is *generally* (but not always) better for reliability.
The PID is another example, but I put that in its own class.

Aging direction
=======
The fundamental difference in aging direction makes LRU/MGLRU
infeasible to collapse.  At best you could pull SOME features into
LRU, but some features ONLY work because the aging differs so much.

example: Bolting generations onto LRU makes it unstable because you
can starve the oldest generation trivially during bursts.

So we've started by making LRU worse, and then setting off to solve
the problem we've created.

You can sort of see how MGLRU got developed naturally:

a) we want multiple generations
b) what do we do when the oldest generation is empty?
c) we can either cascade to the next generation and reclaim there, or
   we can get fancy and start to treat aging as a sliding window

The engineering decisions all become pretty straightforward from there,
but you've started by creating a problem to solve.

=======

In my gut, MGLRU is trying to bolt hotness monitoring onto a coldness
tracking mechanism.  It's ok if these problems require different systems
to solve efficiently/elegantly - they may in fact demand it.

But reiterating - I'm not of the snap opinion that it should be ripped
out, but I do think MGLRU's feature list raises more eyebrows than it
solves problems (for users; it certainly solves some of its own problems).

> >
> > In a random test over the weekend where I turned everything but
> > multiple generations off (no page table scan, no bloom filter, etc -
> > MGLRU just defaults to a multi-gen FIFO) I found that streaming
> > workloads did better this way.
> 
...
> Perhaps an eBPF-programmable LRU could also be a direction
> worth exploring. We could set different eBPF programs for
> different workloads? There is a project in this area:
> 
> https://dl.acm.org/doi/pdf/10.1145/3731569.3764820
> https://github.com/cache-ext/cache_ext
> 

I'm certain the eBPF folks would love this :P.

Though there's always the question of where your hook points are, and I
would question whether this scales, but certainly it's a cool idea.

~Gregory


* Re: [LSF/MM/BPF] Improving MGLRU
  2026-03-03  4:06     ` Barry Song
@ 2026-03-05 17:13       ` David Stevens
  0 siblings, 0 replies; 26+ messages in thread
From: David Stevens @ 2026-03-05 17:13 UTC (permalink / raw)
  To: Barry Song
  Cc: Kairui Song, David Rientjes, axelrasmussen, linux-mm, lsf-pc,
	weixugc, yuanchu

On Tue, Mar 03, 2026 at 12:06:20PM +0800, Barry Song wrote:
> On Mon, Mar 2, 2026 at 7:10 PM Kairui Song <ryncsn@gmail.com> wrote:
> > On Fri, Feb 27, 2026 at 11:30 AM Barry Song <21cnbao@gmail.com> wrote:
> > We have an internal workaround that forces aging and waits for sync
> > aging if one type of folios are not reclaimable (without the wait, we
> > still hit the MIN_NR_GEN protect again since aging is not finished).
> > And without the MIN_NR_GEN protection we might end up over reclaiming
> > without aging.
> 
> I see your point—I did exactly the same thing in Android.
> However, there’s a significant problem. If anon has two
> generations and files have four, they end up sharing
> generations. To age anon, we would also need to move file
> folios between generations; otherwise, the hottest and
> oldest generations would overlap, causing cold/hot
> inversion. Furthermore, in inc_min_seq(), moving folios
> means the oldest generation gets pushed into the second-
> oldest generation:
> 
> new_gen = folio_inc_gen(lruvec, folio, false);
> list_move_tail(&folio->lru, &lrugen->folios[new_gen][type][zone]);
> 
> This is far from ideal, as it still mixes cold and hot pages
> to some extent. Could we keep anon and file generations
> separate instead? I feel this is a strong requirement and
> likely the first step toward making swappiness work properly.

I did some experiments with splitting anon and file generations on
ChromeOS a bit over a year ago and got fairly positive results, although
shifting personal and team priorities unfortunately prevented me from
finishing the project. It didn't get to the point where I was confident
enough in it to send it upstream, but I think it's still worthwhile to
mention here.

For a bit of background, due to a combination of low-spec hardware and a
complicated file system structure, many Chromebooks have poor file I/O
performance. We found that a fairly significant contributor to jank was
important threads being blocked on .text faults. We tried adjusting
swappiness but found that MGLRU was unresponsive to such tuning.
Splitting anon and file into separate generations and then setting a
fairly high swappiness value resulted in meaningful jank reductions,
especially under very high memory pressure. However, there were
regressions to some workloads that I did not get a chance to try to
address. 

If anyone is interested in seeing the code, it is here [1] in ChromeOS's
6.1 kernel. I also have a WIP series posted to Android's 6.12 GKI kernel
[2], but it hasn't been merged there. This version fixes some issues in
my original series that were exposed by running on a very different
system, in particular around memcgs. However, it hasn't been tested as
extensively.

[1] https://chromium.googlesource.com/chromiumos/third_party/kernel/+log/929932351492d01f0aee37a0ac3be8c7bd88f80d
[2] https://android.googlesource.com/kernel/common/+log/49cd20cfb0da2361be064f7fd70b36867065277a


* Re: [LSF/MM/BPF] Improving MGLRU
  2026-02-19 17:09 [LSF/MM/BPF] " Kairui Song
@ 2026-02-24 17:19 ` Suren Baghdasaryan
  0 siblings, 0 replies; 26+ messages in thread
From: Suren Baghdasaryan @ 2026-02-24 17:19 UTC (permalink / raw)
  To: Kairui Song; +Cc: lsf-pc, Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-mm

On Thu, Feb 19, 2026 at 9:10 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> Hi All,
>
> MGLRU has been introduced in the mainline for years, but we still have two LRUs
> today. There are many reasons MGLRU is still not the only LRU implementation in
> the kernel.
>
> And I've been looking at a few major issues here:
>
> 1. Page flag usage: MGLRU uses many more flags (3+ more) than Active/Inactive
> LRU.
> 2. Regressions: MGLRU might cause regression, even though in many workloads it
> outperforms Active/Inactive by a lot.
> 3. Metrics: MGLRU makes some metrics work differently, for example: PSI,
> /proc/meminfo, smap.
> 4. Some reclaim behavior is less controllable.
>
> And other issues too.
> And I think there isn't a simple solution, but it can definitely be solved. I
> would like to propose a session to discuss a few ideas on how to solve this, and
> perhaps we can finally only have one LRU in the kernel. So I'd like to propose a
> session to discuss some ideas about improving MGLRU and making it the only LRU.
>
> Some parts are just ideas, so far I have a working series [2] following the
> LFU and metric unification idea below, solving 2) and 3) above, and
> providing some very basic infrastructures for 1). Would try to send that as
> RFC for easier review and merge once it's stable enough soon, before LSF/MM/BPF.
>
> So far, I already observed a 30% reduction of refault of total folios in
> some workloads, including Tpcc and YCSB, and several critical regressions
> compared to Active / Inactive are gone, PG_workingset and PG_referenced are
> gone, yet things like PSI are more accurate (see below), and still stay
> bitwise compatible with Active / Inactive LRU. If it went smoothly,
> we might be able to unify and have only one LRU.
>
> Following topic and ideas are the key points:
>
> 1. Flags usage: which is solvable, and the hard part is mostly about
>    implementation details: MGLRU uses (at least) 3 extra flags for the gen
>    number, and we are expecting it to use more gen flags to support more than 4
>    gen. These flags can be moved to the tail of the LRU pointer after carefully
>    modifying the kernel's convention on LRU operations. That would allow us to
>    use up to 6 bits for the gen number and support up to 63 gens. The lower bit
>    of both pointers can be packed together for CAS on gen numbers. Reducing
>    flag usage by 3. Previously, Yu also suggested moving flags like PG_active to
>    the LRU pointer tail, which could also be a way.
>
>    struct folio {
>        /* ... */
>        union {
>                struct list_head lru;
>    +           struct lru_gen_list_head lru_gen;
>
>    So whenever the folio is on lruvec, `lru_gen_list_head` is used instead of
>    `lru`, which contains encoded info. We might be able to move all LRU-related
>    flags there.
>
>    Ordinary folio lists are still just fine, since when the folio is isolated,
>    `lru` is still there. But places like folio split will need to check
>    if that's a lruvec folio, or folio on an ordinary list.
>
>    This part is just an idea yet. But might make us able to have up to 63 gens
>    in upstream and enable build for every config.
>
> 2. Regressions: Currently regression is a more major problem for us.
>    From our perspective, almost all regressions are caused by an under- or
>    overprotected file cache. MGLRU's PID protection either gets too aggressive
>    or too passive or just have a too long latency. To fix that, I'd propose a
>    LFU-like design and relax the PID's aggressiveness to make it much more
>    proactive and effective for file folios. The idea is always use 3 bits in
>    the page flags to count the referenced time (which would also replace
>    PG_workingset and PG_referenced). Initial tests showed a 30% reduction of
>    refaults, and many regressions are gone. A flow chart of how the MGLRU idea
>    might work:
>
>    ========== MGLFU Tiering ==========
>    Access  3 bit    lru_gen  lru_gen |(R - PG_referenced | W - PG_workingset)
>    Count   L|W|R    refs     tier    |(L - LRU_GEN_REFS)
>    0       0|0|0    0        0       | - Readahead & Cache
>    1       0|0|1    1        0       | - LRU_REFS_REFERENCED
>    ----- WORKINGSET / PROMOTE --- <--+ - <move out of min_seq>
>    2       0|1|0    2        0       | - LRU_REFS_WORKINGSET
>    3       0|1|1    3        1       | - Frequently used
>    4       1|0|0*   4        2       |
>    5       1|0|1*   5        2       |
>    6       1|1|0*   6        3       |
>    7       1|1|1*   7        3       | - LRU_REFS_MAX
>    ---------- PROMOTION ----------> --+ - <promote to next gen>
>
>    Once a folio has an access count > LRU_REFS_WORKINGSET, it never goes lower
>    than that. Folios that hit LRU_REFS_MAX will be promoted to next gen on
>    access, and remove the force protection of folios on eviction. This provides
>    a more proactive protection.
>
>    And this might also give other frameworks like DAMON a nicer interface to
>    interact with MGLRU, since the referenced count can promote every folio and
>    count accesses in a more reasonable and unified way for MGLRU now.
>
>    NOTE: Still changing this design according to test results, e.g. maybe
>    we should optionally still use 4 bits, so the final solution might not
>    be the same.
>
>    Another potential improvement on the regression issue is implementing the
>    refault distance as I once proposed [1], which can have a huge gain for some
>    workloads with heavy file folio usage. Maybe we can have both.
>
> 3. Metrics: The key here is about the meaning of page flags, including
>    PG_workingset and PG_referenced. These two flags are set/cleared very
>    differently for MGLRU compared to Active / Inactive LRU, but many other
>    components are still using them as metrics for Active / Inactive LRU. Hence,
>    I would propose to introduce a different mechanism to unify and replace these
>    two flags: Using the 3 bits in the page flags field reserved for LFU-like
>    tracking above, to determine the folio status.
>
>    Then following the above LFU-like idea, and using helpers like:
>
>    static inline bool folio_is_referenced(const struct folio *folio)
>    {
>     return folio_lru_refs(folio) >= LRU_REFS_REFERENCED;
>    }
>
>    static inline bool folio_is_workingset(const struct folio *folio)
>    {
>     return folio_lru_refs(folio) >= LRU_REFS_WORKINGSET;
>    }
>
>    static inline bool folio_is_referenced_by_bit(struct folio *folio)
>    {    /* For compatibility */
>     return !!(READ_ONCE(*folio_flags(folio, 0)) & BIT(LRU_REFS_PGOFF));
>    }
>
>    static inline void folio_mark_workingset_by_bit(struct folio *folio)
>    {    /* For compatibility */
>     set_mask_bits(folio_flags(folio, 0), BIT(LRU_REFS_PGOFF + 1),
>                   BIT(LRU_REFS_PGOFF + 1));
>    }
>
>    To tell if a folio belongs to a working set or is referenced. The definition
>    of workingset will be simplified as follows: a set referenced more than twice
>    for MGLRU, and decoupled from MGLRU's tiering.
>
> 4. MGLRU's swappiness is kind of useless in some situations compared to
>    Active / Inactive LRU, since its force protects the youngest two gen, so
>    quite often we can only reclaim one type of folios. To workaround that, the
>    user usually runs force aging before reclaim. So, can we just remove the
>    force protection of the youngest two gens?
>
> 5. Async aging and aging optimization are also required to make the above ideas
>    work better.
>
> 6. Other issues and discussion on whether the above improvements will help
>    solve them or make them worse. e.g.
>
>    For eBPF extension, using eBPF to determine which gen a folio should be
>    landed given the shadow and after we have more than 4 gens, might be very
>    helpful and enough for many workload customizations.
>
>    Can we just ignore the shadow for anon folios? MGLRU basically activates
>    anon folios unconditionally, especially if we combined with the LFU like
>    idea above we might only want to track the 3 bit count, and get rid of
>    the extra bit usage in the shadow. The eviction performance might be even
>    better, and other components like swap table [3] will have more bits to use
>    for better performance and more features.
>
> The goal is:
>
> - Reduce MGLRU's page flag usage to be identical or less compared to Active /
>   Inactive LRU.
> - Eliminate regressions.
> - Unify or improve the metrics.
> - Provides more extensibility.

There might be some overlap with this topic proposal:
https://lore.kernel.org/all/cb0c0a0bfc7247cf85858eecf0db6eca@honor.com/
but either way I'm interested in participating, especially on the
topics of regressions and reclaim behavior as it's very relevant for
Android.

>
> Link: https://lwn.net/Articles/945266/ [1]
> Link: https://github.com/ryncsn/linux/tree/improving-mglru [2]
> Link: https://lore.kernel.org/linux-mm/20260218-swap-table-p3-v3-5-f4e34be021a7@tencent.com/ [3]
>



* [LSF/MM/BPF] Improving MGLRU
@ 2026-02-19 17:09 Kairui Song
  2026-02-24 17:19 ` Suren Baghdasaryan
  0 siblings, 1 reply; 26+ messages in thread
From: Kairui Song @ 2026-02-19 17:09 UTC (permalink / raw)
  To: lsf-pc, Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-mm

Hi All,

MGLRU has been introduced in the mainline for years, but we still have two LRUs
today. There are many reasons MGLRU is still not the only LRU implementation in
the kernel.

And I've been looking at a few major issues here:

1. Page flag usage: MGLRU uses many more flags (3+ more) than Active/Inactive
LRU.
2. Regressions: MGLRU might cause regression, even though in many workloads it
outperforms Active/Inactive by a lot.
3. Metrics: MGLRU makes some metrics work differently, for example: PSI,
/proc/meminfo, smap.
4. Some reclaim behavior is less controllable.

There are other issues too. I don't think there is a simple solution, but the
problems are definitely solvable. I would like to propose a session to discuss
some ideas for improving MGLRU, so that perhaps we can finally have only one LRU
in the kernel.

Some parts are just ideas. So far I have a working series [2] that follows the
LFU and metric unification ideas below, solves 2) and 3) above, and provides
some very basic infrastructure for 1). I will try to send it as an RFC, for
easier review and merging, once it is stable enough, before LSF/MM/BPF.

So far, I have observed a 30% reduction in total folio refaults in some
workloads, including Tpcc and YCSB. Several critical regressions compared to
Active / Inactive are gone, PG_workingset and PG_referenced are gone, things
like PSI are more accurate (see below), and the result still stays bitwise
compatible with Active / Inactive LRU. If this goes smoothly, we might be able
to unify and have only one LRU.

The following topics and ideas are the key points:

1. Flags usage: this is solvable, and the hard part is mostly about
   implementation details. MGLRU uses (at least) 3 extra flags for the gen
   number, and we expect it to use more gen flags to support more than 4
   gens. These flags can be moved to the tail of the LRU pointers after
   carefully modifying the kernel's convention on LRU operations. That would
   allow us to use up to 6 bits for the gen number and support up to 63 gens.
   The lower bits of both pointers can be packed together for CAS on gen
   numbers, reducing flag usage by 3. Previously, Yu also suggested moving flags
   like PG_active to the LRU pointer tail, which could also be an option.

   struct folio {
       /* ... */
       union {
               struct list_head lru;
   +           struct lru_gen_list_head lru_gen;

   So whenever the folio is on a lruvec, `lru_gen_list_head`, which contains the
   encoded info, is used instead of `lru`. We might be able to move all
   LRU-related flags there.

   Ordinary folio lists are still just fine, since when the folio is isolated,
   `lru` is still there. But places like folio split will need to check whether
   it is a lruvec folio or a folio on an ordinary list.

   This part is still just an idea, but it might let us have up to 63 gens
   upstream and enable the build for every config.
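
   A minimal userspace sketch of the pointer-tail idea in item 1, assuming
   64-bit pointers and list heads aligned to 8 bytes, so that each pointer has
   3 spare low bits. The field layout, the helper names (`lru_gen_set`,
   `lru_gen_get`, `lru_gen_next`, `lru_gen_prev`, `lru_gen_link`) and the way
   the gen number is split across `next` and `prev` are illustrative
   assumptions, not the layout used by the series in [2]. The updates here are
   also plain stores, whereas the proposal packs the bits for CAS.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoding: 3 low bits of each pointer hold half of the gen. */
#define GEN_BITS_PER_PTR 3u
#define GEN_PTR_MASK ((uintptr_t)((1u << GEN_BITS_PER_PTR) - 1))

struct lru_gen_list_head {
	_Alignas(8) uintptr_t next; /* pointer bits | gen bits 0..2 */
	uintptr_t prev;             /* pointer bits | gen bits 3..5 */
};

/* Store a 6-bit gen number without disturbing the pointer bits. */
static inline void lru_gen_set(struct lru_gen_list_head *h, unsigned int gen)
{
	assert(gen < 64);
	h->next = (h->next & ~GEN_PTR_MASK) | (gen & GEN_PTR_MASK);
	h->prev = (h->prev & ~GEN_PTR_MASK) |
		  ((gen >> GEN_BITS_PER_PTR) & GEN_PTR_MASK);
}

/* Reassemble the gen number from the two pointer tails. */
static inline unsigned int lru_gen_get(const struct lru_gen_list_head *h)
{
	return (unsigned int)((h->next & GEN_PTR_MASK) |
			      ((h->prev & GEN_PTR_MASK) << GEN_BITS_PER_PTR));
}

/* Every pointer read has to mask off the tail bits first. */
static inline struct lru_gen_list_head *
lru_gen_next(const struct lru_gen_list_head *h)
{
	return (struct lru_gen_list_head *)(h->next & ~GEN_PTR_MASK);
}

static inline struct lru_gen_list_head *
lru_gen_prev(const struct lru_gen_list_head *h)
{
	return (struct lru_gen_list_head *)(h->prev & ~GEN_PTR_MASK);
}

/* Link a -> b while preserving the tail bits already stored in each node. */
static inline void lru_gen_link(struct lru_gen_list_head *a,
				struct lru_gen_list_head *b)
{
	assert(((uintptr_t)a & GEN_PTR_MASK) == 0);
	assert(((uintptr_t)b & GEN_PTR_MASK) == 0);
	a->next = (uintptr_t)b | (a->next & GEN_PTR_MASK);
	b->prev = (uintptr_t)a | (b->prev & GEN_PTR_MASK);
}
```

   The masking in `lru_gen_next()` and `lru_gen_prev()` is the reason the
   kernel's conventions on LRU operations would have to change: generic list
   helpers that dereference `next` or `prev` directly would see the tail bits.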

2. Regressions: currently, regressions are the bigger problem for us.
   From our perspective, almost all regressions are caused by an under- or
   overprotected file cache. MGLRU's PID protection either gets too aggressive
   or too passive, or just has too long a latency. To fix that, I'd propose an
   LFU-like design and relax the PID's aggressiveness to make it much more
   proactive and effective for file folios. The idea is to always use 3 bits in
   the page flags to count references (which would also replace
   PG_workingset and PG_referenced). Initial tests showed a 30% reduction in
   refaults, and many regressions are gone. A flow chart of how the MGLRU idea
   might work:

   ========== MGLFU Tiering ==========
   Access  3 bit    lru_gen  lru_gen |(R - PG_referenced | W - PG_workingset)
   Count   L|W|R    refs     tier    |(L - LRU_GEN_REFS)
   0       0|0|0    0        0       | - Readahead & Cache
   1       0|0|1    1        0       | - LRU_REFS_REFERENCED
   ----- WORKINGSET / PROMOTE --- <--+ - <move out of min_seq>
   2       0|1|0    2        0       | - LRU_REFS_WORKINGSET
   3       0|1|1    3        1       | - Frequently used
   4       1|0|0*   4        2       |
   5       1|0|1*   5        2       |
   6       1|1|0*   6        3       |
   7       1|1|1*   7        3       | - LRU_REFS_MAX
   ---------- PROMOTION ----------> --+ - <promote to next gen>

   Once a folio has an access count > LRU_REFS_WORKINGSET, it never goes lower
   than that. Folios that hit LRU_REFS_MAX will be promoted to the next gen on
   access, and the force protection of folios on eviction is removed. This
   provides more proactive protection.

   And this might also give other frameworks like DAMON a nicer interface to
   interact with MGLRU, since the referenced count can promote every folio and
   count accesses in a more reasonable and unified way for MGLRU now.

   NOTE: Still changing this design according to test results, e.g. maybe
   we should optionally still use 4 bits, so the final solution might not
   be the same.

   Another potential improvement on the regression issue is implementing the
   refault distance as I once proposed [1], which can have a huge gain for some
   workloads with heavy file folio usage. Maybe we can have both.
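
   The tiering table in item 2 can be read as a lookup from the 3-bit access
   count to a tier, plus a saturating counter. The sketch below is only a
   restatement of that table in userspace C: the constant values are read off
   the table, `lru_refs_on_access` is a hypothetical helper name, and leaving
   the count unchanged after a promotion is an assumption, since the proposal
   does not say what happens to it.

```c
#include <assert.h>

/* Values taken from the "MGLFU Tiering" table; names follow the table. */
enum {
	LRU_REFS_REFERENCED = 1,
	LRU_REFS_WORKINGSET = 2,
	LRU_REFS_MAX = 7,
};

/* Access count (0..7) to tier, row by row from the table. */
static const unsigned char tier_of_refs[LRU_REFS_MAX + 1] = {
	0, 0, 0, 1, 2, 2, 3, 3,
};

static inline unsigned int lru_tier_from_refs(unsigned int refs)
{
	assert(refs <= LRU_REFS_MAX);
	return tier_of_refs[refs];
}

/*
 * Hypothetical access hook: bump the saturating counter and report whether
 * the folio was already at LRU_REFS_MAX, i.e. should move to the next gen.
 * Assumption: the count is left at LRU_REFS_MAX after a promotion.
 */
static inline unsigned int lru_refs_on_access(unsigned int refs, int *promote)
{
	assert(refs <= LRU_REFS_MAX);
	*promote = (refs == LRU_REFS_MAX);
	return refs < LRU_REFS_MAX ? refs + 1 : refs;
}
```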

3. Metrics: The key here is about the meaning of page flags, including
   PG_workingset and PG_referenced. These two flags are set/cleared very
   differently for MGLRU compared to Active / Inactive LRU, but many other
   components are still using them as metrics for Active / Inactive LRU. Hence,
   I would propose to introduce a different mechanism to unify and replace these
   two flags: Using the 3 bits in the page flags field reserved for LFU-like
   tracking above, to determine the folio status.

   Then following the above LFU-like idea, and using helpers like:

   static inline bool folio_is_referenced(const struct folio *folio)
   {
    return folio_lru_refs(folio) >= LRU_REFS_REFERENCED;
   }

   static inline bool folio_is_workingset(const struct folio *folio)
   {
    return folio_lru_refs(folio) >= LRU_REFS_WORKINGSET;
   }

   static inline bool folio_is_referenced_by_bit(struct folio *folio)
   {    /* For compatibility */
    return !!(READ_ONCE(*folio_flags(folio, 0)) & BIT(LRU_REFS_PGOFF));
   }

   static inline void folio_mark_workingset_by_bit(struct folio *folio)
   {    /* For compatibility */
    set_mask_bits(folio_flags(folio, 0), BIT(LRU_REFS_PGOFF + 1),
                  BIT(LRU_REFS_PGOFF + 1));
   }

   These helpers tell whether a folio belongs to a working set or is referenced.
   The definition of workingset will be simplified as follows: a set referenced
   more than twice for MGLRU, and decoupled from MGLRU's tiering.
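
   To make the helpers in item 3 concrete, here is a toy userspace model of
   them, assuming the 3-bit count lives at a made-up offset in a single flags
   word. `struct folio` here is a stand-in with one field, the value of
   `LRU_REFS_PGOFF` is arbitrary, `folio_set_lru_refs` is a hypothetical
   helper added for the example, and the kernel's `folio_flags()` and
   `set_mask_bits()` are replaced by plain, non-atomic accesses.

```c
#include <assert.h>
#include <stdbool.h>

#define LRU_REFS_PGOFF 8u /* arbitrary offset, for illustration only */
#define LRU_REFS_WIDTH 3u
#define LRU_REFS_MASK  (((1UL << LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)

enum { LRU_REFS_REFERENCED = 1, LRU_REFS_WORKINGSET = 2, LRU_REFS_MAX = 7 };

struct folio { unsigned long flags; }; /* toy stand-in, not the kernel struct */

/* Extract the 3-bit access count from the flags word. */
static inline unsigned int folio_lru_refs(const struct folio *folio)
{
	return (unsigned int)((folio->flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF);
}

/* Hypothetical setter used only to drive the example. */
static inline void folio_set_lru_refs(struct folio *folio, unsigned int refs)
{
	assert(refs <= LRU_REFS_MAX);
	folio->flags = (folio->flags & ~LRU_REFS_MASK) |
		       ((unsigned long)refs << LRU_REFS_PGOFF);
}

static inline bool folio_is_referenced(const struct folio *folio)
{
	return folio_lru_refs(folio) >= LRU_REFS_REFERENCED;
}

static inline bool folio_is_workingset(const struct folio *folio)
{
	return folio_lru_refs(folio) >= LRU_REFS_WORKINGSET;
}

/* Compatibility view: test only the R bit (bit 0 of the count). */
static inline bool folio_is_referenced_by_bit(const struct folio *folio)
{
	return !!(folio->flags & (1UL << LRU_REFS_PGOFF));
}

/* Compatibility view: set only the W bit (bit 1 of the count). */
static inline void folio_mark_workingset_by_bit(struct folio *folio)
{
	folio->flags |= 1UL << (LRU_REFS_PGOFF + 1);
}
```

   In this model a count of 2 (L|W|R = 0|1|0) is referenced according to the
   count-based helper but not according to the bit-based one, because the R bit
   is clear. Callers of the compatibility helpers would need to be aware of
   that difference.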

4. MGLRU's swappiness is kind of useless in some situations compared to
   Active / Inactive LRU, since it force-protects the youngest two gens, so
   quite often we can only reclaim one type of folio. To work around that, the
   user usually runs force aging before reclaim. So, can we just remove the
   force protection of the youngest two gens?

5. Async aging and aging optimization are also required to make the above ideas
   work better.

6. Other issues and discussion on whether the above improvements will help
   solve them or make them worse. e.g.

   For eBPF extension: once we have more than 4 gens, using eBPF to determine
   which gen a folio should land in, given the shadow, might be very helpful
   and enough for many workload customizations.

   Can we just ignore the shadow for anon folios? MGLRU basically activates
   anon folios unconditionally. Especially when combined with the LFU-like
   idea above, we might only want to track the 3-bit count and get rid of
   the extra bit usage in the shadow. The eviction performance might be even
   better, and other components like swap table [3] will have more bits to use
   for better performance and more features.

The goal is:

- Reduce MGLRU's page flag usage to be identical or less compared to Active /
  Inactive LRU.
- Eliminate regressions.
- Unify or improve the metrics.
- Provide more extensibility.

Link: https://lwn.net/Articles/945266/ [1]
Link: https://github.com/ryncsn/linux/tree/improving-mglru [2]
Link: https://lore.kernel.org/linux-mm/20260218-swap-table-p3-v3-5-f4e34be021a7@tencent.com/ [3]


end of thread, other threads:[~2026-03-05 17:13 UTC | newest]

Thread overview: 26+ messages
2026-02-19 17:25 [LSF/MM/BPF TOPIC] Improving MGLRU Kairui Song
2026-02-20 18:24 ` Johannes Weiner
2026-02-21  6:03   ` Kairui Song
2026-02-26  1:55 ` Kalesh Singh
2026-02-26  3:06   ` Kairui Song
2026-02-26 10:10     ` wangzicheng
2026-02-26 15:54 ` Matthew Wilcox
2026-02-27  4:31   ` [LSF/MM/BPF] " Barry Song
2026-03-02 17:46     ` Gregory Price
2026-03-05  6:27       ` Barry Song
2026-03-05  7:31         ` Gregory Price
2026-02-27 17:55   ` [LSF/MM/BPF TOPIC] " Shakeel Butt
2026-02-27 18:50     ` Gregory Price
2026-03-03  1:31     ` Axel Rasmussen
2026-03-03 13:39       ` Shakeel Butt
2026-03-05  6:46         ` Chen Ridong
2026-03-03  1:30   ` Axel Rasmussen
2026-02-27  3:30 ` [LSF/MM/BPF] " Barry Song
2026-03-02 11:10   ` Kairui Song
2026-03-03  4:06     ` Barry Song
2026-03-05 17:13       ` David Stevens
2026-02-27  7:11 ` [LSF/MM/BPF TOPIC] " David Rientjes
2026-02-27 10:29 ` Vernon Yang
2026-03-02 12:17   ` Kairui Song
  -- strict thread matches above, loose matches on Subject: below --
2026-02-19 17:09 [LSF/MM/BPF] " Kairui Song
2026-02-24 17:19 ` Suren Baghdasaryan
