[LSF/MM/BPF TOPIC] Improving MGLRU

linux-mm.kvack.org archive mirror

* [LSF/MM/BPF TOPIC] Improving MGLRU
@ 2026-02-19 17:25 Kairui Song
  2026-02-20 18:24 ` Johannes Weiner
  2026-02-26  1:55 ` Kalesh Singh
  0 siblings, 2 replies; 6+ messages in thread
From: Kairui Song @ 2026-02-19 17:25 UTC
  To: lsf-pc, Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-mm

Hi All,

Apologies, I forgot to add the proper tag in the previous email, so I am
resending this.

MGLRU has been in mainline for years, but we still have two LRUs
today. There are many reasons why MGLRU is still not the only LRU implementation
in the kernel.

I've been looking at a few major issues here:

1. Page flag usage: MGLRU uses many more flags (3+ more) than Active/Inactive
LRU.
2. Regressions: MGLRU can cause regressions, even though in many workloads it
outperforms Active/Inactive by a lot.
3. Metrics: MGLRU makes some metrics work differently, for example: PSI,
/proc/meminfo.
4. Some reclaim behavior is less controllable.

There are other issues too. I don't think there is a simple solution, but these
problems can definitely be solved. I would like to propose a session to discuss
a few ideas about improving MGLRU and making it the only LRU, so that perhaps
we can finally have just one LRU in the kernel.

Some parts are just ideas. So far I have a working series [2] that follows the
LFU and metric unification ideas below, addresses 2) and 3) above, and
provides some very basic infrastructure for 1). I will try to send it as an
RFC for easier review and merging once it is stable enough, before LSF/MM/BPF.

So far, I have observed a 30% reduction in total folio refaults in
some workloads, including TPC-C and YCSB. Several critical regressions
compared to Active / Inactive are gone, PG_workingset and PG_referenced are
gone, things like PSI are more accurate (see below), and the result still stays
bitwise compatible with Active / Inactive LRU. If this goes smoothly,
we might be able to unify and have only one LRU.

The following topics and ideas are the key points:

1. Flag usage: this is solvable, and the hard part is mostly the
   implementation details. MGLRU uses (at least) 3 extra flags for the gen
   number, and we expect it to need more gen flags to support more than 4
   gens. These flags can be moved to the tail of the LRU pointers after carefully
   modifying the kernel's conventions for LRU operations. That would allow us to
   use up to 6 bits for the gen number and support up to 63 gens. The lower bits
   of both pointers can be packed together for CAS on gen numbers, reducing
   flag usage by 3. Previously, Yu also suggested moving flags like PG_active to
   the LRU pointer tail, which could also be an option.

   struct folio {
       /* ... */
       union {
               struct list_head lru;
   +           struct lru_gen_list_head lru_gen;

   So whenever the folio is on a lruvec, `lru_gen_list_head`, which contains the
   encoded info, is used instead of `lru`. We might be able to move all
   LRU-related flags there.

   Ordinary folio lists are still fine, since `lru` is still there when the
   folio is isolated. But places like folio split will need to check whether
   that's a lruvec folio or a folio on an ordinary list.

   This part is only an idea so far, but it might let us have up to 63 gens
   upstream and enable the build for every config.
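
   A minimal userspace sketch of the pointer-tail idea, to make the bit budget
   concrete. It assumes list nodes are 8-byte aligned, so each of the two
   pointers has 3 spare low bits, giving 6 bits in total. The `_model` names
   and the exact split (low half of the gen in `next`, high half in `prev`)
   are hypothetical and not taken from the series [2]; the CAS mentioned above
   is not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: nodes are 8-byte aligned, so each pointer has 3 spare low bits. */
#define GEN_BITS_PER_PTR 3u
#define GEN_PTR_MASK     ((uintptr_t)((1u << GEN_BITS_PER_PTR) - 1u))
#define MODEL_MAX_GEN    ((1u << (2u * GEN_BITS_PER_PTR)) - 1u) /* largest 6-bit value */

/* Hypothetical stand-in for the proposed struct lru_gen_list_head. */
struct lru_gen_list_head_model {
    _Alignas(8) uintptr_t next; /* pointer bits | low 3 bits of the gen value */
    uintptr_t prev;             /* pointer bits | high 3 bits of the gen value */
};

/* Store both links while preserving the gen bits already kept in the tails. */
static void model_set_links(struct lru_gen_list_head_model *h,
                            struct lru_gen_list_head_model *next,
                            struct lru_gen_list_head_model *prev)
{
    assert(((uintptr_t)next & GEN_PTR_MASK) == 0);
    assert(((uintptr_t)prev & GEN_PTR_MASK) == 0);
    h->next = (uintptr_t)next | (h->next & GEN_PTR_MASK);
    h->prev = (uintptr_t)prev | (h->prev & GEN_PTR_MASK);
}

/* Strip the tail bits to recover the real pointers. */
static struct lru_gen_list_head_model *
model_next(const struct lru_gen_list_head_model *h)
{
    return (struct lru_gen_list_head_model *)(h->next & ~GEN_PTR_MASK);
}

static struct lru_gen_list_head_model *
model_prev(const struct lru_gen_list_head_model *h)
{
    return (struct lru_gen_list_head_model *)(h->prev & ~GEN_PTR_MASK);
}

/* Split a 6-bit value across the two pointer tails without touching the links. */
static void model_set_gen(struct lru_gen_list_head_model *h, unsigned int gen)
{
    assert(gen <= MODEL_MAX_GEN);
    h->next = (h->next & ~GEN_PTR_MASK) | (gen & GEN_PTR_MASK);
    h->prev = (h->prev & ~GEN_PTR_MASK) |
              ((gen >> GEN_BITS_PER_PTR) & GEN_PTR_MASK);
}

/* Reassemble the 6-bit value from both tails. */
static unsigned int model_gen(const struct lru_gen_list_head_model *h)
{
    return (unsigned int)((h->next & GEN_PTR_MASK) |
                          ((h->prev & GEN_PTR_MASK) << GEN_BITS_PER_PTR));
}
```

   The sketch also shows why ordinary list helpers cannot be used on such a
   node as-is: every pointer read has to mask the tail bits first.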

2. Regressions: currently, regressions are the bigger problem for us.
   From our perspective, almost all regressions are caused by an under- or
   overprotected file cache. MGLRU's PID protection is either too aggressive,
   too passive, or has too long a latency. To fix that, I'd propose an
   LFU-like design and relaxing the PID's aggressiveness, to make protection
   much more proactive and effective for file folios. The idea is to always use
   3 bits in the page flags to count references (which would also replace
   PG_workingset and PG_referenced). Initial tests showed a 30% reduction in
   refaults, and many regressions are gone. A chart of how the idea
   might work:

   ========== MGLFU Tiering ==========
   Access  3 bit    lru_gen  lru_gen |(R - PG_referenced | W - PG_workingset)
   Count   L|W|R    refs     tier    |(L - LRU_GEN_REFS)
   0       0|0|0    0        0       | - Readahead & Cache
   1       0|0|1    1        0       | - LRU_REFS_REFERENCED
   ----- WORKINGSET / PROMOTE --- <--+ - <move out of min_seq>
   2       0|1|0    2        0       | - LRU_REFS_WORKINGSET
   3       0|1|1    3        1       | - Frequently used
   4       1|0|0*   4        2       |
   5       1|0|1*   5        2       |
   6       1|1|0*   6        3       |
   7       1|1|1*   7        3       | - LRU_REFS_MAX
   ---------- PROMOTION ----------> --+ - <promote to next gen>

   Once a folio has an access count > LRU_REFS_WORKINGSET, its count never drops
   below that. Folios that hit LRU_REFS_MAX will be promoted to the next gen on
   access, and the force protection of folios on eviction is removed. This
   provides more proactive protection.

   This might also give other frameworks like DAMON a nicer interface to
   interact with MGLRU, since the reference count can now promote every folio
   and count accesses in a more reasonable and unified way for MGLRU.

   NOTE: I am still changing this design according to test results (e.g. maybe
   we should optionally still use 4 bits), so the final solution might not
   be the same.

   Another potential improvement on the regression issue is implementing the
   refault distance as I once proposed [1], which can have a huge gain for some
   workloads with heavy file folio usage. Maybe we can have both.
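
   The tiering chart in this item can be sketched as a small userspace model.
   The tier table is copied row by row from the chart; everything named
   `model_*` is hypothetical. Two points are assumptions rather than statements
   from the proposal: promotion is reported when a folio that is already at
   LRU_REFS_MAX is accessed again, and what happens to the count afterwards is
   left out. The rule that the count never drops below LRU_REFS_WORKINGSET is
   not modeled, because the message does not say how counts are lowered.

```c
#include <assert.h>
#include <stdbool.h>

/* Values taken from the chart; the names mirror its labels. */
#define LRU_REFS_REFERENCED 1u
#define LRU_REFS_WORKINGSET 2u
#define LRU_REFS_MAX        7u

/* Tier for each 3-bit access count, copied row by row from the chart. */
static const unsigned int model_tier_of_refs[LRU_REFS_MAX + 1u] = {
    0, 0, 0, 1, 2, 2, 3, 3,
};

static unsigned int model_lru_tier(unsigned int refs)
{
    assert(refs <= LRU_REFS_MAX);
    return model_tier_of_refs[refs];
}

/*
 * Record one access. The count saturates at LRU_REFS_MAX. Returns true when
 * the folio was already at the maximum, i.e. when (by assumption) it should
 * be promoted to the next gen.
 */
static bool model_record_access(unsigned int *refs)
{
    assert(*refs <= LRU_REFS_MAX);
    if (*refs == LRU_REFS_MAX)
        return true;
    (*refs)++;
    return false;
}
```

   Starting from 0, two accesses reach LRU_REFS_WORKINGSET while still in
   tier 0, the third access moves the folio to tier 1, and only the eighth
   access reports a promotion.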

3. Metrics: the key here is the meaning of page flags, including
   PG_workingset and PG_referenced. These two flags are set/cleared very
   differently for MGLRU compared to Active / Inactive LRU, but many other
   components still use them as metrics with Active / Inactive LRU semantics.
   Hence, I propose introducing a different mechanism to unify and replace
   these two flags: using the 3 bits in the page flags field reserved for the
   LFU-like tracking above to determine the folio status.

   Then, following the above LFU-like idea, we can use helpers like:

   /* True when the reference count is at least LRU_REFS_REFERENCED. */
   static inline bool folio_is_referenced(const struct folio *folio)
   {
           return folio_lru_refs(folio) >= LRU_REFS_REFERENCED;
   }

   /* True when the reference count is at least LRU_REFS_WORKINGSET. */
   static inline bool folio_is_workingset(const struct folio *folio)
   {
           return folio_lru_refs(folio) >= LRU_REFS_WORKINGSET;
   }

   /* For compatibility: test only the bit at LRU_REFS_PGOFF. */
   static inline bool folio_is_referenced_by_bit(struct folio *folio)
   {
           return !!(READ_ONCE(*folio_flags(folio, 0)) & BIT(LRU_REFS_PGOFF));
   }

   /* For compatibility: set only the bit at LRU_REFS_PGOFF + 1. */
   static inline void folio_mark_workingset_by_bit(struct folio *folio)
   {
           set_mask_bits(folio_flags(folio, 0), BIT(LRU_REFS_PGOFF + 1),
                         BIT(LRU_REFS_PGOFF + 1));
   }

   These helpers tell whether a folio belongs to a working set or is
   referenced. The definition of workingset will be simplified as follows: the
   set of folios referenced more than twice, for MGLRU, decoupled from MGLRU's
   tiering.
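
   To see how the count-based and bit-based helpers above relate, here is a
   small userspace model that works on a plain `unsigned long` instead of
   folio flags. `MODEL_REFS_PGOFF` is an arbitrary, hypothetical field position
   and the `model_*` functions are illustrative only. It shows one consequence
   of the encoding: for a count such as 4 (L|W|R = 1|0|0) the count-based
   predicates are true while the single-bit test is false.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_REFS_PGOFF    4u  /* hypothetical position of the 3-bit field */
#define MODEL_REFS_MASK     (7ul << MODEL_REFS_PGOFF)
#define LRU_REFS_REFERENCED 1ul
#define LRU_REFS_WORKINGSET 2ul

/* Extract the 3-bit reference count from the flags word. */
static unsigned long model_lru_refs(unsigned long flags)
{
    return (flags & MODEL_REFS_MASK) >> MODEL_REFS_PGOFF;
}

/* Replace the 3-bit reference count, leaving all other flag bits alone. */
static unsigned long model_set_refs(unsigned long flags, unsigned long refs)
{
    assert(refs <= 7ul);
    return (flags & ~MODEL_REFS_MASK) | (refs << MODEL_REFS_PGOFF);
}

/* Count-based predicates, mirroring folio_is_referenced/_workingset. */
static bool model_is_referenced(unsigned long flags)
{
    return model_lru_refs(flags) >= LRU_REFS_REFERENCED;
}

static bool model_is_workingset(unsigned long flags)
{
    return model_lru_refs(flags) >= LRU_REFS_WORKINGSET;
}

/* Bit-based compatibility variants: look at or set a single bit only. */
static bool model_is_referenced_by_bit(unsigned long flags)
{
    return (flags >> MODEL_REFS_PGOFF) & 1ul;
}

static unsigned long model_mark_workingset_by_bit(unsigned long flags)
{
    return flags | (1ul << (MODEL_REFS_PGOFF + 1u));
}
```

   Marking the workingset bit on a count of 1 yields 3, and on a count of 4
   yields 6, so the by-bit helper can only raise the count.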

4. MGLRU's swappiness is kind of useless in some situations compared to
   Active / Inactive LRU, since it force-protects the youngest two gens, so
   quite often we can only reclaim one type of folio. To work around that,
   users usually run force aging before reclaim. So, can we just remove the
   force protection of the youngest two gens?

5. Async aging and aging optimization are also required to make the above ideas
   work better.

6. Other issues and discussion on whether the above improvements will help
   solve them or make them worse. e.g.

   For eBPF extension, using eBPF to determine which gen a folio should
   land in, given the shadow and once we have more than 4 gens, might be very
   helpful and sufficient for many workload customizations.

   Can we just ignore the shadow for anon folios? MGLRU basically activates
   anon folios unconditionally. Especially if we combine this with the LFU-like
   idea above, we might only want to track the 3-bit count and get rid of
   the extra bit usage in the shadow. The eviction performance might be even
   better, and other components like the swap table [3] would have more bits to
   use for better performance and more features.

The goals are:

- Reduce MGLRU's page flag usage to be identical to or less than that of
  Active / Inactive LRU.
- Eliminate regressions.
- Unify or improve the metrics.
- Provide more extensibility.

Link: https://lwn.net/Articles/945266/ [1]
Link: https://github.com/ryncsn/linux/tree/improving-mglru [2]
Link: https://lore.kernel.org/linux-mm/20260218-swap-table-p3-v3-5-f4e34be021a7@tencent.com/ [3]


* Re: [LSF/MM/BPF TOPIC] Improving MGLRU
  2026-02-19 17:25 [LSF/MM/BPF TOPIC] Improving MGLRU Kairui Song
@ 2026-02-20 18:24 ` Johannes Weiner
  2026-02-21  6:03   ` Kairui Song
  2026-02-26  1:55 ` Kalesh Singh
  1 sibling, 1 reply; 6+ messages in thread
From: Johannes Weiner @ 2026-02-20 18:24 UTC
  To: Kairui Song; +Cc: lsf-pc, Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-mm

On Fri, Feb 20, 2026 at 01:25:33AM +0800, Kairui Song wrote:
> Hi All,
> 
> Apologies I forgot to add the proper tag in the previous email so
> resending this.
> 
> MGLRU has been introduced in the mainline for years, but we still have two LRUs
> today. There are many reasons MGLRU is still not the only LRU implementation in
> the kernel.
> 
> And I've been looking at a few major issues here:
> 
> 1. Page flag usage: MGLRU uses many more flags (3+ more) than Active/Inactive
> LRU.
> 2. Regressions: MGLRU might cause regression, even though in many workloads it
> outperforms Active/Inactive by a lot.
> 3. Metrics: MGLRU makes some metrics work differently, for example: PSI,
> /proc/meminfo.
> 4. Some reclaim behavior is less controllable.

I would be very interested in discussing this topic as well.

> 2. Regressions: Currently regression is a more major problem for us.
>    From our perspective, almost all regressions are caused by an under- or
>    overprotected file cache. MGLRU's PID protection either gets too aggressive
>    or too passive or just have a too long latency. To fix that, I'd propose a
>    LFU-like design and relax the PID's aggressiveness to make it much more
>    proactive and effective for file folios. The idea is always use 3 bits in
>    the page flags to count the referenced time (which would also replace
>    PG_workingset and PG_referenced). Initial tests showed a 30% reduction of
>    refaults, and many regressions are gone. A flow chart of how the MGLRU idea
>    might work:

Are you referring to refaults on the page cache side, or swapins?

Last time we evaluated MGLRU on Meta workloads, we noticed that it
tends to do better with zswap, but worse with disk swap. It seemed to
just prefer reclaiming anon, period.

For the balancing between anon and file to work well in all
situations, it needs to have a notion of backend speed and factor in
the respective cost of misses on each side.

> 4. MGLRU's swappiness is kind of useless in some situations compared to
>    Active / Inactive LRU, since its force protects the youngest two gen, so
>    quite often we can only reclaim one type of folios. To workaround that, the
>    user usually runs force aging before reclaim. So, can we just remove the
>    force protection of the youngest two gens?

[...]

> 6. Other issues and discussion on whether the above improvements will help
>    solve them or make them worse. e.g.

[...]

>    Can we just ignore the shadow for anon folios? MGLRU basically activates
>    anon folios unconditionally, especially if we combined with the LFU like
>    idea above we might only want to track the 3 bit count, and get rid of
>    the extra bit usage in the shadow. The eviction performance might be even
>    better, and other components like swap table [3] will have more bits to use
>    for better performance and more features.

On the face of it, both of these sound problematic to me. Why are
anon pages special-cased?

The cost of reclaiming a page is:

    reuse frequency * cost of a miss

The *type* of the page is not all that meaningful for workload
performance. The wait time is qualitatively the same.

If you assume every refaulting anon is hot, it'll fall apart when the
anon set is huge and has little locality.
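
The cost formula above can be worked through with a tiny sketch. The struct,
the function names, and every number used with it are invented purely for
illustration; none of them are measurements from this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-type estimate; all values are illustrative only. */
struct page_class_model {
    unsigned long reuse_per_sec; /* how often reclaimed pages are needed again */
    unsigned long miss_cost_us;  /* latency of one refault or swap-in */
};

/* cost of reclaiming = reuse frequency * cost of a miss */
static unsigned long model_reclaim_cost(const struct page_class_model *c)
{
    assert(c);
    return c->reuse_per_sec * c->miss_cost_us;
}

/* True when reclaiming anon is estimated to cost less than reclaiming file. */
static bool model_prefer_anon(const struct page_class_model *anon,
                              const struct page_class_model *file)
{
    return model_reclaim_cost(anon) < model_reclaim_cost(file);
}
```

With a file miss costing 500 us at 1 reuse/s, an anon set reused at 2/s is the
cheaper one to reclaim when a miss costs 20 us (2 * 20 = 40 < 500), but not
when the same miss costs 500 us (2 * 500 = 1000 > 500). The page type never
enters the comparison, only reuse frequency and backend speed do.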



* Re: [LSF/MM/BPF TOPIC] Improving MGLRU
  2026-02-20 18:24 ` Johannes Weiner
@ 2026-02-21  6:03   ` Kairui Song
  0 siblings, 0 replies; 6+ messages in thread
From: Kairui Song @ 2026-02-21  6:03 UTC
  To: Johannes Weiner
  Cc: lsf-pc, Chen Ridong, Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-mm

On Sat, Feb 21, 2026 at 2:24 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Feb 20, 2026 at 01:25:33AM +0800, Kairui Song wrote:
> > Hi All,
> >
> > Apologies I forgot to add the proper tag in the previous email so
> > resending this.
> >
> > MGLRU has been introduced in the mainline for years, but we still have two LRUs
> > today. There are many reasons MGLRU is still not the only LRU implementation in
> > the kernel.
> >
> > And I've been looking at a few major issues here:
> >
> > 1. Page flag usage: MGLRU uses many more flags (3+ more) than Active/Inactive
> > LRU.
> > 2. Regressions: MGLRU might cause regression, even though in many workloads it
> > outperforms Active/Inactive by a lot.
> > 3. Metrics: MGLRU makes some metrics work differently, for example: PSI,
> > /proc/meminfo.
> > 4. Some reclaim behavior is less controllable.
>
> I would be very interested in discussing this topic as well.

Thanks, glad to hear that!

>
> > 2. Regressions: Currently regression is a more major problem for us.
> >    From our perspective, almost all regressions are caused by an under- or
> >    overprotected file cache. MGLRU's PID protection either gets too aggressive
> >    or too passive or just have a too long latency. To fix that, I'd propose a
> >    LFU-like design and relax the PID's aggressiveness to make it much more
> >    proactive and effective for file folios. The idea is always use 3 bits in
> >    the page flags to count the referenced time (which would also replace
> >    PG_workingset and PG_referenced). Initial tests showed a 30% reduction of
> >    refaults, and many regressions are gone. A flow chart of how the MGLRU idea
> >    might work:
>
> Are you referring to refaults on the page cache side, or swapins?
>
> Last time we evaluated MGLRU on Meta workloads, we noticed that it
> tends to do better with zswap, but worse with disk swap. It seemed to
> just prefer reclaiming anon, period.
>
> For the balancing between anon and file to work well in all
> situations, it needs to have a notion of backend speed and factor in
> the respective cost of misses on each side.

It's a bit more than that. Even when there is no swap, MGLRU still performs
worse in some workloads like MongoDB. From what I've seen, that's
because the PID protection is a bit too passive, and there is a force
protection in sort_folio which sometimes seems too aggressive.
Active/Inactive LRU will just move a folio to the head if it's accessed
twice while in RAM, but MGLRU won't do so. As a result, hotter file
folios are evicted just like colder ones until the PID gets
triggered, or they stay protected even if they haven't been used for a
while. And by the time the PID finally gets triggered, the workload might
have changed. This is fixable using the approach I mentioned, though,
and with it MGLRU seems to be better than Active/Inactive in all our known
cases. Whether that is a good fix is worth discussing.
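
For reference, the "accessed twice" behavior of Active/Inactive LRU can be
modeled in a few lines. This is a simplified userspace sketch of the classic
two-touch rule, not kernel code: `struct folio_model` and
`model_mark_accessed` are hypothetical names, and unevictable folios,
batching, and list movement are all ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified folio state for the Active/Inactive LRU. */
struct folio_model {
    bool referenced; /* stands in for PG_referenced */
    bool active;     /* stands in for PG_active */
};

/*
 * Two-touch activation: the first access only marks the folio referenced,
 * the second one activates it and clears the referenced mark.
 */
static void model_mark_accessed(struct folio_model *f)
{
    assert(f);
    if (!f->referenced) {
        f->referenced = true;
    } else if (!f->active) {
        f->active = true;
        f->referenced = false;
    }
}
```

So a file folio touched twice is protected right away, without waiting for
any feedback loop to react.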

I also notice Ridong has a series to apply a "heat" based reclaim,
which also looks interesting.

> >    Can we just ignore the shadow for anon folios? MGLRU basically activates
> >    anon folios unconditionally, especially if we combined with the LFU like
> >    idea above we might only want to track the 3 bit count, and get rid of
> >    the extra bit usage in the shadow. The eviction performance might be even
> >    better, and other components like swap table [3] will have more bits to use
> >    for better performance and more features.
>
> On the face of it, both of these sounds problematic to me. Why are
> anon pages special cased?
>
> The cost of reclaiming a page is:
>
>     reuse frequency * cost of a miss
>
> The *type* of the page is not all that meaningful for workload
> performance. The wait time is qualitatively the same.
>
> If you assume every refaulting anon is hot, it'll fall apart when the
> anon set is huge and has little locality.

Sorry, I didn't make it clear. MGLRU currently already ignores
the shadow distance for re-activation. And yeah, basically all anon folios
are activated on fault, which turns out to be quite nice? No MGLRU
users have considered that a problem, and in fact the performance looks
good.

Of course we can restore the old behavior and test the folio
against some distance (gen distance or eviction distance), or push it
further and keep only the reference bits (not ignoring the shadow
completely, just keeping only the reference bits, if the LFU + PID still
works well without the distance), and gain more performance and bits
to use.

BTW, I tried to restore the refault distance behavior for both anon and
file folios some time ago:
https://lwn.net/Articles/945266/

For file folios it indeed looked better, while anon folios seemed unchanged.
But later tests showed that it doesn't apply to all cases, and I think
something better can be used, as suggested in this topic.


* Re: [LSF/MM/BPF TOPIC] Improving MGLRU
  2026-02-19 17:25 [LSF/MM/BPF TOPIC] Improving MGLRU Kairui Song
  2026-02-20 18:24 ` Johannes Weiner
@ 2026-02-26  1:55 ` Kalesh Singh
  2026-02-26  3:06   ` Kairui Song
  1 sibling, 1 reply; 6+ messages in thread
From: Kalesh Singh @ 2026-02-26  1:55 UTC
  To: Kairui Song
  Cc: lsf-pc, Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-mm,
	android-mm, Suren Baghdasaryan, T.J. Mercier

On Thu, Feb 19, 2026 at 9:26 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> [...]

Hi Kairui,

I would be very interested in joining this discussion at LSF/MM.

We use MGLRU on Android. While the reduced CPU usage leads to power
improvements for mobile devices, we've run into a few notable issues
as well.

Off the top of my head:

1. Direct Reclaim Latency: We've observed that direct reclaim tail
latencies can sometimes be significantly higher with MGLRU.

2. PSI and OOM Response: Tying directly into your point about metrics,
the PSI memory pressure generated by MGLRU is consistently 30% to 40%
lower than the Active/Inactive LRU on Android workloads. Because
user-space OOM daemons like lmkd rely heavily on these metrics, this
makes them slower to react to actual memory pressure.

3. Misleading Conventional LRU Metrics: We've noticed patterns in
standard memory tracking where nr_active and nr_inactive show sharp
vertical cliffs and rises. Since MGLRU derives these metrics by
mapping the two youngest generations to "active" and the two oldest to
"inactive," every time a new generation is created (incrementing the
seq counter), the second youngest generation (before the increment) is
suddenly recategorized as inactive (after the increment). Because the
newly created generation is empty, this manifests as a massive,
instantaneous drop in active pages and a corresponding spike in
inactive pages.
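
The cliff described in point 3 can be reproduced with a small userspace model.
It assumes 4 generations and the mapping described above (the two youngest
gens count as active, everything older as inactive). The struct and function
names are hypothetical and the page counts in the note below are made up.

```c
#include <assert.h>

#define MODEL_MAX_NR_GENS 4u

/* Hypothetical per-lruvec state: one page counter per generation. */
struct lruvec_model {
    unsigned long min_seq;
    unsigned long max_seq;
    unsigned long nr_pages[MODEL_MAX_NR_GENS]; /* indexed by seq % MODEL_MAX_NR_GENS */
};

/* Sum the two youngest generations (max_seq and max_seq - 1). */
static unsigned long model_nr_active(const struct lruvec_model *l)
{
    unsigned long seq, sum = 0;

    for (seq = l->min_seq; seq <= l->max_seq; seq++)
        if (seq + 1 >= l->max_seq)
            sum += l->nr_pages[seq % MODEL_MAX_NR_GENS];
    return sum;
}

/* Sum every generation older than the two youngest. */
static unsigned long model_nr_inactive(const struct lruvec_model *l)
{
    unsigned long seq, sum = 0;

    for (seq = l->min_seq; seq <= l->max_seq; seq++)
        if (seq + 1 < l->max_seq)
            sum += l->nr_pages[seq % MODEL_MAX_NR_GENS];
    return sum;
}

/* Create a new, empty youngest generation; no page moves at all. */
static void model_inc_max_seq(struct lruvec_model *l)
{
    assert(l->max_seq - l->min_seq + 1 < MODEL_MAX_NR_GENS);
    l->max_seq++;
    l->nr_pages[l->max_seq % MODEL_MAX_NR_GENS] = 0;
}
```

With three gens holding 100, 200, and 300 pages (oldest first), the model
reports 500 active and 100 inactive. After a single `model_inc_max_seq()` it
reports 300 active and 300 inactive, although no page was touched or moved.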

I'd love to participate and discuss how we might tackle these
regressions and metrics.

Thanks,
Kalesh




* Re: [LSF/MM/BPF TOPIC] Improving MGLRU
  2026-02-26  1:55 ` Kalesh Singh
@ 2026-02-26  3:06   ` Kairui Song
  2026-02-26 10:10     ` wangzicheng
  0 siblings, 1 reply; 6+ messages in thread
From: Kairui Song @ 2026-02-26  3:06 UTC
  To: Kalesh Singh, wangzicheng
  Cc: lsf-pc, Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-mm,
	android-mm, Suren Baghdasaryan, T.J. Mercier, Barry Song

On Thu, Feb 26, 2026 at 9:55 AM Kalesh Singh <kaleshsingh@google.com> wrote:
>
> On Thu, Feb 19, 2026 at 9:26 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > Hi All,
> >
> > Apologies I forgot to add the proper tag in the previous email so
> > resending this.
> >
> > MGLRU has been introduced in the mainline for years, but we still have two LRUs
> > today. There are many reasons MGLRU is still not the only LRU implementation in
> > the kernel.
> Hi Kairui,
>
> I would be very interested in joining this discussion at LSF/MM.
>
> We use MGLRU on Android. While the reduced CPU usage leads to power
> improvements for mobile devices, we've run into a few notable issues
> as well.

Hi Kalesh,

Glad to discuss this with you.

>
> Off the top of my head:
>
> 1. Direct Reclaim Latency: We've observed that direct reclaim tail
> latencies can sometimes be significantly higher with MGLRU.
>
> 2. PSI and OOM Response: Tying directly into your point about metrics,
> the PSI memory pressure generated by MGLRU is consistently 30% to 40%
> lower than the Active/Inactive LRU on Android workloads. Because
> user-space OOM daemons like lmkd rely heavily on these metrics, this
> causes them to be less quick to react to actual memory pressure.

Yes, this is one of the main issues for us too. From what we have observed,
one cause is that MGLRU's usage of flags like PG_workingset is
different from Active / Inactive LRU, and flags like PG_workingset
are now bound to tiering, so changing that requires some
redesign of how tiering works too. That is one of the motivations
behind the LFU-like tiering design I mentioned. It should make
things like PSI or readahead stable again.

> 3. Misleading Conventional LRU Metrics: We've noticed patterns in
> standard memory tracking where nr_active and nr_inactive show sharp
> vertical cliffs and rises. Since MGLRU derives these metrics by
> mapping the two youngest generations to "active" and the two oldest to
> "inactive," every time a new generation is created (incrementing the
> seq counter), the second youngest generation (before the increment) is
> suddenly recategorized as inactive (after the increment). Because the
> newly created generation is empty, this manifests as a massive,
> instantaneous drop in active pages and a corresponding spike in
> inactive pages.

That's also a major problem for things like K8s. The cliffs and rises
confuse the cloud scheduler. Our solution is also based on the new
tiering design: counting the number of folios in different tiers
instead of gens will greatly improve the usability of nr_active /
nr_inactive. Whether this is a good design can be discussed.

>
> I'd love to participate and discuss how we might tackle these
> regressions and metrics.

Looking forward to that!

I also noticed Zicheng has another proposal. I've previously discussed
some ideas with him too, so hopefully we will make some progress
on this.



* RE: [LSF/MM/BPF TOPIC] Improving MGLRU
  2026-02-26  3:06   ` Kairui Song
@ 2026-02-26 10:10     ` wangzicheng
  0 siblings, 0 replies; 6+ messages in thread
From: wangzicheng @ 2026-02-26 10:10 UTC (permalink / raw)
  To: Kairui Song, Kalesh Singh
  Cc: lsf-pc, Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-mm,
	android-mm, Suren Baghdasaryan, T.J. Mercier, Barry Song,
	wangtao, gao xu, wangxin 00023513




Hi Kairui, hi Kalesh,

Yes, we’re interested in this work.

We see file pages being under-protected in smartphone workloads, and an
LFU-like approach sounds promising for better promoting and protecting hot
file pages.
Kairui has shared the patches; we’ll backport them to our tree and report back
once we have results from our workloads.

Best,
Zicheng


^ permalink raw reply	[flat|nested] 6+ messages in thread

Thread overview: 6+ messages
2026-02-19 17:25 [LSF/MM/BPF TOPIC] Improving MGLRU Kairui Song
2026-02-20 18:24 ` Johannes Weiner
2026-02-21  6:03   ` Kairui Song
2026-02-26  1:55 ` Kalesh Singh
2026-02-26  3:06   ` Kairui Song
2026-02-26 10:10     ` wangzicheng
