* [LSF/MM/BPF] Improving MGLRU
@ 2026-02-19 17:09 Kairui Song
From: Kairui Song @ 2026-02-19 17:09 UTC (permalink / raw)
To: lsf-pc, Axel Rasmussen, Yuanchu Xie, Wei Xu, linux-mm
Hi All,
MGLRU has been in mainline for years, but we still have two LRU implementations
today. There are many reasons MGLRU is still not the only LRU implementation in
the kernel.
I've been looking at a few major issues here:
1. Page flag usage: MGLRU uses many more flags (3+ more) than Active/Inactive
LRU.
2. Regressions: MGLRU can cause regressions, even though in many workloads it
outperforms Active/Inactive by a lot.
3. Metrics: MGLRU makes some metrics work differently, for example: PSI,
/proc/meminfo, smaps.
4. Some reclaim behavior is less controllable.
There are other issues too.
I don't think there is a simple solution, but these issues can definitely be
solved. I would like to propose a session to discuss a few ideas on improving
MGLRU, so that perhaps we can finally have only one LRU in the kernel.
Some parts are still just ideas. So far I have a working series [2] following
the LFU and metric unification ideas below, which solves 2) and 3) above and
provides some very basic infrastructure for 1). I will try to send it as an
RFC for easier review and merging once it is stable enough, before LSF/MM/BPF.
So far, I have observed a 30% reduction in total folio refaults in some
workloads, including TPC-C and YCSB. Several critical regressions compared to
Active / Inactive are gone, PG_workingset and PG_referenced are gone, things
like PSI are more accurate (see below), and it still stays bitwise compatible
with Active / Inactive LRU. If this goes smoothly, we might be able to unify
and have only one LRU.
The following topics and ideas are the key points:
1. Flag usage: this is solvable, and the hard part is mostly about
implementation details. MGLRU uses (at least) 3 extra flags for the gen
number, and we expect it to use more gen flags to support more than 4 gens.
These flags can be moved to the tail of the LRU pointers after carefully
modifying the kernel's conventions on LRU operations. That would allow us to
use up to 6 bits for the gen number and support up to 63 gens. The lower bits
of both pointers can be packed together for CAS on gen numbers. This reduces
flag usage by 3. Previously, Yu also suggested moving flags like PG_active to
the LRU pointer tail, which could also be a way.
    struct folio {
        /* ... */
        union {
            struct list_head lru;
    +       struct lru_gen_list_head lru_gen;
            /* ... */
        };
        /* ... */
    };
So whenever the folio is on a lruvec, `lru_gen_list_head`, which contains the
encoded info, is used instead of `lru`. We might be able to move all
LRU-related flags there.
Ordinary folio lists are still just fine, since when the folio is isolated,
`lru` is still there. But places like folio split will need to check whether
the folio is on a lruvec or on an ordinary list.
This part is just an idea so far, but it might allow us to have up to 63 gens
upstream and enable the build for every config.
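To make the pointer-tail idea more concrete, here is a minimal userspace
sketch. It is not code from the series: it assumes list nodes are at least
8-byte aligned, so each of the two pointers has 3 spare low bits, giving 6 bits
in total. The struct layout, the helper names and LRU_GEN_MAX are hypothetical,
and it does not attempt the CAS part.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: 3 spare low bits per pointer, assuming 8-byte alignment. */
#define GEN_BITS_PER_PTR 3
#define GEN_PTR_MASK     ((uintptr_t)((1u << GEN_BITS_PER_PTR) - 1))
#define LRU_GEN_MAX      63u /* 6 bits in total; hypothetical name */

/* Hypothetical stand-in for the proposed struct lru_gen_list_head. */
struct lru_gen_list_head {
	_Alignas(8) uintptr_t next; /* pointer | low 3 bits of gen */
	uintptr_t prev;             /* pointer | high 3 bits of gen */
};

/* Link a node to its neighbours with gen 0 encoded in the tails. */
static inline void lru_gen_link(struct lru_gen_list_head *h,
				struct lru_gen_list_head *next,
				struct lru_gen_list_head *prev)
{
	assert(((uintptr_t)next & GEN_PTR_MASK) == 0);
	assert(((uintptr_t)prev & GEN_PTR_MASK) == 0);
	h->next = (uintptr_t)next;
	h->prev = (uintptr_t)prev;
}

/* Store gen in the pointer tails without disturbing the pointers. */
static inline void lru_gen_set(struct lru_gen_list_head *h, unsigned int gen)
{
	assert(gen <= LRU_GEN_MAX);
	h->next = (h->next & ~GEN_PTR_MASK) | (gen & GEN_PTR_MASK);
	h->prev = (h->prev & ~GEN_PTR_MASK) |
		  ((gen >> GEN_BITS_PER_PTR) & GEN_PTR_MASK);
}

/* Recombine the two 3-bit halves into the gen number. */
static inline unsigned int lru_gen_get(const struct lru_gen_list_head *h)
{
	return (unsigned int)(h->next & GEN_PTR_MASK) |
	       ((unsigned int)(h->prev & GEN_PTR_MASK) << GEN_BITS_PER_PTR);
}

/* Every list walk has to mask the tail bits off before dereferencing. */
static inline struct lru_gen_list_head *
lru_gen_next(const struct lru_gen_list_head *h)
{
	return (struct lru_gen_list_head *)(h->next & ~GEN_PTR_MASK);
}

static inline struct lru_gen_list_head *
lru_gen_prev(const struct lru_gen_list_head *h)
{
	return (struct lru_gen_list_head *)(h->prev & ~GEN_PTR_MASK);
}
```

The masking in lru_gen_next() and lru_gen_prev() is the reason generic
list_head users cannot touch such a node directly, which is why places like
folio split would need to know which kind of list the folio is on.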
2. Regressions: Currently, regressions are the more major problem for us.
From our perspective, almost all regressions are caused by an under- or
overprotected file cache. MGLRU's PID protection is either too aggressive,
too passive, or just has too long a latency. To fix that, I'd propose an
LFU-like design and relax the PID's aggressiveness to make it much more
proactive and effective for file folios. The idea is to always use 3 bits in
the page flags to count the number of references (which would also replace
PG_workingset and PG_referenced). Initial tests showed a 30% reduction in
refaults, and many regressions are gone. A flow chart of how the idea might
work:
              ========== MGLFU Tiering ==========
  Access   3 bit    lru_gen   lru_gen   | (R - PG_referenced | W - PG_workingset)
  Count    L|W|R    refs      tier      | (L - LRU_GEN_REFS)
  0        0|0|0    0         0         | - Readahead & Cache
  1        0|0|1    1         0         | - LRU_REFS_REFERENCED
  ----- WORKINGSET / PROMOTE ---     <--+ - <move out of min_seq>
  2        0|1|0    2         0         | - LRU_REFS_WORKINGSET
  3        0|1|1    3         1         | - Frequently used
  4        1|0|0*   4         2         |
  5        1|0|1*   5         2         |
  6        1|1|0*   6         3         |
  7        1|1|1*   7         3         | - LRU_REFS_MAX
  ---------- PROMOTION ---------->    --+ - <promote to next gen>
Once a folio has an access count > LRU_REFS_WORKINGSET, it never goes lower
than that. Folios that hit LRU_REFS_MAX will be promoted to the next gen on
access, and the force protection of folios on eviction is removed. This
provides a more proactive protection.
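As a rough illustration of the counting and tiering in the chart, here is a
minimal userspace sketch. The tier table is copied row by row from the chart,
but struct lfu_state, lfu_access() and lfu_tier() are hypothetical names, and
leaving the count untouched after a promotion is my assumption, since that
detail is not settled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Values as they appear in the chart above. */
enum {
	LRU_REFS_REFERENCED = 1,
	LRU_REFS_WORKINGSET = 2,
	LRU_REFS_MAX        = 7,
};

/* refs -> tier, copied row by row from the chart. */
static const unsigned char lru_tier_of_refs[LRU_REFS_MAX + 1] = {
	0, 0, 0, 1, 2, 2, 3, 3,
};

/* Hypothetical per-folio state: a 3-bit access count plus the gen. */
struct lfu_state {
	unsigned int refs;
	unsigned int gen;
};

/*
 * Record one access. The count saturates at LRU_REFS_MAX; an access to a
 * folio that is already at the maximum promotes it to the next gen instead.
 * Returns true if the folio was promoted.
 */
static inline bool lfu_access(struct lfu_state *s)
{
	assert(s->refs <= LRU_REFS_MAX);
	if (s->refs == LRU_REFS_MAX) {
		s->gen++; /* assumption: refs is left unchanged on promotion */
		return true;
	}
	s->refs++;
	return false;
}

/* Look up the tier for the current access count. */
static inline unsigned int lfu_tier(const struct lfu_state *s)
{
	assert(s->refs <= LRU_REFS_MAX);
	return lru_tier_of_refs[s->refs];
}
```

With this shape, a folio read once by readahead stays in tier 0, while a folio
needs seven accesses before further accesses start moving it across gens.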
And this might also give other frameworks like DAMON a nicer interface to
interact with MGLRU, since the referenced count can promote every folio and
count accesses in a more reasonable and unified way for MGLRU now.
NOTE: I am still changing this design according to test results (e.g. maybe
we should optionally still use 4 bits), so the final solution might not
be the same.
Another potential improvement on the regression issue is implementing the
refault distance as I once proposed [1], which can have a huge gain for some
workloads with heavy file folio usage. Maybe we can have both.
3. Metrics: The key here is about the meaning of page flags, including
PG_workingset and PG_referenced. These two flags are set/cleared very
differently for MGLRU compared to Active / Inactive LRU, but many other
components are still using them as metrics for Active / Inactive LRU. Hence,
I would propose to introduce a different mechanism to unify and replace these
two flags: Using the 3 bits in the page flags field reserved for LFU-like
tracking above, to determine the folio status.
Then, following the LFU-like idea above, we can use helpers like:
    static inline bool folio_is_referenced(const struct folio *folio)
    {
        return folio_lru_refs(folio) >= LRU_REFS_REFERENCED;
    }

    static inline bool folio_is_workingset(const struct folio *folio)
    {
        return folio_lru_refs(folio) >= LRU_REFS_WORKINGSET;
    }

    static inline bool folio_is_referenced_by_bit(struct folio *folio)
    {
        /* For compatibility */
        return !!(READ_ONCE(*folio_flags(folio, 0)) & BIT(LRU_REFS_PGOFF));
    }

    static inline void folio_mark_workingset_by_bit(struct folio *folio)
    {
        /* For compatibility */
        set_mask_bits(folio_flags(folio, 0), BIT(LRU_REFS_PGOFF + 1),
                      BIT(LRU_REFS_PGOFF + 1));
    }
These helpers tell whether a folio belongs to a working set or is referenced.
The definition of workingset will be simplified as follows for MGLRU: a folio
belongs to the working set once it has been referenced at least twice (refs >=
LRU_REFS_WORKINGSET), and this is decoupled from MGLRU's tiering.
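To show why the count-based and bit-based views can share the same 3 bits, here
is a small userspace sketch that works on a plain flags word instead of a
folio. The bit offset and all function names are made up for illustration; only
the two LRU_REFS_* values come from the chart earlier in this mail.

```c
#include <assert.h>
#include <stdbool.h>

#define LRU_REFS_WIDTH 3
#define LRU_REFS_PGOFF 8 /* hypothetical position of the 3-bit field */
#define LRU_REFS_MASK  (((1UL << LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)

enum {
	LRU_REFS_REFERENCED = 1,
	LRU_REFS_WORKINGSET = 2,
};

/* Extract the 3-bit access count from a flags word. */
static inline unsigned int flags_lru_refs(unsigned long flags)
{
	return (unsigned int)((flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF);
}

/* Count-based view: any non-zero count means "referenced". */
static inline bool flags_is_referenced(unsigned long flags)
{
	return flags_lru_refs(flags) >= LRU_REFS_REFERENCED;
}

/* Count-based view: a count of 2 or more means "workingset". */
static inline bool flags_is_workingset(unsigned long flags)
{
	return flags_lru_refs(flags) >= LRU_REFS_WORKINGSET;
}

/* Bit-based view: the lowest count bit sits where the R bit is in the chart. */
static inline bool flags_is_referenced_by_bit(unsigned long flags)
{
	return (flags & (1UL << LRU_REFS_PGOFF)) != 0;
}

/* Bit-based view: setting the W bit raises the count to at least 2. */
static inline unsigned long flags_mark_workingset_by_bit(unsigned long flags)
{
	return flags | (1UL << (LRU_REFS_PGOFF + 1));
}
```

Note that the two "referenced" checks are not equivalent: a count of 2 has the
lowest bit clear, so the by-bit check says no while the count-based check says
yes. That difference is why the by-bit variants are marked as compatibility
helpers only.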
4. MGLRU's swappiness is of little use in some situations compared to
Active / Inactive LRU: since it force protects the youngest two gens,
quite often we can only reclaim one type of folio. To work around that, the
user usually runs force aging before reclaim. So, can we just remove the
force protection of the youngest two gens?
5. Async aging and aging optimization are also required to make the above ideas
work better.
6. Other issues, and discussion on whether the above improvements will help
solve them or make them worse, e.g.:
For the eBPF extension, once we have more than 4 gens, using eBPF to
determine which gen a folio should land in given the shadow might be very
helpful, and enough for many workload customizations.
Can we just ignore the shadow for anon folios? MGLRU basically activates
anon folios unconditionally. Especially if combined with the LFU-like
idea above, we might only want to track the 3 bit count and get rid of
the extra bit usage in the shadow. The eviction performance might be even
better, and other components like swap table [3] will have more bits to use
for better performance and more features.
The goals are:
- Reduce MGLRU's page flag usage to be identical to or less than that of
Active / Inactive LRU.
- Eliminate regressions.
- Unify or improve the metrics.
- Provide more extensibility.
Link: https://lwn.net/Articles/945266/ [1]
Link: https://github.com/ryncsn/linux/tree/improving-mglru [2]
Link: https://lore.kernel.org/linux-mm/20260218-swap-table-p3-v3-5-f4e34be021a7@tencent.com/ [3]