* [PATCH 0/2] mm: multi-gen LRU: Have secondary MMUs participate in MM_WALK
@ 2024-10-19 1:29 James Houghton
2024-10-19 1:29 ` [PATCH 1/2] mm: multi-gen LRU: remove MM_LEAF_OLD and MM_NONLEAF_TOTAL stats James Houghton
2024-10-19 1:29 ` [PATCH 2/2] mm: multi-gen LRU: use {ptep,pmdp}_clear_young_notify() James Houghton
From: James Houghton @ 2024-10-19 1:29 UTC
To: Andrew Morton
Cc: Sean Christopherson, Paolo Bonzini, David Matlack,
David Rientjes, James Houghton, Oliver Upton, David Stevens,
Yu Zhao, Wei Xu, Axel Rasmussen, kvm, linux-kernel, linux-mm
Today, the MM_WALK capability causes MGLRU to clear the young bit from
PMDs and PTEs during its page table walk before eviction, but MGLRU does
not call the clear_young() MMU notifier in this case. Skipping the
notifier makes the mm walk take less time/CPU, but it causes pages that
are mostly accessed through KVM / secondary MMUs to appear younger than
they should be.
We do call the clear_young() notifier today, but only when attempting to
evict the page, so we end up clearing young/accessed information less
frequently for secondary MMUs than for mm PTEs. Pages mapped by
secondary MMUs therefore appear younger and are less likely to be
evicted, while memory that is *not* mostly accessed through KVM is
evicted *more* frequently, worsening performance.
ChromeOS observed a tab-open latency regression when enabling MGLRU with
a setup that involved running a VM:
Tab-open latency histogram (ms)

Version      p50    mean     p95     p99     max
base        1315    1198    2347    3454   10319
mglru       2559    1311    7399   12060   43758
fix         1119     926    2470    4211    6947
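The fix is to make the mm walk call the MMU notifiers via
{ptep,pmdp}_clear_young_notify(). These helpers combine clearing the
primary MMU's accessed bit with the clear_young() MMU notifier, so an
entry counts as young if either the primary or a secondary MMU saw an
access. A rough sketch of the PTE variant (simplified from the
mmu_notifier.h helpers; the name and layout below are illustrative, not
a verbatim copy):

	/* Assumes <linux/mm.h>, <linux/pgtable.h>, <linux/mmu_notifier.h>. */
	static inline int ptep_clear_young_notify_sketch(struct vm_area_struct *vma,
							 unsigned long addr, pte_t *ptep)
	{
		/* Clear the accessed bit in the primary MMU's PTE (no TLB flush). */
		int young = ptep_test_and_clear_young(vma, addr, ptep);

		/* Ask secondary MMUs (e.g. KVM) to clear their accessed bits too. */
		young |= mmu_notifier_clear_young(vma->vm_mm, addr, addr + PAGE_SIZE);

		return young;
	}

Patch 2 switches the walk and look-around paths from
ptep_test_and_clear_young() to this notifier-aware variant.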
This series replaces the final non-selftest patches of the series at
[1], which introduced a similar change (and a new MMU notifier) together
with KVM optimizations. I'll send a separate series (to Sean and Paolo)
for the KVM optimizations.

This series also makes proactive reclaim with MGLRU possible for KVM
memory. I have verified that it works correctly with the selftest from
[1], but given that it is a KVM selftest, I'll send it with the rest of
the KVM optimizations later. Andrew, let me know if you'd like to take
the test now anyway.
[1]: https://lore.kernel.org/linux-mm/20240926013506.860253-18-jthoughton@google.com/
Yu Zhao (2):
mm: multi-gen LRU: remove MM_LEAF_OLD and MM_NONLEAF_TOTAL stats
mm: multi-gen LRU: use {ptep,pmdp}_clear_young_notify()
include/linux/mmzone.h | 7 ++-
mm/rmap.c | 9 ++--
mm/vmscan.c | 105 +++++++++++++++++++++--------------------
3 files changed, 60 insertions(+), 61 deletions(-)
base-commit: b5d43fad926a3f542cd06f3c9d286f6f489f7129
--
2.47.0.105.g07ac214952-goog
* [PATCH 1/2] mm: multi-gen LRU: remove MM_LEAF_OLD and MM_NONLEAF_TOTAL stats
2024-10-19 1:29 [PATCH 0/2] mm: multi-gen LRU: Have secondary MMUs participate in MM_WALK James Houghton
@ 2024-10-19 1:29 ` James Houghton
2024-10-24 3:26 ` Andrew Morton
2024-10-19 1:29 ` [PATCH 2/2] mm: multi-gen LRU: use {ptep,pmdp}_clear_young_notify() James Houghton
From: James Houghton @ 2024-10-19 1:29 UTC
To: Andrew Morton
Cc: Sean Christopherson, Paolo Bonzini, David Matlack,
David Rientjes, James Houghton, Oliver Upton, David Stevens,
Yu Zhao, Wei Xu, Axel Rasmussen, kvm, linux-kernel, linux-mm
From: Yu Zhao <yuzhao@google.com>
The removed stats, MM_LEAF_OLD and MM_NONLEAF_TOTAL, are not very
helpful and become more complicated to properly compute when adding
test/clear_young() notifiers in MGLRU's mm walk.
Signed-off-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
---
include/linux/mmzone.h | 2 --
mm/vmscan.c | 14 +++++---------
2 files changed, 5 insertions(+), 11 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 96dea31fb211..691c635d8d1f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -460,9 +460,7 @@ struct lru_gen_folio {
enum {
MM_LEAF_TOTAL, /* total leaf entries */
- MM_LEAF_OLD, /* old leaf entries */
MM_LEAF_YOUNG, /* young leaf entries */
- MM_NONLEAF_TOTAL, /* total non-leaf entries */
MM_NONLEAF_FOUND, /* non-leaf entries found in Bloom filters */
MM_NONLEAF_ADDED, /* non-leaf entries added to Bloom filters */
NR_MM_STATS
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2d0486189804..60669f8bba46 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3405,7 +3405,6 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
continue;
if (!pte_young(ptent)) {
- walk->mm_stats[MM_LEAF_OLD]++;
continue;
}
@@ -3558,7 +3557,6 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
walk->mm_stats[MM_LEAF_TOTAL]++;
if (!pmd_young(val)) {
- walk->mm_stats[MM_LEAF_OLD]++;
continue;
}
@@ -3570,8 +3568,6 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
continue;
}
- walk->mm_stats[MM_NONLEAF_TOTAL]++;
-
if (!walk->force_scan && should_clear_pmd_young()) {
if (!pmd_young(val))
continue;
@@ -5262,11 +5258,11 @@ static void lru_gen_seq_show_full(struct seq_file *m, struct lruvec *lruvec,
for (tier = 0; tier < MAX_NR_TIERS; tier++) {
seq_printf(m, " %10d", tier);
for (type = 0; type < ANON_AND_FILE; type++) {
- const char *s = " ";
+ const char *s = "xxx";
unsigned long n[3] = {};
if (seq == max_seq) {
- s = "RT ";
+ s = "RTx";
n[0] = READ_ONCE(lrugen->avg_refaulted[type][tier]);
n[1] = READ_ONCE(lrugen->avg_total[type][tier]);
} else if (seq == min_seq[type] || NR_HIST_GENS > 1) {
@@ -5288,14 +5284,14 @@ static void lru_gen_seq_show_full(struct seq_file *m, struct lruvec *lruvec,
seq_puts(m, " ");
for (i = 0; i < NR_MM_STATS; i++) {
- const char *s = " ";
+ const char *s = "xxxx";
unsigned long n = 0;
if (seq == max_seq && NR_HIST_GENS == 1) {
- s = "LOYNFA";
+ s = "TYFA";
n = READ_ONCE(mm_state->stats[hist][i]);
} else if (seq != max_seq && NR_HIST_GENS > 1) {
- s = "loynfa";
+ s = "tyfa";
n = READ_ONCE(mm_state->stats[hist][i]);
}
--
2.47.0.105.g07ac214952-goog
* [PATCH 2/2] mm: multi-gen LRU: use {ptep,pmdp}_clear_young_notify()
2024-10-19 1:29 [PATCH 0/2] mm: multi-gen LRU: Have secondary MMUs participate in MM_WALK James Houghton
2024-10-19 1:29 ` [PATCH 1/2] mm: multi-gen LRU: remove MM_LEAF_OLD and MM_NONLEAF_TOTAL stats James Houghton
@ 2024-10-19 1:29 ` James Houghton
From: James Houghton @ 2024-10-19 1:29 UTC
To: Andrew Morton
Cc: Sean Christopherson, Paolo Bonzini, David Matlack,
David Rientjes, James Houghton, Oliver Upton, David Stevens,
Yu Zhao, Wei Xu, Axel Rasmussen, kvm, linux-kernel, linux-mm
From: Yu Zhao <yuzhao@google.com>
When the MM_WALK capability is enabled, memory that is mostly accessed
by a VM appears younger than it really is, so it is less likely to be
evicted. The presence of a running VM can therefore significantly
increase swap-outs of non-VM memory, regressing performance for the
rest of the system.

Fix this regression by always calling {ptep,pmdp}_clear_young_notify()
whenever we clear the young bits on PMDs/PTEs.
Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
Reported-by: David Stevens <stevensd@google.com>
Signed-off-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: James Houghton <jthoughton@google.com>
---
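(A note on the early-skip logic added to get_pte_pfn() and get_pmd_pfn()
below: an entry that is old in the primary MMU may only be skipped early
when the mm has no MMU notifiers, because with notifiers present a
secondary MMU may still hold a young accessed bit. Schematically, with a
hypothetical helper name for the check the patch open-codes:

	static inline bool lru_gen_can_skip_old(pte_t pte, struct mm_struct *mm)
	{
		/*
		 * Old in the primary MMU, and no secondary MMU could have
		 * recorded an access: safe to skip without calling the
		 * clear_young() notifier.
		 */
		return !pte_young(pte) && !mm_has_notifiers(mm);
	}

The same rule, via pmd_young(), gates the PMD paths.)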
include/linux/mmzone.h | 5 ++-
mm/rmap.c | 9 ++---
mm/vmscan.c | 91 +++++++++++++++++++++++-------------------
3 files changed, 55 insertions(+), 50 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 691c635d8d1f..2e8c4307c728 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -557,7 +557,7 @@ struct lru_gen_memcg {
void lru_gen_init_pgdat(struct pglist_data *pgdat);
void lru_gen_init_lruvec(struct lruvec *lruvec);
-void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
void lru_gen_init_memcg(struct mem_cgroup *memcg);
void lru_gen_exit_memcg(struct mem_cgroup *memcg);
@@ -576,8 +576,9 @@ static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
{
}
-static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+static inline bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
{
+ return false;
}
static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
diff --git a/mm/rmap.c b/mm/rmap.c
index 2c561b1e52cc..4785a693857a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -896,13 +896,10 @@ static bool folio_referenced_one(struct folio *folio,
return false;
}
- if (pvmw.pte) {
- if (lru_gen_enabled() &&
- pte_young(ptep_get(pvmw.pte))) {
- lru_gen_look_around(&pvmw);
+ if (lru_gen_enabled() && pvmw.pte) {
+ if (lru_gen_look_around(&pvmw))
referenced++;
- }
-
+ } else if (pvmw.pte) {
if (ptep_clear_flush_young_notify(vma, address,
pvmw.pte))
referenced++;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 60669f8bba46..29c098790b01 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -56,6 +56,7 @@
#include <linux/khugepaged.h>
#include <linux/rculist_nulls.h>
#include <linux/random.h>
+#include <linux/mmu_notifier.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -3293,7 +3294,8 @@ static bool get_next_vma(unsigned long mask, unsigned long size, struct mm_walk
return false;
}
-static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned long addr)
+static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned long addr,
+ struct pglist_data *pgdat)
{
unsigned long pfn = pte_pfn(pte);
@@ -3305,13 +3307,20 @@ static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned
if (WARN_ON_ONCE(pte_devmap(pte) || pte_special(pte)))
return -1;
+ if (!pte_young(pte) && !mm_has_notifiers(vma->vm_mm))
+ return -1;
+
if (WARN_ON_ONCE(!pfn_valid(pfn)))
return -1;
+ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+ return -1;
+
return pfn;
}
-static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned long addr)
+static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned long addr,
+ struct pglist_data *pgdat)
{
unsigned long pfn = pmd_pfn(pmd);
@@ -3323,9 +3332,15 @@ static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned
if (WARN_ON_ONCE(pmd_devmap(pmd)))
return -1;
+ if (!pmd_young(pmd) && !mm_has_notifiers(vma->vm_mm))
+ return -1;
+
if (WARN_ON_ONCE(!pfn_valid(pfn)))
return -1;
+ if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+ return -1;
+
return pfn;
}
@@ -3334,10 +3349,6 @@ static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
{
struct folio *folio;
- /* try to avoid unnecessary memory loads */
- if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
- return NULL;
-
folio = pfn_folio(pfn);
if (folio_nid(folio) != pgdat->node_id)
return NULL;
@@ -3400,20 +3411,16 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
total++;
walk->mm_stats[MM_LEAF_TOTAL]++;
- pfn = get_pte_pfn(ptent, args->vma, addr);
+ pfn = get_pte_pfn(ptent, args->vma, addr, pgdat);
if (pfn == -1)
continue;
- if (!pte_young(ptent)) {
- continue;
- }
-
folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
if (!folio)
continue;
- if (!ptep_test_and_clear_young(args->vma, addr, pte + i))
- VM_WARN_ON_ONCE(true);
+ if (!ptep_clear_young_notify(args->vma, addr, pte + i))
+ continue;
young++;
walk->mm_stats[MM_LEAF_YOUNG]++;
@@ -3479,21 +3486,22 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
/* don't round down the first address */
addr = i ? (*first & PMD_MASK) + i * PMD_SIZE : *first;
- pfn = get_pmd_pfn(pmd[i], vma, addr);
- if (pfn == -1)
- goto next;
-
- if (!pmd_trans_huge(pmd[i])) {
- if (!walk->force_scan && should_clear_pmd_young())
+ if (pmd_present(pmd[i]) && !pmd_trans_huge(pmd[i])) {
+ if (!walk->force_scan && should_clear_pmd_young() &&
+ !mm_has_notifiers(args->mm))
pmdp_test_and_clear_young(vma, addr, pmd + i);
goto next;
}
+ pfn = get_pmd_pfn(pmd[i], vma, addr, pgdat);
+ if (pfn == -1)
+ goto next;
+
folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
if (!folio)
goto next;
- if (!pmdp_test_and_clear_young(vma, addr, pmd + i))
+ if (!pmdp_clear_young_notify(vma, addr, pmd + i))
goto next;
walk->mm_stats[MM_LEAF_YOUNG]++;
@@ -3551,24 +3559,18 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
}
if (pmd_trans_huge(val)) {
- unsigned long pfn = pmd_pfn(val);
struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
+ unsigned long pfn = get_pmd_pfn(val, vma, addr, pgdat);
walk->mm_stats[MM_LEAF_TOTAL]++;
- if (!pmd_young(val)) {
- continue;
- }
-
- /* try to avoid unnecessary memory loads */
- if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
- continue;
-
- walk_pmd_range_locked(pud, addr, vma, args, bitmap, &first);
+ if (pfn != -1)
+ walk_pmd_range_locked(pud, addr, vma, args, bitmap, &first);
continue;
}
- if (!walk->force_scan && should_clear_pmd_young()) {
+ if (!walk->force_scan && should_clear_pmd_young() &&
+ !mm_has_notifiers(args->mm)) {
if (!pmd_young(val))
continue;
@@ -4042,13 +4044,13 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
* the PTE table to the Bloom filter. This forms a feedback loop between the
* eviction and the aging.
*/
-void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
{
int i;
unsigned long start;
unsigned long end;
struct lru_gen_mm_walk *walk;
- int young = 0;
+ int young = 1;
pte_t *pte = pvmw->pte;
unsigned long addr = pvmw->address;
struct vm_area_struct *vma = pvmw->vma;
@@ -4064,12 +4066,15 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
lockdep_assert_held(pvmw->ptl);
VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio);
+ if (!ptep_clear_young_notify(vma, addr, pte))
+ return false;
+
if (spin_is_contended(pvmw->ptl))
- return;
+ return true;
/* exclude special VMAs containing anon pages from COW */
if (vma->vm_flags & VM_SPECIAL)
- return;
+ return true;
/* avoid taking the LRU lock under the PTL when possible */
walk = current->reclaim_state ? current->reclaim_state->mm_walk : NULL;
@@ -4077,6 +4082,9 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
start = max(addr & PMD_MASK, vma->vm_start);
end = min(addr | ~PMD_MASK, vma->vm_end - 1) + 1;
+ if (end - start == PAGE_SIZE)
+ return true;
+
if (end - start > MIN_LRU_BATCH * PAGE_SIZE) {
if (addr - start < MIN_LRU_BATCH * PAGE_SIZE / 2)
end = start + MIN_LRU_BATCH * PAGE_SIZE;
@@ -4090,7 +4098,7 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
/* folio_update_gen() requires stable folio_memcg() */
if (!mem_cgroup_trylock_pages(memcg))
- return;
+ return true;
arch_enter_lazy_mmu_mode();
@@ -4100,19 +4108,16 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
unsigned long pfn;
pte_t ptent = ptep_get(pte + i);
- pfn = get_pte_pfn(ptent, vma, addr);
+ pfn = get_pte_pfn(ptent, vma, addr, pgdat);
if (pfn == -1)
continue;
- if (!pte_young(ptent))
- continue;
-
folio = get_pfn_folio(pfn, memcg, pgdat, can_swap);
if (!folio)
continue;
- if (!ptep_test_and_clear_young(vma, addr, pte + i))
- VM_WARN_ON_ONCE(true);
+ if (!ptep_clear_young_notify(vma, addr, pte + i))
+ continue;
young++;
@@ -4144,6 +4149,8 @@ void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
/* feedback from rmap walkers to page table walkers */
if (mm_state && suitable_to_scan(i, young))
update_bloom_filter(mm_state, max_seq, pvmw->pmd);
+
+ return true;
}
/******************************************************************************
--
2.47.0.105.g07ac214952-goog
* Re: [PATCH 1/2] mm: multi-gen LRU: remove MM_LEAF_OLD and MM_NONLEAF_TOTAL stats
2024-10-19 1:29 ` [PATCH 1/2] mm: multi-gen LRU: remove MM_LEAF_OLD and MM_NONLEAF_TOTAL stats James Houghton
@ 2024-10-24 3:26 ` Andrew Morton
2024-10-24 3:27 ` Yu Zhao
From: Andrew Morton @ 2024-10-24 3:26 UTC
To: James Houghton
Cc: Sean Christopherson, Paolo Bonzini, David Matlack,
David Rientjes, Oliver Upton, David Stevens, Yu Zhao, Wei Xu,
Axel Rasmussen, kvm, linux-kernel, linux-mm
On Sat, 19 Oct 2024 01:29:38 +0000 James Houghton <jthoughton@google.com> wrote:
> From: Yu Zhao <yuzhao@google.com>
>
> The removed stats, MM_LEAF_OLD and MM_NONLEAF_TOTAL, are not very
> helpful and become more complicated to properly compute when adding
> test/clear_young() notifiers in MGLRU's mm walk.
>
Patch 2/2 has Fixes: and cc:stable tags and requires patch 1/2, so I
have added the same Fixes: and cc:stable to patch 1/2.
* Re: [PATCH 1/2] mm: multi-gen LRU: remove MM_LEAF_OLD and MM_NONLEAF_TOTAL stats
2024-10-24 3:26 ` Andrew Morton
@ 2024-10-24 3:27 ` Yu Zhao
From: Yu Zhao @ 2024-10-24 3:27 UTC
To: Andrew Morton
Cc: James Houghton, Sean Christopherson, Paolo Bonzini,
David Matlack, David Rientjes, Oliver Upton, David Stevens,
Wei Xu, Axel Rasmussen, kvm, linux-kernel, linux-mm
On Wed, Oct 23, 2024 at 9:26 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Sat, 19 Oct 2024 01:29:38 +0000 James Houghton <jthoughton@google.com> wrote:
>
> > From: Yu Zhao <yuzhao@google.com>
> >
> > The removed stats, MM_LEAF_OLD and MM_NONLEAF_TOTAL, are not very
> > helpful and become more complicated to properly compute when adding
> > test/clear_young() notifiers in MGLRU's mm walk.
> >
>
> Patch 2/2 has Fixes: and cc:stable. Patch 2/2 requires patch 1/2. So I
> added the same Fixes: and cc:stable to patch 1/2.
SGTM. Thanks!