* [RFC PATCH v1 1/2] mm: mglru: generalize page table walk
From: Kinsey Ho @ 2025-03-24 22:03 UTC
To: linux-mm, linux-kernel
Cc: yuanchu, AneeshKumar.KizhakeVeetil, Hasan.Maruf,
Jonathan.Cameron, Michael.Day, akpm, dave.hansen, david,
feng.tang, gourry, hannes, honggyu.kim, hughd, jhubbard,
k.shutemov, kbusch, kmanaouil.dev, leesuyeon0506, leillc,
liam.howlett, mgorman, mingo, nadav.amit, nphamcs, peterz,
raghavendra.kt, riel, rientjes, rppt, shivankg, shy828301, sj,
vbabka, weixugc, willy, ying.huang, ziy, dave, hyeonggon.yoo,
bharata, Kinsey Ho
Refactor the existing MGLRU page table walking logic to make it
resumable.
Additionally, introduce two hooks into the MGLRU page table walk: an
accessed callback and a flush callback. The accessed callback is called
for each page whose accessed bit is found set during the scan. The
flush callback is called when the accessed callback reports an
out-of-space error. This allows pages to be processed in batches for
efficiency.

With the generalized page table walk in place, introduce a new scan
function that repeatedly scans the same young generation and does not
add a new young generation.
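
For reference, the caller-side contract looks roughly like this (a
minimal sketch: lru_gen_scan_lruvec() and the pfn_t callback signature
come from the diff below, while my_accessed_cb(), my_flush_cb(), and
scan_once() are illustrative placeholders):

	/* Record one young pfn; return an error such as -EAGAIN when
	 * out of space. The walk saves its position in walk->next_addr
	 * and unwinds so the caller can flush and resume.
	 */
	static int my_accessed_cb(pfn_t pfn)
	{
		return 0;
	}

	/* Drain whatever my_accessed_cb() accumulated, making room for
	 * the walk to continue.
	 */
	static void my_flush_cb(void)
	{
	}

	static void scan_once(struct lruvec *lruvec, unsigned long max_seq)
	{
		/* Rescans the current young generation; does not add a
		 * new generation.
		 */
		lru_gen_scan_lruvec(lruvec, max_seq, my_accessed_cb,
				    my_flush_cb);
	}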
Signed-off-by: Kinsey Ho <kinseyho@google.com>
Signed-off-by: Yuanchu Xie <yuanchu@google.com>
---
 include/linux/mmzone.h |   5 ++
 mm/internal.h          |   4 +
 mm/vmscan.c            | 177 ++++++++++++++++++++++++++++++-----------
 3 files changed, 140 insertions(+), 46 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a5c4e789aa55..bab586961a82 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -511,6 +511,8 @@ struct lru_gen_mm_walk {
unsigned long seq;
/* the next address within an mm to scan */
unsigned long next_addr;
+ /* called for each accessed pte/pmd */
+ int (*accessed_cb)(pfn_t pfn);
/* to batch promoted pages */
int nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
/* to batch the mm stats */
@@ -518,6 +520,9 @@ struct lru_gen_mm_walk {
/* total batched items */
int batched;
int swappiness;
+ /* for the pmd under scanning */
+ int nr_young_pte;
+ int nr_total_pte;
bool force_scan;
};
diff --git a/mm/internal.h b/mm/internal.h
index 20b3535935a3..3bf528af2deb 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -476,6 +476,10 @@ extern unsigned long highest_memmap_pfn;
bool folio_isolate_lru(struct folio *folio);
void folio_putback_lru(struct folio *folio);
extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason);
+void set_task_reclaim_state(struct task_struct *task,
+ struct reclaim_state *rs);
+void lru_gen_scan_lruvec(struct lruvec *lruvec, unsigned long seq,
+ int (*accessed_cb)(pfn_t), void (*flush_cb)(void));
/*
* in mm/rmap.c:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c767d71c43d7..fb828a429645 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -57,6 +57,7 @@
#include <linux/rculist_nulls.h>
#include <linux/random.h>
#include <linux/mmu_notifier.h>
+#include <linux/pfn_t.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -271,7 +272,7 @@ static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
}
#endif
-static void set_task_reclaim_state(struct task_struct *task,
+void set_task_reclaim_state(struct task_struct *task,
struct reclaim_state *rs)
{
/* Check for an overwrite */
@@ -3023,7 +3024,7 @@ static bool iterate_mm_list(struct lru_gen_mm_walk *walk, struct mm_struct **ite
VM_WARN_ON_ONCE(mm_state->seq + 1 < walk->seq);
- if (walk->seq <= mm_state->seq)
+ if (!walk->accessed_cb && walk->seq <= mm_state->seq)
goto done;
if (!mm_state->head)
@@ -3452,16 +3453,14 @@ static void walk_update_folio(struct lru_gen_mm_walk *walk, struct folio *folio,
}
}
-static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
- struct mm_walk *args)
+static int walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
+ struct mm_walk *args, bool *suitable)
{
- int i;
+ int i, err = 0;
bool dirty;
pte_t *pte;
spinlock_t *ptl;
unsigned long addr;
- int total = 0;
- int young = 0;
struct folio *last = NULL;
struct lru_gen_mm_walk *walk = args->private;
struct mem_cgroup *memcg = lruvec_memcg(walk->lruvec);
@@ -3471,17 +3470,21 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
pmd_t pmdval;
pte = pte_offset_map_rw_nolock(args->mm, pmd, start & PMD_MASK, &pmdval, &ptl);
- if (!pte)
- return false;
+ if (!pte) {
+ *suitable = false;
+ return 0;
+ }
if (!spin_trylock(ptl)) {
pte_unmap(pte);
- return true;
+ *suitable = true;
+ return 0;
}
if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
pte_unmap_unlock(pte, ptl);
- return false;
+ *suitable = false;
+ return 0;
}
arch_enter_lazy_mmu_mode();
@@ -3491,7 +3494,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
struct folio *folio;
pte_t ptent = ptep_get(pte + i);
- total++;
+ walk->nr_total_pte++;
walk->mm_stats[MM_LEAF_TOTAL]++;
pfn = get_pte_pfn(ptent, args->vma, addr, pgdat);
@@ -3515,23 +3518,34 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
if (pte_dirty(ptent))
dirty = true;
- young++;
+ walk->nr_young_pte++;
walk->mm_stats[MM_LEAF_YOUNG]++;
+
+ if (!walk->accessed_cb)
+ continue;
+
+ err = walk->accessed_cb(pfn_to_pfn_t(pfn));
+ if (err) {
+ walk->next_addr = addr + PAGE_SIZE;
+ break;
+ }
}
walk_update_folio(walk, last, gen, dirty);
last = NULL;
- if (i < PTRS_PER_PTE && get_next_vma(PMD_MASK, PAGE_SIZE, args, &start, &end))
+ if (!err && i < PTRS_PER_PTE &&
+ get_next_vma(PMD_MASK, PAGE_SIZE, args, &start, &end))
goto restart;
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte, ptl);
- return suitable_to_scan(total, young);
+ *suitable = suitable_to_scan(walk->nr_total_pte, walk->nr_young_pte);
+ return err;
}
-static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area_struct *vma,
+static int walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area_struct *vma,
struct mm_walk *args, unsigned long *bitmap, unsigned long *first)
{
int i;
@@ -3544,6 +3558,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
DEFINE_MAX_SEQ(walk->lruvec);
int gen = lru_gen_from_seq(max_seq);
+ int err = 0;
VM_WARN_ON_ONCE(pud_leaf(*pud));
@@ -3551,13 +3566,13 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
if (*first == -1) {
*first = addr;
bitmap_zero(bitmap, MIN_LRU_BATCH);
- return;
+ return 0;
}
i = addr == -1 ? 0 : pmd_index(addr) - pmd_index(*first);
if (i && i <= MIN_LRU_BATCH) {
__set_bit(i - 1, bitmap);
- return;
+ return 0;
}
pmd = pmd_offset(pud, *first);
@@ -3607,6 +3622,16 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
dirty = true;
walk->mm_stats[MM_LEAF_YOUNG]++;
+ if (!walk->accessed_cb)
+ goto next;
+
+ err = walk->accessed_cb(pfn_to_pfn_t(pfn));
+ if (err) {
+ i = find_next_bit(bitmap, MIN_LRU_BATCH, i) + 1;
+
+ walk->next_addr = (*first & PMD_MASK) + i * PMD_SIZE;
+ break;
+ }
next:
i = i > MIN_LRU_BATCH ? 0 : find_next_bit(bitmap, MIN_LRU_BATCH, i) + 1;
} while (i <= MIN_LRU_BATCH);
@@ -3617,9 +3642,10 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
spin_unlock(ptl);
done:
*first = -1;
+ return err;
}
-static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
+static int walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
struct mm_walk *args)
{
int i;
@@ -3631,6 +3657,7 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
unsigned long first = -1;
struct lru_gen_mm_walk *walk = args->private;
struct lru_gen_mm_state *mm_state = get_mm_state(walk->lruvec);
+ int err = 0;
VM_WARN_ON_ONCE(pud_leaf(*pud));
@@ -3644,6 +3671,7 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
/* walk_pte_range() may call get_next_vma() */
vma = args->vma;
for (i = pmd_index(start), addr = start; addr != end; i++, addr = next) {
+ bool suitable;
pmd_t val = pmdp_get_lockless(pmd + i);
next = pmd_addr_end(addr, end);
@@ -3660,7 +3688,10 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
walk->mm_stats[MM_LEAF_TOTAL]++;
if (pfn != -1)
- walk_pmd_range_locked(pud, addr, vma, args, bitmap, &first);
+ err = walk_pmd_range_locked(pud, addr, vma, args,
+ bitmap, &first);
+ if (err)
+ return err;
continue;
}
@@ -3669,33 +3700,50 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
if (!pmd_young(val))
continue;
- walk_pmd_range_locked(pud, addr, vma, args, bitmap, &first);
+ err = walk_pmd_range_locked(pud, addr, vma, args,
+ bitmap, &first);
+ if (err)
+ return err;
}
if (!walk->force_scan && !test_bloom_filter(mm_state, walk->seq, pmd + i))
continue;
+ err = walk_pte_range(&val, addr, next, args, &suitable);
+ if (err && walk->next_addr < next && first == -1)
+ return err;
+
+ walk->nr_total_pte = 0;
+ walk->nr_young_pte = 0;
+
walk->mm_stats[MM_NONLEAF_FOUND]++;
- if (!walk_pte_range(&val, addr, next, args))
- continue;
+ if (!suitable)
+ goto next;
walk->mm_stats[MM_NONLEAF_ADDED]++;
/* carry over to the next generation */
update_bloom_filter(mm_state, walk->seq + 1, pmd + i);
+next:
+ if (err) {
+ walk->next_addr = first;
+ return err;
+ }
}
- walk_pmd_range_locked(pud, -1, vma, args, bitmap, &first);
+ err = walk_pmd_range_locked(pud, -1, vma, args, bitmap, &first);
- if (i < PTRS_PER_PMD && get_next_vma(PUD_MASK, PMD_SIZE, args, &start, &end))
+ if (!err && i < PTRS_PER_PMD && get_next_vma(PUD_MASK, PMD_SIZE, args, &start, &end))
goto restart;
+
+ return err;
}
static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
struct mm_walk *args)
{
- int i;
+ int i, err;
pud_t *pud;
unsigned long addr;
unsigned long next;
@@ -3713,7 +3761,9 @@ static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
if (!pud_present(val) || WARN_ON_ONCE(pud_leaf(val)))
continue;
- walk_pmd_range(&val, addr, next, args);
+ err = walk_pmd_range(&val, addr, next, args);
+ if (err)
+ return err;
if (need_resched() || walk->batched >= MAX_LRU_BATCH) {
end = (addr | ~PUD_MASK) + 1;
@@ -3734,40 +3784,48 @@ static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
return -EAGAIN;
}
-static void walk_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
+static int try_walk_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
{
+ int err;
static const struct mm_walk_ops mm_walk_ops = {
.test_walk = should_skip_vma,
.p4d_entry = walk_pud_range,
.walk_lock = PGWALK_RDLOCK,
};
- int err;
struct lruvec *lruvec = walk->lruvec;
- walk->next_addr = FIRST_USER_ADDRESS;
+ DEFINE_MAX_SEQ(lruvec);
- do {
- DEFINE_MAX_SEQ(lruvec);
+ err = -EBUSY;
- err = -EBUSY;
+ /* another thread might have called inc_max_seq() */
+ if (walk->seq != max_seq)
+ return err;
- /* another thread might have called inc_max_seq() */
- if (walk->seq != max_seq)
- break;
+ /* the caller might be holding the lock for write */
+ if (mmap_read_trylock(mm)) {
+ err = walk_page_range(mm, walk->next_addr, ULONG_MAX,
+ &mm_walk_ops, walk);
- /* the caller might be holding the lock for write */
- if (mmap_read_trylock(mm)) {
- err = walk_page_range(mm, walk->next_addr, ULONG_MAX, &mm_walk_ops, walk);
+ mmap_read_unlock(mm);
+ }
- mmap_read_unlock(mm);
- }
+ if (walk->batched) {
+ spin_lock_irq(&lruvec->lru_lock);
+ reset_batch_size(walk);
+ spin_unlock_irq(&lruvec->lru_lock);
+ }
- if (walk->batched) {
- spin_lock_irq(&lruvec->lru_lock);
- reset_batch_size(walk);
- spin_unlock_irq(&lruvec->lru_lock);
- }
+ return err;
+}
+
+static void walk_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
+{
+ int err;
+ walk->next_addr = FIRST_USER_ADDRESS;
+ do {
+ err = try_walk_mm(mm, walk);
cond_resched();
} while (err == -EAGAIN);
}
@@ -3964,6 +4022,33 @@ static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq, int swappiness
return success;
}
+void lru_gen_scan_lruvec(struct lruvec *lruvec, unsigned long seq,
+ int (*accessed_cb)(pfn_t), void (*flush_cb)(void))
+{
+ struct lru_gen_mm_walk *walk = current->reclaim_state->mm_walk;
+ struct mm_struct *mm = NULL;
+
+ walk->lruvec = lruvec;
+ walk->seq = seq;
+ walk->accessed_cb = accessed_cb;
+ walk->swappiness = MAX_SWAPPINESS;
+
+ do {
+ int err = -EBUSY;
+
+ iterate_mm_list(walk, &mm);
+ if (!mm)
+ break;
+
+ walk->next_addr = FIRST_USER_ADDRESS;
+ do {
+ err = try_walk_mm(mm, walk);
+ cond_resched();
+ flush_cb();
+ } while (err == -EAGAIN);
+ } while (mm);
+}
+
static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq,
int swappiness, bool force_scan)
{
--
2.49.0.395.g12beb8f557-goog
* [RFC PATCH v1 2/2] mm: klruscand: use mglru scanning for page promotion
From: Kinsey Ho @ 2025-03-24 22:03 UTC
To: linux-mm, linux-kernel
Cc: yuanchu, AneeshKumar.KizhakeVeetil, Hasan.Maruf,
Jonathan.Cameron, Michael.Day, akpm, dave.hansen, david,
feng.tang, gourry, hannes, honggyu.kim, hughd, jhubbard,
k.shutemov, kbusch, kmanaouil.dev, leesuyeon0506, leillc,
liam.howlett, mgorman, mingo, nadav.amit, nphamcs, peterz,
raghavendra.kt, riel, rientjes, rppt, shivankg, shy828301, sj,
vbabka, weixugc, willy, ying.huang, ziy, dave, hyeonggon.yoo,
bharata, Kinsey Ho
Introduce a new kernel daemon, klruscand, that periodically invokes the
MGLRU page table walk. It leverages the new callbacks to gather access
information and forward it to the kpromoted daemon for promotion
decisions.

This benefits from reusing the existing MGLRU page table walk
infrastructure, which is optimized with features such as hierarchical
scanning and Bloom filters to keep CPU overhead low.

As a future optimization, the scan interval can be tuned per memcg.
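
End to end, the scan loop looks roughly like this (a condensed sketch
of klruscand_run() from the diff below; reclaim-state setup, memalloc
flags, and the sleep logic are elided, and scan_node() is an
illustrative wrapper, not part of the patch):

	static void scan_node(pg_data_t *pgdat)
	{
		struct mem_cgroup *memcg = mem_cgroup_iter(NULL, NULL, NULL);

		do {
			struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
			unsigned long max_seq = READ_ONCE(lruvec->lrugen.max_seq);

			/* accessed_cb() batches each young pfn and
			 * returns -EAGAIN once the batch array fills;
			 * the walk pauses, flush_cb() forwards the
			 * batch via kpromoted_record_access(), and the
			 * walk then resumes from walk->next_addr.
			 */
			lru_gen_scan_lruvec(lruvec, max_seq, accessed_cb,
					    flush_cb);
			cond_resched();
		} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
	}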
Signed-off-by: Kinsey Ho <kinseyho@google.com>
Signed-off-by: Yuanchu Xie <yuanchu@google.com>
---
 mm/Kconfig     |   8 ++++
 mm/Makefile    |   1 +
 mm/klruscand.c | 118 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 127 insertions(+)
 create mode 100644 mm/klruscand.c
diff --git a/mm/Kconfig b/mm/Kconfig
index ceaa462a0ce6..ed0fa8f2551e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1366,6 +1366,14 @@ config KPROMOTED
Promote hot pages from lower tier to top tier by using the
memory access information provided by various sources.
+config KLRUSCAND
+ bool "Kernel lower tier access scan daemon"
+ default y
+ depends on KPROMOTED && LRU_GEN_WALKS_MMU
+ help
+ Scan for accesses from lower tiers by invoking MGLRU to perform
+ page table walks.
+
source "mm/damon/Kconfig"
endmenu
diff --git a/mm/Makefile b/mm/Makefile
index bf4f5f18f1f9..eb7b76db3b33 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -148,3 +148,4 @@ obj-$(CONFIG_EXECMEM) += execmem.o
obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
obj-$(CONFIG_PT_RECLAIM) += pt_reclaim.o
obj-$(CONFIG_KPROMOTED) += kpromoted.o
+obj-$(CONFIG_KLRUSCAND) += klruscand.o
diff --git a/mm/klruscand.c b/mm/klruscand.c
new file mode 100644
index 000000000000..a53d43c60155
--- /dev/null
+++ b/mm/klruscand.c
@@ -0,0 +1,118 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/memcontrol.h>
+#include <linux/kthread.h>
+#include <linux/module.h>
+#include <linux/vmalloc.h>
+#include <linux/random.h>
+#include <linux/migrate.h>
+#include <linux/mm_inline.h>
+#include <linux/slab.h>
+#include <linux/sched/clock.h>
+#include <linux/memory-tiers.h>
+#include <linux/sched/mm.h>
+#include <linux/sched.h>
+#include <linux/kpromoted.h>
+
+#include "internal.h"
+
+#define KLRUSCAND_INTERVAL_MS 4000
+#define BATCH_SIZE (2 << 16)
+
+static struct task_struct *scan_thread;
+static pfn_t pfn_batch[BATCH_SIZE];
+static int batch_index;
+
+static void flush_cb(void)
+{
+ int i = 0;
+
+ for (; i < batch_index; i++) {
+ u64 pfn = pfn_batch[i].val;
+
+ kpromoted_record_access((unsigned long)pfn, NUMA_NO_NODE,
+ KPROMOTED_PGTABLE_SCAN, jiffies);
+
+ if (i % 16 == 0)
+ cond_resched();
+ }
+ batch_index = 0;
+}
+
+static int accessed_cb(pfn_t pfn)
+{
+ if (batch_index >= BATCH_SIZE)
+ return -EAGAIN;
+
+ pfn_batch[batch_index++] = pfn;
+ return 0;
+}
+
+static int klruscand_run(void *unused)
+{
+ struct lru_gen_mm_walk *walk;
+
+ walk = kzalloc(sizeof(*walk),
+ __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
+ if (!walk)
+ return -ENOMEM;
+
+ while (!kthread_should_stop()) {
+ unsigned long next_wake_time;
+ long sleep_time;
+ struct mem_cgroup *memcg;
+ int flags;
+ int nid;
+
+ next_wake_time = jiffies + msecs_to_jiffies(KLRUSCAND_INTERVAL_MS);
+
+ for_each_node_state(nid, N_MEMORY) {
+ pg_data_t *pgdat = NODE_DATA(nid);
+ struct reclaim_state rs = { 0 };
+
+ if (node_is_toptier(nid))
+ continue;
+
+ rs.mm_walk = walk;
+ set_task_reclaim_state(current, &rs);
+ flags = memalloc_noreclaim_save();
+
+ memcg = mem_cgroup_iter(NULL, NULL, NULL);
+ do {
+ struct lruvec *lruvec =
+ mem_cgroup_lruvec(memcg, pgdat);
+ unsigned long max_seq =
+ READ_ONCE((lruvec)->lrugen.max_seq);
+
+ lru_gen_scan_lruvec(lruvec, max_seq,
+ accessed_cb, flush_cb);
+ cond_resched();
+ } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+ memalloc_noreclaim_restore(flags);
+ set_task_reclaim_state(current, NULL);
+ memset(walk, 0, sizeof(*walk));
+ }
+
+ sleep_time = next_wake_time - jiffies;
+ if (sleep_time > 0 && sleep_time != MAX_SCHEDULE_TIMEOUT)
+ schedule_timeout_idle(sleep_time);
+ }
+ kfree(walk);
+ return 0;
+}
+
+static int __init klruscand_init(void)
+{
+ struct task_struct *task;
+
+ task = kthread_run(klruscand_run, NULL, "klruscand");
+
+ if (IS_ERR(task)) {
+ pr_err("Failed to create klruscand kthread\n");
+ return PTR_ERR(task);
+ }
+
+ scan_thread = task;
+ return 0;
+}
+module_init(klruscand_init);
--
2.49.0.395.g12beb8f557-goog