* [RFC PATCH v3 1/8] mm: migrate: Allow misplaced migration without VMA too
2025-11-10 5:23 [RFC PATCH v3 0/8] mm: Hot page tracking and promotion infrastructure Bharata B Rao
@ 2025-11-10 5:23 ` Bharata B Rao
2025-11-10 5:23 ` [RFC PATCH v3 2/8] migrate: implement migrate_misplaced_folios_batch Bharata B Rao
` (7 subsequent siblings)
8 siblings, 0 replies; 12+ messages in thread
From: Bharata B Rao @ 2025-11-10 5:23 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore, shivankg, Bharata B Rao
We want isolation of misplaced folios to work in contexts
where a VMA isn't available, typically when performing migrations
from a kernel thread context. In order to prepare for that,
allow migrate_misplaced_folio_prepare() to be called with
a NULL VMA.
When migrate_misplaced_folio_prepare() is called with a non-NULL
VMA, it checks whether the folio is mapped shared, and that check
requires holding the PTL. This path isn't taken when the function
is invoked with a NULL VMA (migration outside of process context).
Hence in such cases, it is not necessary that this function be
called with the PTL held.
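As an illustration, a kernel-thread caller would look roughly like
the sketch below (the loop context, target_nid and migrate_list are
assumptions used only for illustration; a later patch in this series
adds the real caller):
	/* No VMA and no PTL: isolate purely based on folio state */
	if (migrate_misplaced_folio_prepare(folio, NULL, target_nid))
		continue;	/* cannot isolate right now, skip it */
	list_add(&folio->lru, &migrate_list);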
Signed-off-by: Bharata B Rao <bharata@amd.com>
---
mm/migrate.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index c0e9f15be2a2..189d0548d4ce 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2644,7 +2644,8 @@ static struct folio *alloc_misplaced_dst_folio(struct folio *src,
/*
* Prepare for calling migrate_misplaced_folio() by isolating the folio if
- * permitted. Must be called with the PTL still held.
+ * permitted. Must be called with the PTL still held if called with a non-NULL
+ * vma.
*/
int migrate_misplaced_folio_prepare(struct folio *folio,
struct vm_area_struct *vma, int node)
@@ -2661,7 +2662,7 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
* See folio_maybe_mapped_shared() on possible imprecision
* when we cannot easily detect if a folio is shared.
*/
- if ((vma->vm_flags & VM_EXEC) && folio_maybe_mapped_shared(folio))
+ if (vma && (vma->vm_flags & VM_EXEC) && folio_maybe_mapped_shared(folio))
return -EACCES;
/*
--
2.34.1
^ permalink raw reply [flat|nested] 12+ messages in thread
* [RFC PATCH v3 2/8] migrate: implement migrate_misplaced_folios_batch
2025-11-10 5:23 [RFC PATCH v3 0/8] mm: Hot page tracking and promotion infrastructure Bharata B Rao
2025-11-10 5:23 ` [RFC PATCH v3 1/8] mm: migrate: Allow misplaced migration without VMA too Bharata B Rao
@ 2025-11-10 5:23 ` Bharata B Rao
2025-11-10 5:23 ` [RFC PATCH v3 3/8] mm: Hot page tracking and promotion Bharata B Rao
` (6 subsequent siblings)
8 siblings, 0 replies; 12+ messages in thread
From: Bharata B Rao @ 2025-11-10 5:23 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore, shivankg, Bharata B Rao
From: Gregory Price <gourry@gourry.net>
A common operation in tiering is to migrate multiple pages at once.
The migrate_misplaced_folio() function requires one call for each
individual folio. Expose a batch variant of the same call for use
when doing batch migrations.
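A minimal usage sketch (the variable names are illustrative only;
the actual caller appears later in this series): candidates are
isolated one at a time and then migrated with a single call:
	LIST_HEAD(migrate_list);
	/* isolate candidate folios first, one call per folio */
	if (!migrate_misplaced_folio_prepare(folio, NULL, nid))
		list_add(&folio->lru, &migrate_list);
	/* then migrate the whole batch; the list is empty on return */
	ret = migrate_misplaced_folios_batch(&migrate_list, nid);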
Signed-off-by: Gregory Price <gourry@gourry.net>
Signed-off-by: Bharata B Rao <bharata@amd.com>
---
include/linux/migrate.h | 6 ++++++
mm/migrate.c | 36 ++++++++++++++++++++++++++++++++++++
2 files changed, 42 insertions(+)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 1f0ac122c3bf..2ace66772c16 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -103,6 +103,7 @@ static inline int set_movable_ops(const struct movable_operations *ops, enum pag
int migrate_misplaced_folio_prepare(struct folio *folio,
struct vm_area_struct *vma, int node);
int migrate_misplaced_folio(struct folio *folio, int node);
+int migrate_misplaced_folios_batch(struct list_head *foliolist, int node);
#else
static inline int migrate_misplaced_folio_prepare(struct folio *folio,
struct vm_area_struct *vma, int node)
@@ -113,6 +114,11 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
{
return -EAGAIN; /* can't migrate now */
}
+static inline int migrate_misplaced_folios_batch(struct list_head *foliolist,
+ int node)
+{
+ return -EAGAIN; /* can't migrate now */
+}
#endif /* CONFIG_NUMA_BALANCING */
#ifdef CONFIG_MIGRATION
diff --git a/mm/migrate.c b/mm/migrate.c
index 189d0548d4ce..990a251aea33 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2739,5 +2739,41 @@ int migrate_misplaced_folio(struct folio *folio, int node)
BUG_ON(!list_empty(&migratepages));
return nr_remaining ? -EAGAIN : 0;
}
+
+/**
+ * migrate_misplaced_folios_batch - Batch variant of migrate_misplaced_folio.
+ * Attempts to migrate a folio list to the specified destination.
+ * @folio_list: Isolated list of folios to be batch-migrated.
+ * @node: The NUMA node ID to where the folios should be migrated.
+ *
+ * Caller is expected to have isolated the folios by calling
+ * migrate_misplaced_folio_prepare(), which will result in an
+ * elevated reference count on the folio.
+ *
+ * This function will un-isolate the folios, dereference them, and
+ * remove them from the list before returning.
+ *
+ * Return: 0 on success and -EAGAIN on failure or partial migration.
+ * On return, @folio_list will be empty regardless of success/failure.
+ */
+int migrate_misplaced_folios_batch(struct list_head *folio_list, int node)
+{
+ pg_data_t *pgdat = NODE_DATA(node);
+ unsigned int nr_succeeded;
+ int nr_remaining;
+
+ nr_remaining = migrate_pages(folio_list, alloc_misplaced_dst_folio,
+ NULL, node, MIGRATE_ASYNC,
+ MR_NUMA_MISPLACED, &nr_succeeded);
+ if (nr_remaining)
+ putback_movable_pages(folio_list);
+
+ if (nr_succeeded) {
+ count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
+ mod_node_page_state(pgdat, PGPROMOTE_SUCCESS, nr_succeeded);
+ }
+ BUG_ON(!list_empty(folio_list));
+ return nr_remaining ? -EAGAIN : 0;
+}
#endif /* CONFIG_NUMA_BALANCING */
#endif /* CONFIG_NUMA */
--
2.34.1
^ permalink raw reply [flat|nested] 12+ messages in thread
* [RFC PATCH v3 3/8] mm: Hot page tracking and promotion
2025-11-10 5:23 [RFC PATCH v3 0/8] mm: Hot page tracking and promotion infrastructure Bharata B Rao
2025-11-10 5:23 ` [RFC PATCH v3 1/8] mm: migrate: Allow misplaced migration without VMA too Bharata B Rao
2025-11-10 5:23 ` [RFC PATCH v3 2/8] migrate: implement migrate_misplaced_folios_batch Bharata B Rao
@ 2025-11-10 5:23 ` Bharata B Rao
[not found] ` <CGME20251126132450epcas5p123220533572f40d70799294cd3ca4819@epcas5p1.samsung.com>
2025-11-10 5:23 ` [RFC PATCH v3 4/8] x86: ibs: In-kernel IBS driver for memory access profiling Bharata B Rao
` (5 subsequent siblings)
8 siblings, 1 reply; 12+ messages in thread
From: Bharata B Rao @ 2025-11-10 5:23 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore, shivankg, Bharata B Rao
This introduces a sub-system for collecting memory access
information from different sources. It maintains the hotness
information based on the access history and time of access.
Additionally, it provides per-lowertier-node kernel threads
(named kmigrated) that periodically promote the pages that
are eligible for promotion.
Sub-systems that generate hot page access info can report that
using this API:
int pghot_record_access(unsigned long pfn, int nid, int src,
unsigned long time)
@pfn: The PFN of the memory accessed
@nid: The accessing NUMA node ID
@src: The temperature source (sub-system) that generated the
access info
@time: The access time in jiffies
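For example, a source that has just observed an access to a given
PFN from node 1 would (illustratively) report it as:
	pghot_record_access(pfn, 1, PGHOT_HINT_FAULT, jiffies);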
Some temperature sources may not provide the nid from which
the page was accessed. This is true for sources that scan
page tables for the PTE Accessed bit. For such sources,
the default toptier node to which such pages should be promoted
is hardcoded.
Also, the access time provided by some sources may at best be
considered approximate. This is especially true for hot pages
detected by PTE Accessed bit scanning.
The hotness information is stored for every page of lower
tier memory in an unsigned long variable that is part of
the mem_section data structure.
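The layout of that per-page word, as defined in pghot.h below, is:
	bits  0-9  : target nid
	bits 10-12 : access frequency
	bits 13-31 : last access time (in jiffies, truncated)
	bits 32-62 : unused
	bit  63    : PGHOT_MIGRATE_READY (page is ready for migration)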
kmigrated is a per-lowertier-node kernel thread that migrates
the folios marked for migration in batches. Each kmigrated
thread walks the PFN range spanning its node and checks
for potential migration candidates.
Signed-off-by: Bharata B Rao <bharata@amd.com>
---
include/linux/mmzone.h | 14 ++
include/linux/pghot.h | 52 ++++
include/linux/vm_event_item.h | 4 +
mm/Kconfig | 11 +
mm/Makefile | 1 +
mm/mm_init.c | 10 +
mm/page_ext.c | 11 +
mm/pghot.c | 446 ++++++++++++++++++++++++++++++++++
mm/vmstat.c | 4 +
9 files changed, 553 insertions(+)
create mode 100644 include/linux/pghot.h
create mode 100644 mm/pghot.c
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7fb7331c5725..fde851990394 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1068,6 +1068,7 @@ enum pgdat_flags {
* many pages under writeback
*/
PGDAT_RECLAIM_LOCKED, /* prevents concurrent reclaim */
+ PGDAT_KMIGRATED_ACTIVATE, /* activates kmigrated */
};
enum zone_flags {
@@ -1522,6 +1523,10 @@ typedef struct pglist_data {
#ifdef CONFIG_MEMORY_FAILURE
struct memory_failure_stats mf_stats;
#endif
+#ifdef CONFIG_PGHOT
+ struct task_struct *kmigrated;
+ wait_queue_head_t kmigrated_wait;
+#endif
} pg_data_t;
#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
@@ -1920,12 +1925,21 @@ struct mem_section {
unsigned long section_mem_map;
struct mem_section_usage *usage;
+#ifdef CONFIG_PGHOT
+ /*
+ * Per-PFN hotness data for this section.
+ */
+ unsigned long *hot_map;
+#endif
#ifdef CONFIG_PAGE_EXTENSION
/*
* If SPARSEMEM, pgdat doesn't have page_ext pointer. We use
* section. (see page_ext.h about this.)
*/
struct page_ext *page_ext;
+#endif
+#if (defined(CONFIG_PGHOT) && !defined(CONFIG_PAGE_EXTENSION)) || \
+ (!defined(CONFIG_PGHOT) && defined(CONFIG_PAGE_EXTENSION))
unsigned long pad;
#endif
/*
diff --git a/include/linux/pghot.h b/include/linux/pghot.h
new file mode 100644
index 000000000000..7238ddf18a35
--- /dev/null
+++ b/include/linux/pghot.h
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PGHOT_H
+#define _LINUX_PGHOT_H
+
+/* Page hotness temperature sources */
+enum pghot_src {
+ PGHOT_HW_HINTS,
+ PGHOT_PGTABLE_SCAN,
+ PGHOT_HINT_FAULT,
+};
+
+#ifdef CONFIG_PGHOT
+#define PGHOT_FREQ_WINDOW (5 * MSEC_PER_SEC)
+#define PGHOT_FREQ_THRESHOLD 2
+
+#define KMIGRATE_DELAY_MS 100
+#define KMIGRATE_BATCH 512
+
+#define PGHOT_DEFAULT_NODE 0
+
+/*
+ * Bits 0-31 are used to store nid, frequency and time.
+ * Bits 32-62 are unused now.
+ * Bit 63 is used to indicate the page is ready for migration.
+ */
+#define PGHOT_MIGRATE_READY 63
+
+#define PGHOT_NID_WIDTH 10
+#define PGHOT_FREQ_WIDTH 3
+/* time is stored in 19 bits which can represent up to ~8.7 minutes with HZ=1000 */
+#define PGHOT_TIME_WIDTH 19
+
+#define PGHOT_NID_SHIFT 0
+#define PGHOT_FREQ_SHIFT (PGHOT_NID_SHIFT + PGHOT_NID_WIDTH)
+#define PGHOT_TIME_SHIFT (PGHOT_FREQ_SHIFT + PGHOT_FREQ_WIDTH)
+
+#define PGHOT_NID_MASK ((1UL << PGHOT_NID_WIDTH) - 1)
+#define PGHOT_FREQ_MASK ((1UL << PGHOT_FREQ_WIDTH) - 1)
+#define PGHOT_TIME_MASK ((1UL << PGHOT_TIME_WIDTH) - 1)
+
+#define PGHOT_NID_MAX (1 << PGHOT_NID_WIDTH)
+#define PGHOT_FREQ_MAX (1 << PGHOT_FREQ_WIDTH)
+#define PGHOT_TIME_MAX (1 << PGHOT_TIME_WIDTH)
+
+int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now);
+#else
+static inline int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
+{
+ return 0;
+}
+#endif /* CONFIG_PGHOT */
+#endif /* _LINUX_PGHOT_H */
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 92f80b4d69a6..4731d667231d 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -188,6 +188,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
KSTACK_REST,
#endif
#endif /* CONFIG_DEBUG_STACK_USAGE */
+ PGHOT_RECORDED_ACCESSES,
+ PGHOT_RECORD_HWHINTS,
+ PGHOT_RECORD_PGTSCANS,
+ PGHOT_RECORD_HINTFAULTS,
NR_VM_EVENT_ITEMS
};
diff --git a/mm/Kconfig b/mm/Kconfig
index 0e26f4fc8717..b5e84cb50253 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1372,6 +1372,17 @@ config PT_RECLAIM
config FIND_NORMAL_PAGE
def_bool n
+config PGHOT
+ bool "Hot page tracking and promotion"
+ default y
+ depends on NUMA && MIGRATION && SPARSEMEM && MMU
+ help
+ A sub-system to track page accesses in lower tier memory and
+ maintain hot page information. Promotes hot pages from lower
+ tiers to top tier by using the memory access information provided
+ by various sources. Asynchronous promotion is done by per-node
+ kernel threads.
+
source "mm/damon/Kconfig"
endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 21abb3353550..a6fac171c36e 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -146,3 +146,4 @@ obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
obj-$(CONFIG_EXECMEM) += execmem.o
obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
obj-$(CONFIG_PT_RECLAIM) += pt_reclaim.o
+obj-$(CONFIG_PGHOT) += pghot.o
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 3db2dea7db4c..2c0199f7691b 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1401,6 +1401,15 @@ static void pgdat_init_kcompactd(struct pglist_data *pgdat)
static void pgdat_init_kcompactd(struct pglist_data *pgdat) {}
#endif
+#ifdef CONFIG_PGHOT
+static void pgdat_init_kmigrated(struct pglist_data *pgdat)
+{
+ init_waitqueue_head(&pgdat->kmigrated_wait);
+}
+#else
+static inline void pgdat_init_kmigrated(struct pglist_data *pgdat) {}
+#endif
+
static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
{
int i;
@@ -1410,6 +1419,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
pgdat_init_split_queue(pgdat);
pgdat_init_kcompactd(pgdat);
+ pgdat_init_kmigrated(pgdat);
init_waitqueue_head(&pgdat->kswapd_wait);
init_waitqueue_head(&pgdat->pfmemalloc_wait);
diff --git a/mm/page_ext.c b/mm/page_ext.c
index d7396a8970e5..a32d43755306 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -76,6 +76,16 @@ static struct page_ext_operations page_idle_ops __initdata = {
};
#endif
+static bool need_page_mig(void)
+{
+ return true;
+}
+
+static struct page_ext_operations page_mig_ops __initdata = {
+ .need = need_page_mig,
+ .need_shared_flags = true,
+};
+
static struct page_ext_operations *page_ext_ops[] __initdata = {
#ifdef CONFIG_PAGE_OWNER
&page_owner_ops,
@@ -89,6 +99,7 @@ static struct page_ext_operations *page_ext_ops[] __initdata = {
#ifdef CONFIG_PAGE_TABLE_CHECK
&page_table_check_ops,
#endif
+ &page_mig_ops,
};
unsigned long page_ext_size;
diff --git a/mm/pghot.c b/mm/pghot.c
new file mode 100644
index 000000000000..7c1a32f8a7ba
--- /dev/null
+++ b/mm/pghot.c
@@ -0,0 +1,446 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Maintains information about hot pages from slower tier nodes and
+ * promotes them.
+ *
+ * Per-PFN hotness information is stored for lower tier nodes in
+ * mem_section. An unsigned long variable is used to store the
+ * frequency of access, last access time and the nid to which the
+ * page needs to be migrated.
+ *
+ * A kernel thread named kmigrated is provided to migrate or promote
+ * the hot pages. kmigrated runs for each lower tier node. It iterates
+ * over the node's PFNs and migrates pages marked for migration into
+ * their targeted nodes.
+ */
+#include <linux/mm.h>
+#include <linux/migrate.h>
+#include <linux/memory-tiers.h>
+#include <linux/cpuhotplug.h>
+#include <linux/pghot.h>
+
+static unsigned int sysctl_pghot_freq_window = PGHOT_FREQ_WINDOW;
+
+/*
+ * Sysctl tunables to selectively enable access recording from different
+ * sources.
+ */
+static unsigned int sysctl_pghot_record_hwhints_enable;
+static unsigned int sysctl_pghot_record_pgtscans_enable;
+static unsigned int sysctl_pghot_record_hintfaults_enable;
+
+static DEFINE_STATIC_KEY_FALSE(pghot_record_hwhints);
+static DEFINE_STATIC_KEY_FALSE(pghot_record_pgtscans);
+static DEFINE_STATIC_KEY_FALSE(pghot_record_hintfaults);
+
+#ifdef CONFIG_SYSCTL
+static int sysctl_record_enable_handler(const struct ctl_table *table, int write,
+ void *buffer, size_t *lenp, loff_t *ppos)
+{
+ int err, val;
+
+ err = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+ if (err || !write)
+ return err;
+
+ val = *(int *)table->data;
+
+ if (table->data == &sysctl_pghot_record_hwhints_enable) {
+ if (val)
+ static_branch_enable(&pghot_record_hwhints);
+ else
+ static_branch_disable(&pghot_record_hwhints);
+ } else if (table->data == &sysctl_pghot_record_pgtscans_enable) {
+ if (val)
+ static_branch_enable(&pghot_record_pgtscans);
+ else
+ static_branch_disable(&pghot_record_pgtscans);
+ } else if (table->data == &sysctl_pghot_record_hintfaults_enable) {
+ if (val)
+ static_branch_enable(&pghot_record_hintfaults);
+ else
+ static_branch_disable(&pghot_record_hintfaults);
+ }
+ return 0;
+}
+
+static const struct ctl_table pghot_sysctls[] = {
+ {
+ .procname = "pghot_record_hwhints_enable",
+ .data = &sysctl_pghot_record_hwhints_enable,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = sysctl_record_enable_handler,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
+ },
+ {
+ .procname = "pghot_record_pgtscans_enable",
+ .data = &sysctl_pghot_record_pgtscans_enable,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = sysctl_record_enable_handler,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
+ },
+ {
+ .procname = "pghot_record_hintfaults_enable",
+ .data = &sysctl_pghot_record_hintfaults_enable,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = sysctl_record_enable_handler,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
+ },
+ {
+ .procname = "pghot_promote_freq_window_ms",
+ .data = &sysctl_pghot_freq_window,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ },
+};
+#endif
+
+static bool kmigrated_started __ro_after_init;
+
+/**
+ * pghot_record_access - Record page accesses from lower tier memory
+ * for the purpose of tracking page hotness and subsequent promotion.
+ *
+ * @pfn: PFN of the page
+ * @nid: Target NID to where the page needs to be migrated
+ * @src: The identifier of the sub-system that reports the access
+ * @now: Access time in jiffies
+ *
+ * Updates the NID, frequency and time of access and marks the page as
+ * ready for migration if the frequency crosses a threshold. The pages
+ * marked for migration are then migrated by the kmigrated kernel thread.
+ *
+ * Return: 0 on success and -EINVAL on failure to record
+ * the access.
+ */
+int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
+{
+ unsigned long time = now & PGHOT_TIME_MASK;
+ unsigned long old_nid, old_freq, old_time;
+ unsigned long *phi, old_hotness, hotness;
+ bool new_window = false;
+ struct mem_section *ms;
+ struct folio *folio;
+ struct page *page;
+ unsigned long freq;
+
+ if (!kmigrated_started)
+ return -EINVAL;
+
+ if (nid >= PGHOT_NID_MAX)
+ return -EINVAL;
+
+ count_vm_event(PGHOT_RECORDED_ACCESSES);
+ switch (src) {
+ case PGHOT_HW_HINTS:
+ if (!static_branch_likely(&pghot_record_hwhints))
+ return -EINVAL;
+ count_vm_event(PGHOT_RECORD_HWHINTS);
+ break;
+ case PGHOT_PGTABLE_SCAN:
+ if (!static_branch_likely(&pghot_record_pgtscans))
+ return -EINVAL;
+ count_vm_event(PGHOT_RECORD_PGTSCANS);
+ break;
+ case PGHOT_HINT_FAULT:
+ if (!static_branch_likely(&pghot_record_hintfaults))
+ return -EINVAL;
+ count_vm_event(PGHOT_RECORD_HINTFAULTS);
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ /*
+ * Record only accesses from lower tiers.
+ */
+ if (node_is_toptier(pfn_to_nid(pfn)))
+ return 0;
+
+ /*
+ * Reject the non-migratable pages right away.
+ */
+ page = pfn_to_online_page(pfn);
+ if (!page || is_zone_device_page(page))
+ return 0;
+
+ folio = page_folio(page);
+ if (!folio_test_lru(folio))
+ return 0;
+
+ /* Get the hotness slot corresponding to the 1st PFN of the folio */
+ pfn = folio_pfn(folio);
+ ms = __pfn_to_section(pfn);
+ if (!ms)
+ return -EINVAL;
+ phi = &ms->hot_map[pfn % PAGES_PER_SECTION];
+
+ /*
+ * Update the hotness parameters.
+ */
+ old_hotness = READ_ONCE(*phi);
+ do {
+ hotness = old_hotness;
+ old_nid = (hotness >> PGHOT_NID_SHIFT) & PGHOT_NID_MASK;
+ old_freq = (hotness >> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK;
+ old_time = (hotness >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK;
+
+ if (((time - old_time) > msecs_to_jiffies(sysctl_pghot_freq_window))
+ || (nid != NUMA_NO_NODE && old_nid != nid))
+ new_window = true;
+
+ if (new_window)
+ freq = 1;
+ else
+ freq = min(old_freq + 1, PGHOT_FREQ_MAX - 1UL);
+ nid = (nid == NUMA_NO_NODE) ? PGHOT_DEFAULT_NODE : nid;
+
+ hotness &= ~(PGHOT_NID_MASK << PGHOT_NID_SHIFT);
+ hotness &= ~(PGHOT_FREQ_MASK << PGHOT_FREQ_SHIFT);
+ hotness &= ~(PGHOT_TIME_MASK << PGHOT_TIME_SHIFT);
+
+ hotness |= (nid & PGHOT_NID_MASK) << PGHOT_NID_SHIFT;
+ hotness |= (freq & PGHOT_FREQ_MASK) << PGHOT_FREQ_SHIFT;
+ hotness |= (time & PGHOT_TIME_MASK) << PGHOT_TIME_SHIFT;
+
+ if (freq > PGHOT_FREQ_THRESHOLD)
+ set_bit(PGHOT_MIGRATE_READY, &hotness);
+ } while (unlikely(!try_cmpxchg(phi, &old_hotness, hotness)));
+
+ if (test_bit(PGHOT_MIGRATE_READY, &hotness))
+ set_bit(PGDAT_KMIGRATED_ACTIVATE, &page_pgdat(page)->flags);
+ return 0;
+}
+
+static int pghot_get_hotness(unsigned long pfn, unsigned long *nid, unsigned long *freq,
+ unsigned long *time)
+{
+ unsigned long *phi, old_hotness, hotness;
+ struct mem_section *ms;
+
+ ms = __pfn_to_section(pfn);
+ if (!ms)
+ return -EINVAL;
+
+ phi = &ms->hot_map[pfn % PAGES_PER_SECTION];
+ if (!test_and_clear_bit(PGHOT_MIGRATE_READY, phi))
+ return -EINVAL;
+
+ old_hotness = READ_ONCE(*phi);
+ do {
+ hotness = old_hotness;
+ *nid = (hotness >> PGHOT_NID_SHIFT) & PGHOT_NID_MASK;
+ *freq = (hotness >> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK;
+ *time = (hotness >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK;
+ hotness = 0;
+
+ } while (unlikely(!try_cmpxchg(phi, &old_hotness, hotness)));
+ return 0;
+}
+
+/*
+ * Walks the PFNs of the zone, isolates and migrates them in batches.
+ */
+static void kmigrated_walk_zone(unsigned long start_pfn, unsigned long end_pfn,
+ int src_nid)
+{
+ int cur_nid = NUMA_NO_NODE;
+ LIST_HEAD(migrate_list);
+ int batch_count = 0;
+ struct folio *folio;
+ struct page *page;
+ unsigned long pfn;
+
+ pfn = start_pfn;
+ do {
+ unsigned long nid = NUMA_NO_NODE, freq = 0, time = 0, nr = 1;
+
+ if (!pfn_valid(pfn))
+ goto out_next;
+
+ page = pfn_to_online_page(pfn);
+ if (!page)
+ goto out_next;
+
+ folio = page_folio(page);
+ nr = folio_nr_pages(folio);
+ if (folio_nid(folio) != src_nid)
+ goto out_next;
+
+ if (!folio_test_lru(folio))
+ goto out_next;
+
+ if (pghot_get_hotness(pfn, &nid, &freq, &time))
+ goto out_next;
+
+ if (nid == NUMA_NO_NODE)
+ goto out_next;
+
+ if (folio_nid(folio) == nid)
+ goto out_next;
+
+ if (migrate_misplaced_folio_prepare(folio, NULL, nid))
+ goto out_next;
+
+ if (cur_nid == NUMA_NO_NODE)
+ cur_nid = nid;
+
+ if (++batch_count >= KMIGRATE_BATCH || cur_nid != nid) {
+ migrate_misplaced_folios_batch(&migrate_list, cur_nid);
+ cur_nid = nid;
+ batch_count = 0;
+ cond_resched();
+ }
+ list_add(&folio->lru, &migrate_list);
+out_next:
+ pfn += nr;
+ } while (pfn < end_pfn);
+ if (!list_empty(&migrate_list))
+ migrate_misplaced_folios_batch(&migrate_list, cur_nid);
+}
+
+static void kmigrated_do_work(pg_data_t *pgdat)
+{
+ struct zone *zone;
+ int zone_idx;
+
+ clear_bit(PGDAT_KMIGRATED_ACTIVATE, &pgdat->flags);
+ for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) {
+ zone = &pgdat->node_zones[zone_idx];
+
+ if (!populated_zone(zone))
+ continue;
+
+ if (zone_is_zone_device(zone))
+ continue;
+
+ kmigrated_walk_zone(zone->zone_start_pfn, zone_end_pfn(zone),
+ pgdat->node_id);
+ }
+}
+
+static inline bool kmigrated_work_requested(pg_data_t *pgdat)
+{
+ return test_bit(PGDAT_KMIGRATED_ACTIVATE, &pgdat->flags);
+}
+
+/*
+ * Per-node kthread that iterates over its PFNs and migrates the
+ * pages that have been marked for migration.
+ */
+static int kmigrated(void *p)
+{
+ long timeout = msecs_to_jiffies(KMIGRATE_DELAY_MS);
+ pg_data_t *pgdat = p;
+
+ while (!kthread_should_stop()) {
+ if (wait_event_timeout(pgdat->kmigrated_wait, kmigrated_work_requested(pgdat),
+ timeout))
+ kmigrated_do_work(pgdat);
+ }
+ return 0;
+}
+
+static int kmigrated_run(int nid)
+{
+ pg_data_t *pgdat = NODE_DATA(nid);
+ int ret;
+
+ if (node_is_toptier(nid))
+ return 0;
+
+ if (!pgdat->kmigrated) {
+ pgdat->kmigrated = kthread_create_on_node(kmigrated, pgdat, nid,
+ "kmigrated%d", nid);
+ if (IS_ERR(pgdat->kmigrated)) {
+ ret = PTR_ERR(pgdat->kmigrated);
+ pgdat->kmigrated = NULL;
+ pr_err("Failed to start kmigrated%d, ret %d\n", nid, ret);
+ return ret;
+ }
+ pr_info("pghot: Started kmigrated thread for node %d\n", nid);
+ }
+ wake_up_process(pgdat->kmigrated);
+ return 0;
+}
+
+static void pghot_free_hot_map(void)
+{
+ unsigned long section_nr, s_begin;
+ struct mem_section *ms;
+
+ /* s_begin = first_present_section_nr(); */
+ s_begin = next_present_section_nr(-1);
+ for_each_present_section_nr(s_begin, section_nr) {
+ ms = __nr_to_section(section_nr);
+ kfree(ms->hot_map);
+ }
+}
+
+static int pghot_alloc_hot_map(void)
+{
+ unsigned long section_nr, s_begin, start_pfn;
+ struct mem_section *ms;
+ int nid;
+
+ /* s_begin = first_present_section_nr(); */
+ s_begin = next_present_section_nr(-1);
+ for_each_present_section_nr(s_begin, section_nr) {
+ ms = __nr_to_section(section_nr);
+ start_pfn = section_nr_to_pfn(section_nr);
+ nid = pfn_to_nid(start_pfn);
+
+ if (node_is_toptier(nid) || !pfn_valid(start_pfn))
+ continue;
+
+ ms->hot_map = kcalloc_node(PAGES_PER_SECTION, sizeof(*ms->hot_map), GFP_KERNEL,
+ nid);
+ if (!ms->hot_map)
+ goto out_free_hot_map;
+ }
+ return 0;
+
+out_free_hot_map:
+ pghot_free_hot_map();
+ return -ENOMEM;
+}
+
+static int __init pghot_init(void)
+{
+ pg_data_t *pgdat;
+ int nid, ret;
+
+ ret = pghot_alloc_hot_map();
+ if (ret)
+ return ret;
+
+ for_each_node_state(nid, N_MEMORY) {
+ ret = kmigrated_run(nid);
+ if (ret)
+ goto out_stop_kthread;
+ }
+ register_sysctl_init("vm", pghot_sysctls);
+ kmigrated_started = true;
+ return 0;
+
+out_stop_kthread:
+ for_each_node_state(nid, N_MEMORY) {
+ pgdat = NODE_DATA(nid);
+ if (pgdat->kmigrated) {
+ kthread_stop(pgdat->kmigrated);
+ pgdat->kmigrated = NULL;
+ }
+ }
+ pghot_free_hot_map();
+ return ret;
+}
+
+late_initcall_sync(pghot_init);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index bb09c032eecf..49d974f8e8b3 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1496,6 +1496,10 @@ const char * const vmstat_text[] = {
[I(KSTACK_REST)] = "kstack_rest",
#endif
#endif
+ [I(PGHOT_RECORDED_ACCESSES)] = "pghot_recorded_accesses",
+ [I(PGHOT_RECORD_HWHINTS)] = "pghot_recorded_hwhints",
+ [I(PGHOT_RECORD_PGTSCANS)] = "pghot_recorded_pgtscans",
+ [I(PGHOT_RECORD_HINTFAULTS)] = "pghot_recorded_hintfaults",
#undef I
#endif /* CONFIG_VM_EVENT_COUNTERS */
};
--
2.34.1
^ permalink raw reply [flat|nested] 12+ messages in thread
* [RFC PATCH v3 4/8] x86: ibs: In-kernel IBS driver for memory access profiling
2025-11-10 5:23 [RFC PATCH v3 0/8] mm: Hot page tracking and promotion infrastructure Bharata B Rao
` (2 preceding siblings ...)
2025-11-10 5:23 ` [RFC PATCH v3 3/8] mm: Hot page tracking and promotion Bharata B Rao
@ 2025-11-10 5:23 ` Bharata B Rao
2025-11-10 5:23 ` [RFC PATCH v3 5/8] x86: ibs: Enable IBS profiling for memory accesses Bharata B Rao
` (4 subsequent siblings)
8 siblings, 0 replies; 12+ messages in thread
From: Bharata B Rao @ 2025-11-10 5:23 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore, shivankg, Bharata B Rao
Use the IBS (Instruction Based Sampling) feature present
in AMD processors for memory access tracking. The access
information obtained from IBS via NMI is fed to the pghot
sub-system for further action.
In addition to much other information related to the memory
access, IBS provides the physical (and virtual) address of the
access and indicates if the access came from a slower tier. Only
memory accesses originating from slower tiers are further acted
upon by this driver.
The samples are initially accumulated in percpu buffers which
are flushed to the pghot hot page tracking mechanism via irq_work.
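The intended flow through the driver, as implemented below, is
roughly:
	IBS NMI -> ibs_push_sample() into a percpu ring buffer
		-> irq_work_queue() -> schedule_work_on()
		-> ibs_work_handler() -> pghot_record_access()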
TODO: Many counters are added to vmstat just as a debugging aid
for now.
About IBS
---------
IBS can be programmed to provide data about instruction
execution periodically. This is done by programming a desired
sample count (number of ops) in a control register. When the
programmed number of ops are dispatched, a micro-op gets tagged,
various information about the tagged micro-op's execution is
populated in IBS execution MSRs and an interrupt is raised.
While IBS provides a lot of data for each sample, for the
purpose of memory access profiling, we are interested in the
linear and physical address of the memory access that reached
DRAM. Recent AMD processors provide further filtering where
it is possible to limit the sampling to those ops that had
an L3 miss, which greatly reduces the non-useful samples.
While IBS provides capability to sample instruction fetch
and execution, only IBS execution sampling is used here
to collect data about memory accesses that occur during
the instruction execution.
More information about IBS is available in Sec 13.3 of
AMD64 Architecture Programmer's Manual, Volume 2:System
Programming which is present at:
https://bugzilla.kernel.org/attachment.cgi?id=288923
Information about MSRs used for programming IBS can be
found in Sec 2.1.14.4 of PPR Vol 1 for AMD Family 19h
Model 11h B1 which is currently present at:
https://www.amd.com/system/files/TechDocs/55901_0.25.zip
Signed-off-by: Bharata B Rao <bharata@amd.com>
---
arch/x86/events/amd/ibs.c | 11 ++
arch/x86/include/asm/ibs.h | 7 +
arch/x86/include/asm/msr-index.h | 16 ++
arch/x86/mm/Makefile | 3 +-
arch/x86/mm/ibs.c | 311 +++++++++++++++++++++++++++++++
include/linux/vm_event_item.h | 17 ++
mm/vmstat.c | 17 ++
7 files changed, 381 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/include/asm/ibs.h
create mode 100644 arch/x86/mm/ibs.c
diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c
index 112f43b23ebf..1498dc9caeb2 100644
--- a/arch/x86/events/amd/ibs.c
+++ b/arch/x86/events/amd/ibs.c
@@ -13,9 +13,11 @@
#include <linux/ptrace.h>
#include <linux/syscore_ops.h>
#include <linux/sched/clock.h>
+#include <linux/pghot.h>
#include <asm/apic.h>
#include <asm/msr.h>
+#include <asm/ibs.h>
#include "../perf_event.h"
@@ -1756,6 +1758,15 @@ static __init int amd_ibs_init(void)
{
u32 caps;
+ /*
+ * TODO: Find a clean way to disable perf IBS so that IBS
+ * can be used for memory access profiling.
+ */
+ if (arch_hw_access_profiling) {
+ pr_info("IBS isn't available for perf use\n");
+ return 0;
+ }
+
caps = __get_ibs_caps();
if (!caps)
return -ENODEV; /* ibs not supported by the cpu */
diff --git a/arch/x86/include/asm/ibs.h b/arch/x86/include/asm/ibs.h
new file mode 100644
index 000000000000..b5a4f2ca6330
--- /dev/null
+++ b/arch/x86/include/asm/ibs.h
@@ -0,0 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_IBS_H
+#define _ASM_X86_IBS_H
+
+extern bool arch_hw_access_profiling;
+
+#endif /* _ASM_X86_IBS_H */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 9e1720d73244..59657bd768c9 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -760,6 +760,22 @@
/* AMD Last Branch Record MSRs */
#define MSR_AMD64_LBR_SELECT 0xc000010e
+/* AMD IBS MSR bits */
+#define MSR_AMD64_IBSOPDATA2_DATASRC 0x7
+#define MSR_AMD64_IBSOPDATA2_DATASRC_LCL_CACHE 0x1
+#define MSR_AMD64_IBSOPDATA2_DATASRC_PEER_CACHE_NEAR 0x2
+#define MSR_AMD64_IBSOPDATA2_DATASRC_DRAM 0x3
+#define MSR_AMD64_IBSOPDATA2_DATASRC_FAR_CCX_CACHE 0x5
+#define MSR_AMD64_IBSOPDATA2_DATASRC_EXT_MEM 0x8
+#define MSR_AMD64_IBSOPDATA2_RMTNODE 0x10
+
+#define MSR_AMD64_IBSOPDATA3_LDOP BIT_ULL(0)
+#define MSR_AMD64_IBSOPDATA3_STOP BIT_ULL(1)
+#define MSR_AMD64_IBSOPDATA3_DCMISS BIT_ULL(7)
+#define MSR_AMD64_IBSOPDATA3_LADDR_VALID BIT_ULL(17)
+#define MSR_AMD64_IBSOPDATA3_PADDR_VALID BIT_ULL(18)
+#define MSR_AMD64_IBSOPDATA3_L2MISS BIT_ULL(20)
+
/* Zen4 */
#define MSR_ZEN4_BP_CFG 0xc001102e
#define MSR_ZEN4_BP_CFG_BP_SPEC_REDUCE_BIT 4
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5b9908f13dcf..967e5af9eba9 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -22,7 +22,8 @@ CFLAGS_REMOVE_pgprot.o = -pg
endif
obj-y := init.o init_$(BITS).o fault.o ioremap.o extable.o mmap.o \
- pgtable.o physaddr.o tlb.o cpu_entry_area.o maccess.o pgprot.o
+ pgtable.o physaddr.o tlb.o cpu_entry_area.o maccess.o pgprot.o \
+ ibs.o
obj-y += pat/
diff --git a/arch/x86/mm/ibs.c b/arch/x86/mm/ibs.c
new file mode 100644
index 000000000000..de2e506fce48
--- /dev/null
+++ b/arch/x86/mm/ibs.c
@@ -0,0 +1,311 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/init.h>
+#include <linux/pghot.h>
+#include <linux/percpu.h>
+#include <linux/workqueue.h>
+#include <linux/irq_work.h>
+
+#include <asm/nmi.h>
+#include <asm/perf_event.h> /* TODO: Move defns like IBS_OP_ENABLE into non-perf header */
+#include <asm/apic.h>
+#include <asm/ibs.h>
+
+bool arch_hw_access_profiling;
+static u64 ibs_config __read_mostly;
+static u32 ibs_caps;
+
+#define IBS_NR_SAMPLES 150
+
+/*
+ * Basic access info captured for each memory access.
+ */
+struct ibs_sample {
+ unsigned long pfn;
+ unsigned long time; /* jiffies when accessed */
+ int nid; /* Accessing node ID, if known */
+};
+
+/*
+ * Percpu buffer of access samples. Samples are accumulated here
+ * before pushing them to pghot sub-system for further action.
+ */
+struct ibs_sample_pcpu {
+ struct ibs_sample samples[IBS_NR_SAMPLES];
+ int head, tail;
+};
+
+static struct ibs_sample_pcpu __percpu *ibs_s;
+
+/*
+ * The workqueue for pushing the percpu access samples to pghot sub-system.
+ */
+static struct work_struct ibs_work;
+static struct irq_work ibs_irq_work;
+
+/*
+ * Record the IBS-reported access sample in percpu buffer.
+ * Called from IBS NMI handler.
+ */
+static int ibs_push_sample(unsigned long pfn, int nid, unsigned long time)
+{
+ struct ibs_sample_pcpu *ibs_pcpu = raw_cpu_ptr(ibs_s);
+ int next = ibs_pcpu->head + 1;
+
+ if (next >= IBS_NR_SAMPLES)
+ next = 0;
+
+ if (next == ibs_pcpu->tail)
+ return 0;
+
+ ibs_pcpu->samples[ibs_pcpu->head].pfn = pfn;
+ ibs_pcpu->samples[ibs_pcpu->head].nid = nid;
+ ibs_pcpu->samples[ibs_pcpu->head].time = time;
+ ibs_pcpu->head = next;
+ return 1;
+}
+
+static int ibs_pop_sample(struct ibs_sample *s)
+{
+ struct ibs_sample_pcpu *ibs_pcpu = raw_cpu_ptr(ibs_s);
+
+ int next = ibs_pcpu->tail + 1;
+
+ if (ibs_pcpu->head == ibs_pcpu->tail)
+ return 0;
+
+ if (next >= IBS_NR_SAMPLES)
+ next = 0;
+
+ *s = ibs_pcpu->samples[ibs_pcpu->tail];
+ ibs_pcpu->tail = next;
+ return 1;
+}
+
+/*
+ * Remove access samples from percpu buffer and send them
+ * to pghot sub-system for further action.
+ */
+static void ibs_work_handler(struct work_struct *work)
+{
+ struct ibs_sample s;
+
+ while (ibs_pop_sample(&s))
+ pghot_record_access(s.pfn, s.nid, PGHOT_HW_HINTS, s.time);
+}
+
+static void ibs_irq_handler(struct irq_work *i)
+{
+ schedule_work_on(smp_processor_id(), &ibs_work);
+}
+
+/*
+ * IBS NMI handler: Process the memory access info reported by IBS.
+ *
+ * Reads the MSRs to collect all the information about the reported
+ * memory access, validates the access, stores the valid sample and
+ * schedules the work on this CPU to further process the sample.
+ */
+static int ibs_overflow_handler(unsigned int cmd, struct pt_regs *regs)
+{
+ struct mm_struct *mm = current->mm;
+ u64 ops_ctl, ops_data3, ops_data2;
+ u64 laddr = -1, paddr = -1;
+ u64 data_src, rmt_node;
+ struct page *page;
+ unsigned long pfn;
+
+ rdmsrl(MSR_AMD64_IBSOPCTL, ops_ctl);
+
+ /*
+ * When IBS sampling period is reprogrammed via read-modify-update
+ * of MSR_AMD64_IBSOPCTL, overflow NMIs could be generated with
+ * IBS_OP_ENABLE not set. For such cases, return as HANDLED.
+ *
+ * With this, the handler will say "handled" for all NMIs that
+ * aren't related to this NMI. This stems from the limitation of
+ * having both status and control bits in one MSR.
+ */
+ if (!(ops_ctl & IBS_OP_VAL))
+ goto handled;
+
+ wrmsrl(MSR_AMD64_IBSOPCTL, ops_ctl & ~IBS_OP_VAL);
+
+ count_vm_event(HWHINT_NR_EVENTS);
+
+ if (!user_mode(regs)) {
+ count_vm_event(HWHINT_KERNEL);
+ goto handled;
+ }
+
+ if (!mm) {
+ count_vm_event(HWHINT_KTHREAD);
+ goto handled;
+ }
+
+ rdmsrl(MSR_AMD64_IBSOPDATA3, ops_data3);
+
+ /* Load/Store ops only */
+ /* TODO: DataSrc isn't valid for stores, so filter out stores? */
+ if (!(ops_data3 & (MSR_AMD64_IBSOPDATA3_LDOP |
+ MSR_AMD64_IBSOPDATA3_STOP))) {
+ count_vm_event(HWHINT_NON_LOAD_STORES);
+ goto handled;
+ }
+
+ /* Discard the sample if it was L1 or L2 hit */
+ if (!(ops_data3 & (MSR_AMD64_IBSOPDATA3_DCMISS |
+ MSR_AMD64_IBSOPDATA3_L2MISS))) {
+ count_vm_event(HWHINT_DC_L2_HITS);
+ goto handled;
+ }
+
+ rdmsrl(MSR_AMD64_IBSOPDATA2, ops_data2);
+ data_src = ops_data2 & MSR_AMD64_IBSOPDATA2_DATASRC;
+ if (ibs_caps & IBS_CAPS_ZEN4)
+ data_src |= ((ops_data2 & 0xC0) >> 3);
+
+ switch (data_src) {
+ case MSR_AMD64_IBSOPDATA2_DATASRC_LCL_CACHE:
+ count_vm_event(HWHINT_LOCAL_L3L1L2);
+ break;
+ case MSR_AMD64_IBSOPDATA2_DATASRC_PEER_CACHE_NEAR:
+ count_vm_event(HWHINT_LOCAL_PEER_CACHE_NEAR);
+ break;
+ case MSR_AMD64_IBSOPDATA2_DATASRC_DRAM:
+ count_vm_event(HWHINT_DRAM_ACCESSES);
+ break;
+ case MSR_AMD64_IBSOPDATA2_DATASRC_EXT_MEM:
+ count_vm_event(HWHINT_CXL_ACCESSES);
+ break;
+ case MSR_AMD64_IBSOPDATA2_DATASRC_FAR_CCX_CACHE:
+ count_vm_event(HWHINT_FAR_CACHE_HITS);
+ break;
+ }
+
+ rmt_node = ops_data2 & MSR_AMD64_IBSOPDATA2_RMTNODE;
+ if (rmt_node)
+ count_vm_event(HWHINT_REMOTE_NODE);
+
+ /* Is linear addr valid? */
+ if (ops_data3 & MSR_AMD64_IBSOPDATA3_LADDR_VALID)
+ rdmsrl(MSR_AMD64_IBSDCLINAD, laddr);
+ else {
+ count_vm_event(HWHINT_LADDR_INVALID);
+ goto handled;
+ }
+
+ /* Discard kernel address accesses */
+ if (laddr & (1UL << 63)) {
+ count_vm_event(HWHINT_KERNEL_ADDR);
+ goto handled;
+ }
+
+ /* Is phys addr valid? */
+ if (ops_data3 & MSR_AMD64_IBSOPDATA3_PADDR_VALID)
+ rdmsrl(MSR_AMD64_IBSDCPHYSAD, paddr);
+ else {
+ count_vm_event(HWHINT_PADDR_INVALID);
+ goto handled;
+ }
+
+ pfn = PHYS_PFN(paddr);
+ page = pfn_to_online_page(pfn);
+ if (!page)
+ goto handled;
+
+ if (!PageLRU(page)) {
+ count_vm_event(HWHINT_NON_LRU);
+ goto handled;
+ }
+
+ if (!ibs_push_sample(pfn, numa_node_id(), jiffies)) {
+ count_vm_event(HWHINT_BUFFER_FULL);
+ goto handled;
+ }
+
+ irq_work_queue(&ibs_irq_work);
+ count_vm_event(HWHINT_USEFUL_SAMPLES);
+
+handled:
+ return NMI_HANDLED;
+}
+
+static inline int get_ibs_lvt_offset(void)
+{
+ u64 val;
+
+ rdmsrl(MSR_AMD64_IBSCTL, val);
+ if (!(val & IBSCTL_LVT_OFFSET_VALID))
+ return -EINVAL;
+
+ return val & IBSCTL_LVT_OFFSET_MASK;
+}
+
+static void setup_APIC_ibs(void)
+{
+ int offset;
+
+ offset = get_ibs_lvt_offset();
+ if (offset < 0)
+ goto failed;
+
+ if (!setup_APIC_eilvt(offset, 0, APIC_EILVT_MSG_NMI, 0))
+ return;
+failed:
+ pr_warn("IBS APIC setup failed on cpu #%d\n",
+ smp_processor_id());
+}
+
+static void clear_APIC_ibs(void)
+{
+ int offset;
+
+ offset = get_ibs_lvt_offset();
+ if (offset >= 0)
+ setup_APIC_eilvt(offset, 0, APIC_EILVT_MSG_FIX, 1);
+}
+
+static int x86_amd_ibs_access_profile_startup(unsigned int cpu)
+{
+ setup_APIC_ibs();
+ return 0;
+}
+
+static int x86_amd_ibs_access_profile_teardown(unsigned int cpu)
+{
+ clear_APIC_ibs();
+ return 0;
+}
+
+static int __init ibs_access_profiling_init(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_IBS)) {
+ pr_info("IBS capability is unavailable for access profiling\n");
+ return 0;
+ }
+
+ ibs_s = alloc_percpu_gfp(struct ibs_sample_pcpu, GFP_KERNEL | __GFP_ZERO);
+ if (!ibs_s)
+ return 0;
+
+ INIT_WORK(&ibs_work, ibs_work_handler);
+ init_irq_work(&ibs_irq_work, ibs_irq_handler);
+
+ /* Uses IBS Op sampling */
+ ibs_config = IBS_OP_CNT_CTL | IBS_OP_ENABLE;
+ ibs_caps = cpuid_eax(IBS_CPUID_FEATURES);
+ if (ibs_caps & IBS_CAPS_ZEN4)
+ ibs_config |= IBS_OP_L3MISSONLY;
+
+ register_nmi_handler(NMI_LOCAL, ibs_overflow_handler, 0, "ibs");
+
+ cpuhp_setup_state(CPUHP_AP_PERF_X86_AMD_IBS_STARTING,
+ "x86/amd/ibs_access_profile:starting",
+ x86_amd_ibs_access_profile_startup,
+ x86_amd_ibs_access_profile_teardown);
+
+ pr_info("IBS setup for memory access profiling\n");
+ return 0;
+}
+
+arch_initcall(ibs_access_profiling_init);
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 4731d667231d..557da365946c 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -192,6 +192,23 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
PGHOT_RECORD_HWHINTS,
PGHOT_RECORD_PGTSCANS,
PGHOT_RECORD_HINTFAULTS,
+ HWHINT_NR_EVENTS,
+ HWHINT_KERNEL,
+ HWHINT_KTHREAD,
+ HWHINT_NON_LOAD_STORES,
+ HWHINT_DC_L2_HITS,
+ HWHINT_LOCAL_L3L1L2,
+ HWHINT_LOCAL_PEER_CACHE_NEAR,
+ HWHINT_FAR_CACHE_HITS,
+ HWHINT_DRAM_ACCESSES,
+ HWHINT_CXL_ACCESSES,
+ HWHINT_REMOTE_NODE,
+ HWHINT_LADDR_INVALID,
+ HWHINT_KERNEL_ADDR,
+ HWHINT_PADDR_INVALID,
+ HWHINT_NON_LRU,
+ HWHINT_BUFFER_FULL,
+ HWHINT_USEFUL_SAMPLES,
NR_VM_EVENT_ITEMS
};
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 49d974f8e8b3..d99e736a561d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1500,6 +1500,23 @@ const char * const vmstat_text[] = {
[I(PGHOT_RECORD_HWHINTS)] = "pghot_recorded_hwhints",
[I(PGHOT_RECORD_PGTSCANS)] = "pghot_recorded_pgtscans",
[I(PGHOT_RECORD_HINTFAULTS)] = "pghot_recorded_hintfaults",
+ [I(HWHINT_NR_EVENTS)] = "hwhint_nr_events",
+ [I(HWHINT_KERNEL)] = "hwhint_kernel",
+ [I(HWHINT_KTHREAD)] = "hwhint_kthread",
+ [I(HWHINT_NON_LOAD_STORES)] = "hwhint_non_load_stores",
+ [I(HWHINT_DC_L2_HITS)] = "hwhint_dc_l2_hits",
+ [I(HWHINT_LOCAL_L3L1L2)] = "hwhint_local_l3l1l2",
+ [I(HWHINT_LOCAL_PEER_CACHE_NEAR)] = "hwhint_local_peer_cache_near",
+ [I(HWHINT_FAR_CACHE_HITS)] = "hwhint_far_cache_hits",
+ [I(HWHINT_DRAM_ACCESSES)] = "hwhint_dram_accesses",
+ [I(HWHINT_CXL_ACCESSES)] = "hwhint_cxl_accesses",
+ [I(HWHINT_REMOTE_NODE)] = "hwhint_remote_node",
+ [I(HWHINT_LADDR_INVALID)] = "hwhint_invalid_laddr",
+ [I(HWHINT_KERNEL_ADDR)] = "hwhint_kernel_addr",
+ [I(HWHINT_PADDR_INVALID)] = "hwhint_invalid_paddr",
+ [I(HWHINT_NON_LRU)] = "hwhint_non_lru",
+ [I(HWHINT_BUFFER_FULL)] = "hwhint_buffer_full",
+ [I(HWHINT_USEFUL_SAMPLES)] = "hwhint_useful_samples",
#undef I
#endif /* CONFIG_VM_EVENT_COUNTERS */
};
--
2.34.1
^ permalink raw reply [flat|nested] 12+ messages in thread
* [RFC PATCH v3 5/8] x86: ibs: Enable IBS profiling for memory accesses
2025-11-10 5:23 [RFC PATCH v3 0/8] mm: Hot page tracking and promotion infrastructure Bharata B Rao
` (3 preceding siblings ...)
2025-11-10 5:23 ` [RFC PATCH v3 4/8] x86: ibs: In-kernel IBS driver for memory access profiling Bharata B Rao
@ 2025-11-10 5:23 ` Bharata B Rao
2025-11-10 5:23 ` [RFC PATCH v3 6/8] mm: mglru: generalize page table walk Bharata B Rao
` (3 subsequent siblings)
8 siblings, 0 replies; 12+ messages in thread
From: Bharata B Rao @ 2025-11-10 5:23 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore, shivankg, Bharata B Rao
Enable IBS memory access data collection for user memory
accesses by programming the required MSRs. The profiling
is turned ON only for user mode execution and turned OFF
for kernel mode execution. Profiling is explicitly disabled
in the NMI handler too.
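As a worked example of the period programming done in
hw_access_profiling_start() below: the low 4 bits of the period
are not programmable, so with IBS_SAMPLE_PERIOD = 10000 the value
10000 >> 4 = 625 is written into the IbsOpMaxCnt field, giving an
effective sampling period of 625 * 16 = 10000 ops.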
TODOs:
- IBS sampling rate is kept fixed for now.
- Arch/vendor separation/isolation of the code needs a relook.
Signed-off-by: Bharata B Rao <bharata@amd.com>
---
arch/x86/include/asm/entry-common.h | 3 +++
arch/x86/include/asm/hardirq.h | 2 ++
arch/x86/include/asm/ibs.h | 2 ++
arch/x86/mm/ibs.c | 32 +++++++++++++++++++++++++++++
4 files changed, 39 insertions(+)
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index ce3eb6d5fdf9..46aafd34b945 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -9,10 +9,12 @@
#include <asm/io_bitmap.h>
#include <asm/fpu/api.h>
#include <asm/fred.h>
+#include <asm/ibs.h>
/* Check that the stack and regs on entry from user mode are sane. */
static __always_inline void arch_enter_from_user_mode(struct pt_regs *regs)
{
+ hw_access_profiling_stop();
if (IS_ENABLED(CONFIG_DEBUG_ENTRY)) {
/*
* Make sure that the entry code gave us a sensible EFLAGS
@@ -106,6 +108,7 @@ static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
static __always_inline void arch_exit_to_user_mode(void)
{
amd_clear_divider();
+ hw_access_profiling_start();
}
#define arch_exit_to_user_mode arch_exit_to_user_mode
diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index f00c09ffe6a9..0752cb6ebd7a 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -91,4 +91,6 @@ static __always_inline bool kvm_get_cpu_l1tf_flush_l1d(void)
static __always_inline void kvm_set_cpu_l1tf_flush_l1d(void) { }
#endif /* IS_ENABLED(CONFIG_KVM_INTEL) */
+#define arch_nmi_enter() hw_access_profiling_stop()
+#define arch_nmi_exit() hw_access_profiling_start()
#endif /* _ASM_X86_HARDIRQ_H */
diff --git a/arch/x86/include/asm/ibs.h b/arch/x86/include/asm/ibs.h
index b5a4f2ca6330..6b480958534e 100644
--- a/arch/x86/include/asm/ibs.h
+++ b/arch/x86/include/asm/ibs.h
@@ -2,6 +2,8 @@
#ifndef _ASM_X86_IBS_H
#define _ASM_X86_IBS_H
+void hw_access_profiling_start(void);
+void hw_access_profiling_stop(void);
extern bool arch_hw_access_profiling;
#endif /* _ASM_X86_IBS_H */
diff --git a/arch/x86/mm/ibs.c b/arch/x86/mm/ibs.c
index de2e506fce48..98aa9543c8ec 100644
--- a/arch/x86/mm/ibs.c
+++ b/arch/x86/mm/ibs.c
@@ -16,6 +16,7 @@ static u64 ibs_config __read_mostly;
static u32 ibs_caps;
#define IBS_NR_SAMPLES 150
+#define IBS_SAMPLE_PERIOD 10000
/*
* Basic access info captured for each memory access.
@@ -98,6 +99,36 @@ static void ibs_irq_handler(struct irq_work *i)
schedule_work_on(smp_processor_id(), &ibs_work);
}
+void hw_access_profiling_stop(void)
+{
+ u64 ops_ctl;
+
+ if (!arch_hw_access_profiling)
+ return;
+
+ rdmsrl(MSR_AMD64_IBSOPCTL, ops_ctl);
+ wrmsrl(MSR_AMD64_IBSOPCTL, ops_ctl & ~IBS_OP_ENABLE);
+}
+
+void hw_access_profiling_start(void)
+{
+ u64 config = 0;
+ unsigned int period = IBS_SAMPLE_PERIOD;
+
+ if (!arch_hw_access_profiling)
+ return;
+
+ /* Disable IBS for kernel thread */
+ if (!current->mm)
+ goto out;
+
+ config = (period >> 4) & IBS_OP_MAX_CNT;
+ config |= (period & IBS_OP_MAX_CNT_EXT_MASK);
+ config |= ibs_config;
+out:
+ wrmsrl(MSR_AMD64_IBSOPCTL, config);
+}
+
/*
* IBS NMI handler: Process the memory access info reported by IBS.
*
@@ -304,6 +335,7 @@ static int __init ibs_access_profiling_init(void)
x86_amd_ibs_access_profile_startup,
x86_amd_ibs_access_profile_teardown);
+ arch_hw_access_profiling = true;
pr_info("IBS setup for memory access profiling\n");
return 0;
}
--
2.34.1
^ permalink raw reply [flat|nested] 12+ messages in thread
* [RFC PATCH v3 6/8] mm: mglru: generalize page table walk
2025-11-10 5:23 [RFC PATCH v3 0/8] mm: Hot page tracking and promotion infrastructure Bharata B Rao
` (4 preceding siblings ...)
2025-11-10 5:23 ` [RFC PATCH v3 5/8] x86: ibs: Enable IBS profiling for memory accesses Bharata B Rao
@ 2025-11-10 5:23 ` Bharata B Rao
2025-11-10 5:23 ` [RFC PATCH v3 7/8] mm: klruscand: use mglru scanning for page promotion Bharata B Rao
` (2 subsequent siblings)
8 siblings, 0 replies; 12+ messages in thread
From: Bharata B Rao @ 2025-11-10 5:23 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore, shivankg, Bharata B Rao
From: Kinsey Ho <kinseyho@google.com>
Refactor the existing MGLRU page table walking logic to make it
resumable.
Additionally, introduce two hooks into the MGLRU page table walk:
an accessed callback and a flush callback. The accessed callback is
called for each accessed page detected via the scanned Accessed bit.
The flush callback is called when the accessed callback reports that
a flush is required. This allows pages to be processed in batches
for efficiency.
With the generalized page table walk in place, introduce a new scan
function which repeatedly scans the same young generation and does
not add a new young generation.
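As a rough sketch of how a consumer can plug into these hooks (the
callback names, bodies and the batch array below are illustrative
assumptions, not part of this patch; the real user is added in a
later patch of this series):
	static unsigned long pfn_batch[64];
	static int pfn_cnt;

	/* return true to ask the walker to pause and flush */
	static bool scan_accessed_cb(unsigned long pfn)
	{
		pfn_batch[pfn_cnt++] = pfn;
		return pfn_cnt == ARRAY_SIZE(pfn_batch);
	}

	static void scan_flush_cb(void)
	{
		/* e.g. report the batch via pghot_record_access() */
		pfn_cnt = 0;
	}

	/* scan one lruvec at generation @seq, batching accessed PFNs */
	lru_gen_scan_lruvec(lruvec, seq, scan_accessed_cb, scan_flush_cb);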
Signed-off-by: Kinsey Ho <kinseyho@google.com>
Signed-off-by: Yuanchu Xie <yuanchu@google.com>
Signed-off-by: Bharata B Rao <bharata@amd.com>
---
include/linux/mmzone.h | 5 ++
mm/internal.h | 4 +
mm/vmscan.c | 181 +++++++++++++++++++++++++++++++----------
3 files changed, 145 insertions(+), 45 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fde851990394..421b012fb60c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -548,6 +548,8 @@ struct lru_gen_mm_walk {
unsigned long seq;
/* the next address within an mm to scan */
unsigned long next_addr;
+ /* called for each accessed pte/pmd */
+ bool (*accessed_cb)(unsigned long pfn);
/* to batch promoted pages */
int nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
/* to batch the mm stats */
@@ -555,6 +557,9 @@ struct lru_gen_mm_walk {
/* total batched items */
int batched;
int swappiness;
+ /* for the pmd under scanning */
+ int nr_young_pte;
+ int nr_total_pte;
bool force_scan;
};
diff --git a/mm/internal.h b/mm/internal.h
index 1561fc2ff5b8..531104a96c51 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -538,6 +538,10 @@ extern unsigned long highest_memmap_pfn;
bool folio_isolate_lru(struct folio *folio);
void folio_putback_lru(struct folio *folio);
extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason);
+void set_task_reclaim_state(struct task_struct *task,
+ struct reclaim_state *rs);
+void lru_gen_scan_lruvec(struct lruvec *lruvec, unsigned long seq,
+ bool (*accessed_cb)(unsigned long), void (*flush_cb)(void));
#ifdef CONFIG_NUMA
int user_proactive_reclaim(char *buf,
struct mem_cgroup *memcg, pg_data_t *pgdat);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b2fc8b626d3d..1bb637fd6e5e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -289,7 +289,7 @@ static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
continue; \
else
-static void set_task_reclaim_state(struct task_struct *task,
+void set_task_reclaim_state(struct task_struct *task,
struct reclaim_state *rs)
{
/* Check for an overwrite */
@@ -3093,7 +3093,7 @@ static bool iterate_mm_list(struct lru_gen_mm_walk *walk, struct mm_struct **ite
VM_WARN_ON_ONCE(mm_state->seq + 1 < walk->seq);
- if (walk->seq <= mm_state->seq)
+ if (!walk->accessed_cb && walk->seq <= mm_state->seq)
goto done;
if (!mm_state->head)
@@ -3519,16 +3519,14 @@ static void walk_update_folio(struct lru_gen_mm_walk *walk, struct folio *folio,
}
}
-static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
- struct mm_walk *args)
+static int walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
+ struct mm_walk *args, bool *suitable)
{
int i;
bool dirty;
pte_t *pte;
spinlock_t *ptl;
unsigned long addr;
- int total = 0;
- int young = 0;
struct folio *last = NULL;
struct lru_gen_mm_walk *walk = args->private;
struct mem_cgroup *memcg = lruvec_memcg(walk->lruvec);
@@ -3536,19 +3534,24 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
DEFINE_MAX_SEQ(walk->lruvec);
int gen = lru_gen_from_seq(max_seq);
pmd_t pmdval;
+ int err = 0;
pte = pte_offset_map_rw_nolock(args->mm, pmd, start & PMD_MASK, &pmdval, &ptl);
- if (!pte)
- return false;
+ if (!pte) {
+ *suitable = false;
+ return err;
+ }
if (!spin_trylock(ptl)) {
pte_unmap(pte);
- return true;
+ *suitable = true;
+ return err;
}
if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
pte_unmap_unlock(pte, ptl);
- return false;
+ *suitable = false;
+ return err;
}
arch_enter_lazy_mmu_mode();
@@ -3557,8 +3560,9 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
unsigned long pfn;
struct folio *folio;
pte_t ptent = ptep_get(pte + i);
+ bool do_flush;
- total++;
+ walk->nr_total_pte++;
walk->mm_stats[MM_LEAF_TOTAL]++;
pfn = get_pte_pfn(ptent, args->vma, addr, pgdat);
@@ -3582,23 +3586,36 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
if (pte_dirty(ptent))
dirty = true;
- young++;
+ walk->nr_young_pte++;
walk->mm_stats[MM_LEAF_YOUNG]++;
+
+ if (!walk->accessed_cb)
+ continue;
+
+ do_flush = walk->accessed_cb(pfn);
+ if (do_flush) {
+ walk->next_addr = addr + PAGE_SIZE;
+
+ err = -EAGAIN;
+ break;
+ }
}
walk_update_folio(walk, last, gen, dirty);
last = NULL;
- if (i < PTRS_PER_PTE && get_next_vma(PMD_MASK, PAGE_SIZE, args, &start, &end))
+ if (!err && i < PTRS_PER_PTE &&
+ get_next_vma(PMD_MASK, PAGE_SIZE, args, &start, &end))
goto restart;
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte, ptl);
- return suitable_to_scan(total, young);
+ *suitable = suitable_to_scan(walk->nr_total_pte, walk->nr_young_pte);
+ return err;
}
-static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area_struct *vma,
+static int walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area_struct *vma,
struct mm_walk *args, unsigned long *bitmap, unsigned long *first)
{
int i;
@@ -3611,6 +3628,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
DEFINE_MAX_SEQ(walk->lruvec);
int gen = lru_gen_from_seq(max_seq);
+ int err = 0;
VM_WARN_ON_ONCE(pud_leaf(*pud));
@@ -3618,13 +3636,13 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
if (*first == -1) {
*first = addr;
bitmap_zero(bitmap, MIN_LRU_BATCH);
- return;
+ return err;
}
i = addr == -1 ? 0 : pmd_index(addr) - pmd_index(*first);
if (i && i <= MIN_LRU_BATCH) {
__set_bit(i - 1, bitmap);
- return;
+ return err;
}
pmd = pmd_offset(pud, *first);
@@ -3638,6 +3656,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
do {
unsigned long pfn;
struct folio *folio;
+ bool do_flush;
/* don't round down the first address */
addr = i ? (*first & PMD_MASK) + i * PMD_SIZE : *first;
@@ -3674,6 +3693,17 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
dirty = true;
walk->mm_stats[MM_LEAF_YOUNG]++;
+ if (!walk->accessed_cb)
+ goto next;
+
+ do_flush = walk->accessed_cb(pfn);
+ if (do_flush) {
+ i = find_next_bit(bitmap, MIN_LRU_BATCH, i) + 1;
+
+ walk->next_addr = (*first & PMD_MASK) + i * PMD_SIZE;
+ err = -EAGAIN;
+ break;
+ }
next:
i = i > MIN_LRU_BATCH ? 0 : find_next_bit(bitmap, MIN_LRU_BATCH, i) + 1;
} while (i <= MIN_LRU_BATCH);
@@ -3684,9 +3714,10 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
spin_unlock(ptl);
done:
*first = -1;
+ return err;
}
-static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
+static int walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
struct mm_walk *args)
{
int i;
@@ -3698,6 +3729,7 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
unsigned long first = -1;
struct lru_gen_mm_walk *walk = args->private;
struct lru_gen_mm_state *mm_state = get_mm_state(walk->lruvec);
+ int err = 0;
VM_WARN_ON_ONCE(pud_leaf(*pud));
@@ -3711,6 +3743,7 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
/* walk_pte_range() may call get_next_vma() */
vma = args->vma;
for (i = pmd_index(start), addr = start; addr != end; i++, addr = next) {
+ bool suitable;
pmd_t val = pmdp_get_lockless(pmd + i);
next = pmd_addr_end(addr, end);
@@ -3727,7 +3760,10 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
walk->mm_stats[MM_LEAF_TOTAL]++;
if (pfn != -1)
- walk_pmd_range_locked(pud, addr, vma, args, bitmap, &first);
+ err = walk_pmd_range_locked(pud, addr, vma, args,
+ bitmap, &first);
+ if (err)
+ return err;
continue;
}
@@ -3736,33 +3772,51 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
if (!pmd_young(val))
continue;
- walk_pmd_range_locked(pud, addr, vma, args, bitmap, &first);
+ err = walk_pmd_range_locked(pud, addr, vma, args,
+ bitmap, &first);
+ if (err)
+ return err;
}
if (!walk->force_scan && !test_bloom_filter(mm_state, walk->seq, pmd + i))
continue;
+ err = walk_pte_range(&val, addr, next, args, &suitable);
+ if (err && walk->next_addr < next && first == -1)
+ return err;
+
+ walk->nr_total_pte = 0;
+ walk->nr_young_pte = 0;
+
walk->mm_stats[MM_NONLEAF_FOUND]++;
- if (!walk_pte_range(&val, addr, next, args))
- continue;
+ if (!suitable)
+ goto next;
walk->mm_stats[MM_NONLEAF_ADDED]++;
/* carry over to the next generation */
update_bloom_filter(mm_state, walk->seq + 1, pmd + i);
+next:
+ if (err) {
+ walk->next_addr = first;
+ return err;
+ }
}
- walk_pmd_range_locked(pud, -1, vma, args, bitmap, &first);
+ err = walk_pmd_range_locked(pud, -1, vma, args, bitmap, &first);
- if (i < PTRS_PER_PMD && get_next_vma(PUD_MASK, PMD_SIZE, args, &start, &end))
+ if (!err && i < PTRS_PER_PMD &&
+ get_next_vma(PUD_MASK, PMD_SIZE, args, &start, &end))
goto restart;
+
+ return err;
}
static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
struct mm_walk *args)
{
- int i;
+ int i, err;
pud_t *pud;
unsigned long addr;
unsigned long next;
@@ -3780,7 +3834,9 @@ static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
if (!pud_present(val) || WARN_ON_ONCE(pud_leaf(val)))
continue;
- walk_pmd_range(&val, addr, next, args);
+ err = walk_pmd_range(&val, addr, next, args);
+ if (err)
+ return err;
if (need_resched() || walk->batched >= MAX_LRU_BATCH) {
end = (addr | ~PUD_MASK) + 1;
@@ -3801,40 +3857,48 @@ static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
return -EAGAIN;
}
-static void walk_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
+static int try_walk_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
{
+ int err;
static const struct mm_walk_ops mm_walk_ops = {
.test_walk = should_skip_vma,
.p4d_entry = walk_pud_range,
.walk_lock = PGWALK_RDLOCK,
};
- int err;
struct lruvec *lruvec = walk->lruvec;
- walk->next_addr = FIRST_USER_ADDRESS;
+ DEFINE_MAX_SEQ(lruvec);
- do {
- DEFINE_MAX_SEQ(lruvec);
+ err = -EBUSY;
- err = -EBUSY;
+ /* another thread might have called inc_max_seq() */
+ if (walk->seq != max_seq)
+ return err;
- /* another thread might have called inc_max_seq() */
- if (walk->seq != max_seq)
- break;
+ /* the caller might be holding the lock for write */
+ if (mmap_read_trylock(mm)) {
+ err = walk_page_range(mm, walk->next_addr, ULONG_MAX,
+ &mm_walk_ops, walk);
- /* the caller might be holding the lock for write */
- if (mmap_read_trylock(mm)) {
- err = walk_page_range(mm, walk->next_addr, ULONG_MAX, &mm_walk_ops, walk);
+ mmap_read_unlock(mm);
+ }
- mmap_read_unlock(mm);
- }
+ if (walk->batched) {
+ spin_lock_irq(&lruvec->lru_lock);
+ reset_batch_size(walk);
+ spin_unlock_irq(&lruvec->lru_lock);
+ }
- if (walk->batched) {
- spin_lock_irq(&lruvec->lru_lock);
- reset_batch_size(walk);
- spin_unlock_irq(&lruvec->lru_lock);
- }
+ return err;
+}
+
+static void walk_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
+{
+ int err;
+ walk->next_addr = FIRST_USER_ADDRESS;
+ do {
+ err = try_walk_mm(mm, walk);
cond_resched();
} while (err == -EAGAIN);
}
@@ -4046,6 +4110,33 @@ static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq, int swappiness
return success;
}
+void lru_gen_scan_lruvec(struct lruvec *lruvec, unsigned long seq,
+ bool (*accessed_cb)(unsigned long), void (*flush_cb)(void))
+{
+ struct lru_gen_mm_walk *walk = current->reclaim_state->mm_walk;
+ struct mm_struct *mm = NULL;
+
+ walk->lruvec = lruvec;
+ walk->seq = seq;
+ walk->accessed_cb = accessed_cb;
+ walk->swappiness = MAX_SWAPPINESS;
+
+ do {
+ int err = -EBUSY;
+
+ iterate_mm_list(walk, &mm);
+ if (!mm)
+ break;
+
+ walk->next_addr = FIRST_USER_ADDRESS;
+ do {
+ err = try_walk_mm(mm, walk);
+ cond_resched();
+ flush_cb();
+ } while (err == -EAGAIN);
+ } while (mm);
+}
+
static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq,
int swappiness, bool force_scan)
{
--
2.34.1
^ permalink raw reply	[flat|nested] 12+ messages in thread
* [RFC PATCH v3 7/8] mm: klruscand: use mglru scanning for page promotion
2025-11-10 5:23 [RFC PATCH v3 0/8] mm: Hot page tracking and promotion infrastructure Bharata B Rao
` (5 preceding siblings ...)
2025-11-10 5:23 ` [RFC PATCH v3 6/8] mm: mglru: generalize page table walk Bharata B Rao
@ 2025-11-10 5:23 ` Bharata B Rao
2025-11-10 5:23 ` [RFC PATCH v3 8/8] mm: sched: Move hot page promotion from NUMAB=2 to pghot tracking Bharata B Rao
2025-11-19 13:06 ` [RFC PATCH v3 0/8] mm: Hot page tracking and promotion infrastructure Bharata B Rao
8 siblings, 0 replies; 12+ messages in thread
From: Bharata B Rao @ 2025-11-10 5:23 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore, shivankg, Bharata B Rao
From: Kinsey Ho <kinseyho@google.com>
Introduce a new kernel daemon, klruscand, that periodically invokes the
MGLRU page table walk. It leverages the new callbacks to gather access
information and forwards it to the pghot sub-system for promotion decisions.
This benefits from reusing the existing MGLRU page table walk
infrastructure, which is optimized with features such as hierarchical
scanning and bloom filters to reduce CPU overhead.
As a future optimization, the scan interval can be tuned per memcg.
Signed-off-by: Kinsey Ho <kinseyho@google.com>
Signed-off-by: Yuanchu Xie <yuanchu@google.com>
[Reduced the scan interval to 500ms, KLRUSCAND to default n in config]
Signed-off-by: Bharata B Rao <bharata@amd.com>
---
mm/Kconfig | 8 ++++
mm/Makefile | 1 +
mm/klruscand.c | 110 +++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 119 insertions(+)
create mode 100644 mm/klruscand.c
diff --git a/mm/Kconfig b/mm/Kconfig
index b5e84cb50253..84ec9a9aca13 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1383,6 +1383,14 @@ config PGHOT
by various sources. Asynchronous promotion is done by per-node
kernel threads.
+config KLRUSCAND
+ bool "Kernel lower tier access scan daemon"
+ default n
+ depends on PGHOT && LRU_GEN_WALKS_MMU
+ help
+ Scan for accesses from lower tiers by invoking MGLRU to perform
+ page table walks.
+
source "mm/damon/Kconfig"
endmenu
diff --git a/mm/Makefile b/mm/Makefile
index a6fac171c36e..1c0c79fec106 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -147,3 +147,4 @@ obj-$(CONFIG_EXECMEM) += execmem.o
obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
obj-$(CONFIG_PT_RECLAIM) += pt_reclaim.o
obj-$(CONFIG_PGHOT) += pghot.o
+obj-$(CONFIG_KLRUSCAND) += klruscand.o
diff --git a/mm/klruscand.c b/mm/klruscand.c
new file mode 100644
index 000000000000..13a41b38d67d
--- /dev/null
+++ b/mm/klruscand.c
@@ -0,0 +1,110 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/memcontrol.h>
+#include <linux/kthread.h>
+#include <linux/module.h>
+#include <linux/vmalloc.h>
+#include <linux/memory-tiers.h>
+#include <linux/pghot.h>
+
+#include "internal.h"
+
+#define KLRUSCAND_INTERVAL 500
+#define BATCH_SIZE (2 << 16)
+
+static struct task_struct *scan_thread;
+static unsigned long pfn_batch[BATCH_SIZE];
+static int batch_index;
+
+static void flush_cb(void)
+{
+ int i;
+
+ for (i = 0; i < batch_index; i++) {
+ unsigned long pfn = pfn_batch[i];
+
+ pghot_record_access(pfn, NUMA_NO_NODE, PGHOT_PGTABLE_SCAN, jiffies);
+
+ if (i % 16 == 0)
+ cond_resched();
+ }
+ batch_index = 0;
+}
+
+static bool accessed_cb(unsigned long pfn)
+{
+ WARN_ON_ONCE(batch_index == BATCH_SIZE);
+
+ if (batch_index < BATCH_SIZE)
+ pfn_batch[batch_index++] = pfn;
+
+ return batch_index == BATCH_SIZE;
+}
+
+static int klruscand_run(void *unused)
+{
+ struct lru_gen_mm_walk *walk;
+
+ walk = kzalloc(sizeof(*walk),
+ __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
+ if (!walk)
+ return -ENOMEM;
+
+ while (!kthread_should_stop()) {
+ unsigned long next_wake_time;
+ long sleep_time;
+ struct mem_cgroup *memcg;
+ int flags;
+ int nid;
+
+ next_wake_time = jiffies + msecs_to_jiffies(KLRUSCAND_INTERVAL);
+
+ for_each_node_state(nid, N_MEMORY) {
+ pg_data_t *pgdat = NODE_DATA(nid);
+ struct reclaim_state rs = { 0 };
+
+ if (node_is_toptier(nid))
+ continue;
+
+ rs.mm_walk = walk;
+ set_task_reclaim_state(current, &rs);
+ flags = memalloc_noreclaim_save();
+
+ memcg = mem_cgroup_iter(NULL, NULL, NULL);
+ do {
+ struct lruvec *lruvec =
+ mem_cgroup_lruvec(memcg, pgdat);
+ unsigned long max_seq =
+ READ_ONCE((lruvec)->lrugen.max_seq);
+
+ lru_gen_scan_lruvec(lruvec, max_seq, accessed_cb, flush_cb);
+ cond_resched();
+ } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+ memalloc_noreclaim_restore(flags);
+ set_task_reclaim_state(current, NULL);
+ memset(walk, 0, sizeof(*walk));
+ }
+
+ sleep_time = next_wake_time - jiffies;
+ if (sleep_time > 0 && sleep_time != MAX_SCHEDULE_TIMEOUT)
+ schedule_timeout_idle(sleep_time);
+ }
+ kfree(walk);
+ return 0;
+}
+
+static int __init klruscand_init(void)
+{
+ struct task_struct *task;
+
+ task = kthread_run(klruscand_run, NULL, "klruscand");
+
+ if (IS_ERR(task)) {
+ pr_err("Failed to create klruscand kthread\n");
+ return PTR_ERR(task);
+ }
+
+ scan_thread = task;
+ return 0;
+}
+module_init(klruscand_init);
--
2.34.1
^ permalink raw reply	[flat|nested] 12+ messages in thread
* [RFC PATCH v3 8/8] mm: sched: Move hot page promotion from NUMAB=2 to pghot tracking
2025-11-10 5:23 [RFC PATCH v3 0/8] mm: Hot page tracking and promotion infrastructure Bharata B Rao
` (6 preceding siblings ...)
2025-11-10 5:23 ` [RFC PATCH v3 7/8] mm: klruscand: use mglru scanning for page promotion Bharata B Rao
@ 2025-11-10 5:23 ` Bharata B Rao
2025-11-19 13:06 ` [RFC PATCH v3 0/8] mm: Hot page tracking and promotion infrastructure Bharata B Rao
8 siblings, 0 replies; 12+ messages in thread
From: Bharata B Rao @ 2025-11-10 5:23 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore, shivankg, Bharata B Rao
Currently hot page promotion (the NUMA_BALANCING_MEMORY_TIERING
mode of NUMA Balancing) performs hot page detection (via hint
faults), hot page classification and the eventual promotion all by
itself, and that logic sits within the scheduler.
With the new hot page tracking and promotion mechanism now
available, NUMA Balancing can limit itself to the detection of
hot pages (via hint faults) and offload the rest of the
functionality to the common hot page tracking system.
The pghot_record_access(PGHOT_HINT_FAULT) API is used to feed the
hot page information. In addition, the migration rate limiting and
dynamic threshold logic are moved to kmigrated so that they can be
used for hot pages reported by other sources too.
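For reference, a hotness source feeds an access record roughly as below
(a minimal sketch that mirrors the call sites added by this patch; the
wrapper function is illustrative only, not a new API):

  #include <linux/jiffies.h>
  #include <linux/mm.h>
  #include <linux/pghot.h>

  /*
   * Illustrative wrapper: report a NUMA hint fault access on @folio so
   * that the pghot sub-system can classify it and kmigrated can promote
   * it later.
   */
  static void report_hint_fault_access(struct folio *folio, int nid)
  {
  	pghot_record_access(folio_pfn(folio), nid, PGHOT_HINT_FAULT, jiffies);
  }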
Signed-off-by: Bharata B Rao <bharata@amd.com>
---
include/linux/pghot.h | 3 +
kernel/sched/debug.c | 1 -
kernel/sched/fair.c | 152 ++----------------------------------------
mm/huge_memory.c | 26 ++------
mm/memory.c | 31 ++-------
mm/pghot.c | 129 ++++++++++++++++++++++++++++++++++-
6 files changed, 147 insertions(+), 195 deletions(-)
diff --git a/include/linux/pghot.h b/include/linux/pghot.h
index 7238ddf18a35..f42b21b61461 100644
--- a/include/linux/pghot.h
+++ b/include/linux/pghot.h
@@ -42,6 +42,9 @@ enum pghot_src {
#define PGHOT_FREQ_MAX (1 << PGHOT_FREQ_WIDTH)
#define PGHOT_TIME_MAX (1 << PGHOT_TIME_WIDTH)
+#define KMIGRATED_MIGRATION_ADJUST_STEPS 16
+#define KMIGRATED_PROMOTION_THRESHOLD_WINDOW 60000
+
int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now);
#else
static inline int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 02e16b70a790..10dc3c996806 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -520,7 +520,6 @@ static __init int sched_init_debug(void)
debugfs_create_u32("scan_period_min_ms", 0644, numa, &sysctl_numa_balancing_scan_period_min);
debugfs_create_u32("scan_period_max_ms", 0644, numa, &sysctl_numa_balancing_scan_period_max);
debugfs_create_u32("scan_size_mb", 0644, numa, &sysctl_numa_balancing_scan_size);
- debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing_hot_threshold);
#endif /* CONFIG_NUMA_BALANCING */
debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 25970dbbb279..31ab33e85cd1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -125,11 +125,6 @@ int __weak arch_asym_cpu_priority(int cpu)
static unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
#endif
-#ifdef CONFIG_NUMA_BALANCING
-/* Restrict the NUMA promotion throughput (MB/s) for each target node. */
-static unsigned int sysctl_numa_balancing_promote_rate_limit = 65536;
-#endif
-
#ifdef CONFIG_SYSCTL
static const struct ctl_table sched_fair_sysctls[] = {
#ifdef CONFIG_CFS_BANDWIDTH
@@ -142,16 +137,6 @@ static const struct ctl_table sched_fair_sysctls[] = {
.extra1 = SYSCTL_ONE,
},
#endif
-#ifdef CONFIG_NUMA_BALANCING
- {
- .procname = "numa_balancing_promote_rate_limit_MBps",
- .data = &sysctl_numa_balancing_promote_rate_limit,
- .maxlen = sizeof(unsigned int),
- .mode = 0644,
- .proc_handler = proc_dointvec_minmax,
- .extra1 = SYSCTL_ZERO,
- },
-#endif /* CONFIG_NUMA_BALANCING */
};
static int __init sched_fair_sysctl_init(void)
@@ -1443,9 +1428,6 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
/* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
unsigned int sysctl_numa_balancing_scan_delay = 1000;
-/* The page with hint page fault latency < threshold in ms is considered hot */
-unsigned int sysctl_numa_balancing_hot_threshold = MSEC_PER_SEC;
-
struct numa_group {
refcount_t refcount;
@@ -1800,108 +1782,6 @@ static inline bool cpupid_valid(int cpupid)
return cpupid_to_cpu(cpupid) < nr_cpu_ids;
}
-/*
- * For memory tiering mode, if there are enough free pages (more than
- * enough watermark defined here) in fast memory node, to take full
- * advantage of fast memory capacity, all recently accessed slow
- * memory pages will be migrated to fast memory node without
- * considering hot threshold.
- */
-static bool pgdat_free_space_enough(struct pglist_data *pgdat)
-{
- int z;
- unsigned long enough_wmark;
-
- enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT,
- pgdat->node_present_pages >> 4);
- for (z = pgdat->nr_zones - 1; z >= 0; z--) {
- struct zone *zone = pgdat->node_zones + z;
-
- if (!populated_zone(zone))
- continue;
-
- if (zone_watermark_ok(zone, 0,
- promo_wmark_pages(zone) + enough_wmark,
- ZONE_MOVABLE, 0))
- return true;
- }
- return false;
-}
-
-/*
- * For memory tiering mode, when page tables are scanned, the scan
- * time will be recorded in struct page in addition to make page
- * PROT_NONE for slow memory page. So when the page is accessed, in
- * hint page fault handler, the hint page fault latency is calculated
- * via,
- *
- * hint page fault latency = hint page fault time - scan time
- *
- * The smaller the hint page fault latency, the higher the possibility
- * for the page to be hot.
- */
-static int numa_hint_fault_latency(struct folio *folio)
-{
- int last_time, time;
-
- time = jiffies_to_msecs(jiffies);
- last_time = folio_xchg_access_time(folio, time);
-
- return (time - last_time) & PAGE_ACCESS_TIME_MASK;
-}
-
-/*
- * For memory tiering mode, too high promotion/demotion throughput may
- * hurt application latency. So we provide a mechanism to rate limit
- * the number of pages that are tried to be promoted.
- */
-static bool numa_promotion_rate_limit(struct pglist_data *pgdat,
- unsigned long rate_limit, int nr)
-{
- unsigned long nr_cand;
- unsigned int now, start;
-
- now = jiffies_to_msecs(jiffies);
- mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
- nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
- start = pgdat->nbp_rl_start;
- if (now - start > MSEC_PER_SEC &&
- cmpxchg(&pgdat->nbp_rl_start, start, now) == start)
- pgdat->nbp_rl_nr_cand = nr_cand;
- if (nr_cand - pgdat->nbp_rl_nr_cand >= rate_limit)
- return true;
- return false;
-}
-
-#define NUMA_MIGRATION_ADJUST_STEPS 16
-
-static void numa_promotion_adjust_threshold(struct pglist_data *pgdat,
- unsigned long rate_limit,
- unsigned int ref_th)
-{
- unsigned int now, start, th_period, unit_th, th;
- unsigned long nr_cand, ref_cand, diff_cand;
-
- now = jiffies_to_msecs(jiffies);
- th_period = sysctl_numa_balancing_scan_period_max;
- start = pgdat->nbp_th_start;
- if (now - start > th_period &&
- cmpxchg(&pgdat->nbp_th_start, start, now) == start) {
- ref_cand = rate_limit *
- sysctl_numa_balancing_scan_period_max / MSEC_PER_SEC;
- nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
- diff_cand = nr_cand - pgdat->nbp_th_nr_cand;
- unit_th = ref_th * 2 / NUMA_MIGRATION_ADJUST_STEPS;
- th = pgdat->nbp_threshold ? : ref_th;
- if (diff_cand > ref_cand * 11 / 10)
- th = max(th - unit_th, unit_th);
- else if (diff_cand < ref_cand * 9 / 10)
- th = min(th + unit_th, ref_th * 2);
- pgdat->nbp_th_nr_cand = nr_cand;
- pgdat->nbp_threshold = th;
- }
-}
-
bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
int src_nid, int dst_cpu)
{
@@ -1917,33 +1797,11 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
/*
* The pages in slow memory node should be migrated according
- * to hot/cold instead of private/shared.
- */
- if (folio_use_access_time(folio)) {
- struct pglist_data *pgdat;
- unsigned long rate_limit;
- unsigned int latency, th, def_th;
- long nr = folio_nr_pages(folio);
-
- pgdat = NODE_DATA(dst_nid);
- if (pgdat_free_space_enough(pgdat)) {
- /* workload changed, reset hot threshold */
- pgdat->nbp_threshold = 0;
- mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE_NRL, nr);
- return true;
- }
-
- def_th = sysctl_numa_balancing_hot_threshold;
- rate_limit = MB_TO_PAGES(sysctl_numa_balancing_promote_rate_limit);
- numa_promotion_adjust_threshold(pgdat, rate_limit, def_th);
-
- th = pgdat->nbp_threshold ? : def_th;
- latency = numa_hint_fault_latency(folio);
- if (latency >= th)
- return false;
-
- return !numa_promotion_rate_limit(pgdat, rate_limit, nr);
- }
+ * to hot/cold instead of private/shared. Also the migration
+ * of such pages are handled by kmigrated.
+ */
+ if (folio_use_access_time(folio))
+ return true;
this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
last_cpupid = folio_xchg_last_cpupid(folio, this_cpupid);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1d1b74950332..4a0b7fb195e5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -39,6 +39,7 @@
#include <linux/compat.h>
#include <linux/pgalloc_tag.h>
#include <linux/pagewalk.h>
+#include <linux/pghot.h>
#include <asm/tlb.h>
#include <asm/pgalloc.h>
@@ -2050,29 +2051,12 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
target_nid = numa_migrate_check(folio, vmf, haddr, &flags, writable,
&last_cpupid);
+ nid = target_nid;
if (target_nid == NUMA_NO_NODE)
goto out_map;
- if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
- flags |= TNF_MIGRATE_FAIL;
- goto out_map;
- }
- /* The folio is isolated and isolation code holds a folio reference. */
- spin_unlock(vmf->ptl);
- writable = false;
- if (!migrate_misplaced_folio(folio, target_nid)) {
- flags |= TNF_MIGRATED;
- nid = target_nid;
- task_numa_fault(last_cpupid, nid, HPAGE_PMD_NR, flags);
- return 0;
- }
+ writable = false;
- flags |= TNF_MIGRATE_FAIL;
- vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
- if (unlikely(!pmd_same(pmdp_get(vmf->pmd), vmf->orig_pmd))) {
- spin_unlock(vmf->ptl);
- return 0;
- }
out_map:
/* Restore the PMD */
pmd = pmd_modify(pmdp_get(vmf->pmd), vma->vm_page_prot);
@@ -2083,8 +2067,10 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
spin_unlock(vmf->ptl);
- if (nid != NUMA_NO_NODE)
+ if (nid != NUMA_NO_NODE) {
+ pghot_record_access(folio_pfn(folio), nid, PGHOT_HINT_FAULT, jiffies);
task_numa_fault(last_cpupid, nid, HPAGE_PMD_NR, flags);
+ }
return 0;
}
diff --git a/mm/memory.c b/mm/memory.c
index 74b45e258323..435fde53c993 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -74,6 +74,7 @@
#include <linux/perf_event.h>
#include <linux/ptrace.h>
#include <linux/vmalloc.h>
+#include <linux/pghot.h>
#include <linux/sched/sysctl.h>
#include <trace/events/kmem.h>
@@ -5989,34 +5990,12 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
target_nid = numa_migrate_check(folio, vmf, vmf->address, &flags,
writable, &last_cpupid);
+ nid = target_nid;
if (target_nid == NUMA_NO_NODE)
goto out_map;
- if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
- flags |= TNF_MIGRATE_FAIL;
- goto out_map;
- }
- /* The folio is isolated and isolation code holds a folio reference. */
- pte_unmap_unlock(vmf->pte, vmf->ptl);
+
writable = false;
ignore_writable = true;
-
- /* Migrate to the requested node */
- if (!migrate_misplaced_folio(folio, target_nid)) {
- nid = target_nid;
- flags |= TNF_MIGRATED;
- task_numa_fault(last_cpupid, nid, nr_pages, flags);
- return 0;
- }
-
- flags |= TNF_MIGRATE_FAIL;
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
- vmf->address, &vmf->ptl);
- if (unlikely(!vmf->pte))
- return 0;
- if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
- pte_unmap_unlock(vmf->pte, vmf->ptl);
- return 0;
- }
out_map:
/*
* Make it present again, depending on how arch implements
@@ -6030,8 +6009,10 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
writable);
pte_unmap_unlock(vmf->pte, vmf->ptl);
- if (nid != NUMA_NO_NODE)
+ if (nid != NUMA_NO_NODE) {
+ pghot_record_access(folio_pfn(folio), nid, PGHOT_HINT_FAULT, jiffies);
task_numa_fault(last_cpupid, nid, nr_pages, flags);
+ }
return 0;
}
diff --git a/mm/pghot.c b/mm/pghot.c
index 7c1a32f8a7ba..07bf987ca6f9 100644
--- a/mm/pghot.c
+++ b/mm/pghot.c
@@ -12,6 +12,9 @@
* the hot pages. kmigrated runs for each lower tier node. It iterates
* over the node's PFNs and migrates pages marked for migration into
* their targeted nodes.
+ *
+ * Migration rate-limiting and dynamic threshold logic implementations
+ * were moved from NUMA Balancing mode 2.
*/
#include <linux/mm.h>
#include <linux/migrate.h>
@@ -19,6 +22,8 @@
#include <linux/cpuhotplug.h>
#include <linux/pghot.h>
+/* Restrict the NUMA promotion throughput (MB/s) for each target node. */
+static unsigned int sysctl_pghot_promote_rate_limit = 65536;
static unsigned int sysctl_pghot_freq_window = PGHOT_FREQ_WINDOW;
/*
@@ -100,6 +105,14 @@ static const struct ctl_table pghot_sysctls[] = {
.proc_handler = proc_dointvec_minmax,
.extra1 = SYSCTL_ZERO,
},
+ {
+ .procname = "pghot_promote_rate_limit_MBps",
+ .data = &sysctl_pghot_promote_rate_limit,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ },
};
#endif
@@ -193,8 +206,13 @@ int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
old_freq = (hotness >> PGHOT_FREQ_SHIFT) & PGHOT_FREQ_MASK;
old_time = (hotness >> PGHOT_TIME_SHIFT) & PGHOT_TIME_MASK;
- if (((time - old_time) > msecs_to_jiffies(sysctl_pghot_freq_window))
- || (nid != NUMA_NO_NODE && old_nid != nid))
+ /*
+ * Bypass the new window logic for NUMA hint fault source
+ * as it is too slow in reporting accesses.
+ * TODO: Fix this.
+ */
+ if ((((time - old_time) > msecs_to_jiffies(sysctl_pghot_freq_window))
+ && (src != PGHOT_HINT_FAULT)) || (nid != NUMA_NO_NODE && old_nid != nid))
new_window = true;
if (new_window)
@@ -220,6 +238,110 @@ int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
return 0;
}
+/*
+ * For memory tiering mode, if there are enough free pages (more than
+ * enough watermark defined here) in fast memory node, to take full
+ * advantage of fast memory capacity, all recently accessed slow
+ * memory pages will be migrated to fast memory node without
+ * considering hot threshold.
+ */
+static bool pgdat_free_space_enough(struct pglist_data *pgdat)
+{
+ int z;
+ unsigned long enough_wmark;
+
+ enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT,
+ pgdat->node_present_pages >> 4);
+ for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+ struct zone *zone = pgdat->node_zones + z;
+
+ if (!populated_zone(zone))
+ continue;
+
+ if (zone_watermark_ok(zone, 0,
+ promo_wmark_pages(zone) + enough_wmark,
+ ZONE_MOVABLE, 0))
+ return true;
+ }
+ return false;
+}
+
+/*
+ * For memory tiering mode, too high promotion/demotion throughput may
+ * hurt application latency. So we provide a mechanism to rate limit
+ * the number of pages that are tried to be promoted.
+ */
+static bool kmigrated_promotion_rate_limit(struct pglist_data *pgdat, unsigned long rate_limit,
+ int nr, unsigned long now_ms)
+{
+ unsigned long nr_cand;
+ unsigned int start;
+
+ mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
+ nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
+ start = pgdat->nbp_rl_start;
+ if (now_ms - start > MSEC_PER_SEC &&
+ cmpxchg(&pgdat->nbp_rl_start, start, now_ms) == start)
+ pgdat->nbp_rl_nr_cand = nr_cand;
+ if (nr_cand - pgdat->nbp_rl_nr_cand >= rate_limit)
+ return true;
+ return false;
+}
+
+static void kmigrated_promotion_adjust_threshold(struct pglist_data *pgdat,
+ unsigned long rate_limit, unsigned int ref_th,
+ unsigned long now_ms)
+{
+ unsigned int start, th_period, unit_th, th;
+ unsigned long nr_cand, ref_cand, diff_cand;
+
+ th_period = KMIGRATED_PROMOTION_THRESHOLD_WINDOW;
+ start = pgdat->nbp_th_start;
+ if (now_ms - start > th_period &&
+ cmpxchg(&pgdat->nbp_th_start, start, now_ms) == start) {
+ ref_cand = rate_limit *
+ KMIGRATED_PROMOTION_THRESHOLD_WINDOW / MSEC_PER_SEC;
+ nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
+ diff_cand = nr_cand - pgdat->nbp_th_nr_cand;
+ unit_th = ref_th * 2 / KMIGRATED_MIGRATION_ADJUST_STEPS;
+ th = pgdat->nbp_threshold ? : ref_th;
+ if (diff_cand > ref_cand * 11 / 10)
+ th = max(th - unit_th, unit_th);
+ else if (diff_cand < ref_cand * 9 / 10)
+ th = min(th + unit_th, ref_th * 2);
+ pgdat->nbp_th_nr_cand = nr_cand;
+ pgdat->nbp_threshold = th;
+ }
+}
+
+static bool kmigrated_should_migrate_memory(unsigned long nr_pages, unsigned long nid,
+ unsigned long time)
+{
+ struct pglist_data *pgdat;
+ unsigned long rate_limit;
+ unsigned int th, def_th;
+ unsigned long now = jiffies;
+ unsigned long now_ms = jiffies_to_msecs(now);
+
+ pgdat = NODE_DATA(nid);
+ if (pgdat_free_space_enough(pgdat)) {
+ /* workload changed, reset hot threshold */
+ pgdat->nbp_threshold = 0;
+ mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE_NRL, nr_pages);
+ return true;
+ }
+
+ def_th = sysctl_pghot_freq_window;
+ rate_limit = MB_TO_PAGES(sysctl_pghot_promote_rate_limit);
+ kmigrated_promotion_adjust_threshold(pgdat, rate_limit, def_th, now_ms);
+
+ th = pgdat->nbp_threshold ? : def_th;
+ if (jiffies_to_msecs(now - time) >= th)
+ return false;
+
+ return !kmigrated_promotion_rate_limit(pgdat, rate_limit, nr_pages, now_ms);
+}
+
static int pghot_get_hotness(unsigned long pfn, unsigned long *nid, unsigned long *freq,
unsigned long *time)
{
@@ -287,6 +409,9 @@ static void kmigrated_walk_zone(unsigned long start_pfn, unsigned long end_pfn,
if (folio_nid(folio) == nid)
goto out_next;
+ if (!kmigrated_should_migrate_memory(nr, nid, time))
+ goto out_next;
+
if (migrate_misplaced_folio_prepare(folio, NULL, nid))
goto out_next;
--
2.34.1
^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [RFC PATCH v3 0/8] mm: Hot page tracking and promotion infrastructure
2025-11-10 5:23 [RFC PATCH v3 0/8] mm: Hot page tracking and promotion infrastructure Bharata B Rao
` (7 preceding siblings ...)
2025-11-10 5:23 ` [RFC PATCH v3 8/8] mm: sched: Move hot page promotion from NUMAB=2 to pghot tracking Bharata B Rao
@ 2025-11-19 13:06 ` Bharata B Rao
8 siblings, 0 replies; 12+ messages in thread
From: Bharata B Rao @ 2025-11-19 13:06 UTC (permalink / raw)
To: bharata
Cc: Jonathan.Cameron, akpm, alok.rathore, balbirs, byungchul,
dave.hansen, dave, david, gourry, joshua.hahnjy, kinseyho,
linux-kernel, linux-mm, mgorman, mingo, nifan.cxl, peterz,
raghavendra.kt, riel, rientjes, shivankg, sj, weixugc, willy,
xuezhengchu, yiannis, ying.huang, yuanchu, ziy
On 10-Nov-25 10:53 AM, Bharata B Rao wrote:
<snip>
> Results
> =======
Earlier I included results from the scenario where there was enough free
memory in the toptier node and hence demotions weren't getting triggered.
Here I am including results from a similar microbenchmark that results in
demotion too.
System details
--------------
3 node AMD Zen5 system with 2 regular NUMA nodes (0, 1) and a CXL node (2)
$ numactl -H
available: 3 nodes (0-2)
node 0 cpus: 0-95,192-287
node 0 size: 128460 MB
node 1 cpus: 96-191,288-383
node 1 size: 128893 MB
node 2 cpus:
node 2 size: 257993 MB
node distances:
node   0   1   2
  0:  10  32  50
  1:  32  10  60
  2: 255 255  10
Microbenchmark details
----------------------
A single-threaded application allocates memory on both the DRAM and CXL
nodes using mmap(MAP_POPULATE). Every 1GB region of memory allocated on the
CXL node is accessed at 4K granularity, randomly and repetitively, to build
up the notion of hotness in the region under access. This should drive
promotion.
For promotion to succeed, the DRAM memory that has been provisioned (but is
not being accessed) must be demoted first. There is enough free memory in
the CXL node for these demotions.
In summary, the benchmark creates memory pressure on the DRAM node and
performs CXL memory accesses to drive both demotion and promotion.
The number of accesses is fixed, hence the quicker the accessed pages get
promoted to DRAM, the sooner the benchmark is expected to finish. A rough
sketch of the per-region access loop is shown after the parameters below.
DRAM-node = 1
CXL-node = 2
Initial DRAM alloc ratio = 75%
Allocation-size = 171798691840
Initial DRAM Alloc-size = 128849018880
Initial CXL Alloc-size = 42949672960
Hot-region-size = 1073741824
Nr-regions = 160
Nr-regions DRAM = 120 (provisioned but not accessed)
Nr-hot-regions CXL = 40
Access pattern = random
Access granularity = 4096
Delay b/n accesses = 0
Load/store ratio = 50l50s
THP used = no
Nr accesses = 42949672960
Nr repetitions = 1024
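The per-region access loop looks roughly like the sketch below. This is a
hypothetical reconstruction from the parameters above, not the actual
benchmark source; the region is assumed to be already mapped with
mmap(MAP_POPULATE) and bound to the CXL node:

  #include <stdint.h>
  #include <stdlib.h>

  #define REGION_SZ	(1UL << 30)	/* Hot-region-size = 1GB */
  #define GRANULARITY	4096UL		/* Access granularity */

  /*
   * Randomly and repetitively touch one 1GB CXL-backed region at 4K
   * granularity with a 50/50 load/store mix (50l50s).
   */
  static void touch_region(volatile uint8_t *region, unsigned long nr_accesses)
  {
  	unsigned long i, off;

  	for (i = 0; i < nr_accesses; i++) {
  		off = ((unsigned long)random() % (REGION_SZ / GRANULARITY)) *
  		      GRANULARITY;
  		if (i & 1)
  			region[off] = (uint8_t)i;	/* store */
  		else
  			(void)region[off];		/* load */
  	}
  }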
Hotness sources
---------------
NUMAB0 - NUMA Balancing disabled in the base case and no hotness source
enabled in the patched case. No migrations.
NUMAB2 - Existing hot page promotion in the base case and NUMA hint
faults as the source in the patched case.
pgtscan - klruscand (MGLRU-based PTE A-bit scanning) as the source
hwhints - IBS as the source
Time taken (microseconds, lower is better)
----------------------------------------------
Source    Base          Patched       Change
----------------------------------------------
NUMAB0    63,036,030    64,441,675    +2.2%
NUMAB2    62,286,691    68,786,394    +10.4%(#)
pgtscan   NA            68,702,226
hwhints   NA            67,455,607
----------------------------------------------
Pages migrated (pgpromote_success)
----------------------------------------------
Source    Base          Patched
----------------------------------------------
NUMAB0    0             0
NUMAB2    82,134(*)     0(#)
pgtscan   NA            6,561,136
hwhints   NA            3,293($)
----------------------------------------------
(#) Unlike base NUMAB2, pghot migrates after 2 accesses.
Getting two successive accesses within the observation window is hard with
NUMA hint faults. The default sysctl_numa_balancing_scan_size of 256MB is
too small to generate a significant number of hint faults.
(*) High run-to-run variation, so the average isn't really representative.
The hint fault latency mostly comes out higher than the default 1s
threshold, preventing migrations.
($) Sampling limitation
Pages demoted (pgdemote_kswapd+pgdemote_direct)
(This data is not really a comparison point; the numbers are provided
just to show that the workload results in both promotion and demotion)
----------------------------------------------
Source    Base          Patched
----------------------------------------------
NUMAB0    5,222,366     5,341,502
NUMAB2    5,256,310     5,325,845
pgtscan   NA            5,317,709
hwhints   NA            5,287,091
----------------------------------------------
Promotion candidate pages (pgpromote_candidate)
----------------------------------------------
Source    Base          Patched
----------------------------------------------
NUMAB0    0             0
NUMAB2    82,848        0
pgtscan   NA            0
hwhints   NA            0
----------------------------------------------
Non-rate limited Promotion candidate pages (pgpromote_candidate_nrl)
----------------------------------------------
Source    Base          Patched
----------------------------------------------
NUMAB0    0             0
NUMAB2    0             0
pgtscan   NA            6,561,147
hwhints   NA            3,292
----------------------------------------------
^ permalink raw reply [flat|nested] 12+ messages in thread