* [RFC PATCH v2 1/8] mm: migrate: Allow misplaced migration without VMA too
2025-09-10 14:46 [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure Bharata B Rao
@ 2025-09-10 14:46 ` Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 2/8] migrate: implement migrate_misplaced_folios_batch Bharata B Rao
` (7 subsequent siblings)
8 siblings, 0 replies; 53+ messages in thread
From: Bharata B Rao @ 2025-09-10 14:46 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Jonathan.Cameron, dave.hansen, gourry, hannes, mgorman, mingo,
peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy,
ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm,
david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore, Bharata B Rao
We want isolation of misplaced folios to work in contexts
where a VMA isn't available. In order to prepare for that,
allow migrate_misplaced_folio_prepare() to be called with
a NULL VMA.
When migrate_misplaced_folio_prepare() is called with a non-NULL
VMA, it checks whether the folio is mapped shared, and that check
requires the PTL to be held. This path isn't taken when the function
is invoked with a NULL VMA (migration outside of process context).
Hence in such cases it is not necessary for this function to be
called with the PTL held.
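For illustration, a minimal sketch of such an out-of-process-context
caller (folio, target_nid and migrate_list are placeholders here;
patch 3/8 adds the actual user in kpromoted_isolate_folio()):
	/* No VMA, hence no PTL held around the call */
	if (!migrate_misplaced_folio_prepare(folio, NULL, target_nid))
		list_add(&folio->lru, &migrate_list);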
Signed-off-by: Bharata B Rao <bharata@amd.com>
---
mm/migrate.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 425401b2d4e1..7e356c0b1b5a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2619,7 +2619,8 @@ static struct folio *alloc_misplaced_dst_folio(struct folio *src,
/*
* Prepare for calling migrate_misplaced_folio() by isolating the folio if
- * permitted. Must be called with the PTL still held.
+ * permitted. Must be called with the PTL still held if called with a non-NULL
+ * vma.
*/
int migrate_misplaced_folio_prepare(struct folio *folio,
struct vm_area_struct *vma, int node)
@@ -2636,7 +2637,7 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
* See folio_maybe_mapped_shared() on possible imprecision
* when we cannot easily detect if a folio is shared.
*/
- if ((vma->vm_flags & VM_EXEC) && folio_maybe_mapped_shared(folio))
+ if (vma && (vma->vm_flags & VM_EXEC) && folio_maybe_mapped_shared(folio))
return -EACCES;
/*
--
2.34.1
^ permalink raw reply [flat|nested] 53+ messages in thread
* [RFC PATCH v2 2/8] migrate: implement migrate_misplaced_folios_batch
2025-09-10 14:46 [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 1/8] mm: migrate: Allow misplaced migration without VMA too Bharata B Rao
@ 2025-09-10 14:46 ` Bharata B Rao
2025-10-03 10:36 ` Jonathan Cameron
2025-09-10 14:46 ` [RFC PATCH v2 3/8] mm: Hot page tracking and promotion Bharata B Rao
` (6 subsequent siblings)
8 siblings, 1 reply; 53+ messages in thread
From: Bharata B Rao @ 2025-09-10 14:46 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Jonathan.Cameron, dave.hansen, gourry, hannes, mgorman, mingo,
peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy,
ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm,
david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore, Bharata B Rao
From: Gregory Price <gourry@gourry.net>
A common operation in tiering is to migrate multiple pages at once.
The migrate_misplaced_folio function requires one call for each
individual folio. Expose a batch-variant of the same call for use
when doing batch migrations.
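For illustration, a rough usage sketch (folio and target_nid are
placeholders; patch 3/8 adds the actual caller in kpromoted_migrate()):
	LIST_HEAD(migrate_list);
	int err;

	/* folios previously isolated via migrate_misplaced_folio_prepare() */
	list_add(&folio->lru, &migrate_list);

	/* one call migrates the whole list (or puts the folios back) */
	err = migrate_misplaced_folios_batch(&migrate_list, target_nid);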
Signed-off-by: Gregory Price <gourry@gourry.net>
Signed-off-by: Bharata B Rao <bharata@amd.com>
---
include/linux/migrate.h | 6 ++++++
mm/migrate.c | 31 +++++++++++++++++++++++++++++++
2 files changed, 37 insertions(+)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index acadd41e0b5c..0593f5869be8 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -107,6 +107,7 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
int migrate_misplaced_folio_prepare(struct folio *folio,
struct vm_area_struct *vma, int node);
int migrate_misplaced_folio(struct folio *folio, int node);
+int migrate_misplaced_folios_batch(struct list_head *foliolist, int node);
#else
static inline int migrate_misplaced_folio_prepare(struct folio *folio,
struct vm_area_struct *vma, int node)
@@ -117,6 +118,11 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
{
return -EAGAIN; /* can't migrate now */
}
+static inline int migrate_misplaced_folios_batch(struct list_head *foliolist,
+ int node)
+{
+ return -EAGAIN; /* can't migrate now */
+}
#endif /* CONFIG_NUMA_BALANCING */
#ifdef CONFIG_MIGRATION
diff --git a/mm/migrate.c b/mm/migrate.c
index 7e356c0b1b5a..1268a95eda0e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2714,5 +2714,36 @@ int migrate_misplaced_folio(struct folio *folio, int node)
BUG_ON(!list_empty(&migratepages));
return nr_remaining ? -EAGAIN : 0;
}
+
+/*
+ * Batch variant of migrate_misplaced_folio. Attempts to migrate
+ * a folio list to the specified destination.
+ *
+ * Caller is expected to have isolated the folios by calling
+ * migrate_misplaced_folio_prepare(), which will result in an
+ * elevated reference count on the folio.
+ *
+ * This function will un-isolate the folios, dereference them, and
+ * remove them from the list before returning.
+ */
+int migrate_misplaced_folios_batch(struct list_head *folio_list, int node)
+{
+ pg_data_t *pgdat = NODE_DATA(node);
+ unsigned int nr_succeeded;
+ int nr_remaining;
+
+ nr_remaining = migrate_pages(folio_list, alloc_misplaced_dst_folio,
+ NULL, node, MIGRATE_ASYNC,
+ MR_NUMA_MISPLACED, &nr_succeeded);
+ if (nr_remaining)
+ putback_movable_pages(folio_list);
+
+ if (nr_succeeded) {
+ count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
+ mod_node_page_state(pgdat, PGPROMOTE_SUCCESS, nr_succeeded);
+ }
+ BUG_ON(!list_empty(folio_list));
+ return nr_remaining ? -EAGAIN : 0;
+}
#endif /* CONFIG_NUMA_BALANCING */
#endif /* CONFIG_NUMA */
--
2.34.1
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 2/8] migrate: implement migrate_misplaced_folios_batch
2025-09-10 14:46 ` [RFC PATCH v2 2/8] migrate: implement migrate_misplaced_folios_batch Bharata B Rao
@ 2025-10-03 10:36 ` Jonathan Cameron
2025-10-03 11:02 ` Bharata B Rao
0 siblings, 1 reply; 53+ messages in thread
From: Jonathan Cameron @ 2025-10-03 10:36 UTC (permalink / raw)
To: Bharata B Rao
Cc: linux-kernel, linux-mm, dave.hansen, gourry, hannes, mgorman,
mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc,
willy, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis,
akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu,
balbirs, alok.rathore
On Wed, 10 Sep 2025 20:16:47 +0530
Bharata B Rao <bharata@amd.com> wrote:
> From: Gregory Price <gourry@gourry.net>
>
> A common operation in tiering is to migrate multiple pages at once.
> The migrate_misplaced_folio function requires one call for each
> individual folio. Expose a batch-variant of the same call for use
> when doing batch migrations.
>
I probably missed an earlier discussion of this but what does the
_batch postfix add over the plural (folios)?
> Signed-off-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: Bharata B Rao <bharata@amd.com>
> ---
> include/linux/migrate.h | 6 ++++++
> mm/migrate.c | 31 +++++++++++++++++++++++++++++++
> 2 files changed, 37 insertions(+)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index acadd41e0b5c..0593f5869be8 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -107,6 +107,7 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
> int migrate_misplaced_folio_prepare(struct folio *folio,
> struct vm_area_struct *vma, int node);
> int migrate_misplaced_folio(struct folio *folio, int node);
> +int migrate_misplaced_folios_batch(struct list_head *foliolist, int node);
> #else
> static inline int migrate_misplaced_folio_prepare(struct folio *folio,
> struct vm_area_struct *vma, int node)
> @@ -117,6 +118,11 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
> {
> return -EAGAIN; /* can't migrate now */
> }
> +static inline int migrate_misplaced_folios_batch(struct list_head *foliolist,
> + int node)
> +{
> + return -EAGAIN; /* can't migrate now */
> +}
> #endif /* CONFIG_NUMA_BALANCING */
>
> #ifdef CONFIG_MIGRATION
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 7e356c0b1b5a..1268a95eda0e 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2714,5 +2714,36 @@ int migrate_misplaced_folio(struct folio *folio, int node)
> BUG_ON(!list_empty(&migratepages));
> return nr_remaining ? -EAGAIN : 0;
> }
> +
> +/*
Kernel-doc perhaps appropriate?
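Something along these lines perhaps (a sketch derived from the existing
comment, not the actual patch):
/**
 * migrate_misplaced_folios_batch - batch variant of migrate_misplaced_folio()
 * @folio_list: folios already isolated with migrate_misplaced_folio_prepare()
 * @node: target node to migrate the folios to
 *
 * Un-isolates the folios, drops the references taken during isolation
 * and empties @folio_list before returning.
 *
 * Return: 0 on success, -EAGAIN if some folios could not be migrated.
 */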
> + * Batch variant of migrate_misplaced_folio. Attempts to migrate
> + * a folio list to the specified destination.
> + *
> + * Caller is expected to have isolated the folios by calling
> + * migrate_misplaced_folio_prepare(), which will result in an
> + * elevated reference count on the folio.
> + *
> + * This function will un-isolate the folios, dereference them, and
> + * remove them from the list before returning.
> + */
> +int migrate_misplaced_folios_batch(struct list_head *folio_list, int node)
> +{
> + pg_data_t *pgdat = NODE_DATA(node);
> + unsigned int nr_succeeded;
> + int nr_remaining;
> +
> + nr_remaining = migrate_pages(folio_list, alloc_misplaced_dst_folio,
> + NULL, node, MIGRATE_ASYNC,
> + MR_NUMA_MISPLACED, &nr_succeeded);
> + if (nr_remaining)
> + putback_movable_pages(folio_list);
> +
> + if (nr_succeeded) {
> + count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
> + mod_node_page_state(pgdat, PGPROMOTE_SUCCESS, nr_succeeded);
> + }
> + BUG_ON(!list_empty(folio_list));
> + return nr_remaining ? -EAGAIN : 0;
> +}
> #endif /* CONFIG_NUMA_BALANCING */
> #endif /* CONFIG_NUMA */
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 2/8] migrate: implement migrate_misplaced_folios_batch
2025-10-03 10:36 ` Jonathan Cameron
@ 2025-10-03 11:02 ` Bharata B Rao
0 siblings, 0 replies; 53+ messages in thread
From: Bharata B Rao @ 2025-10-03 11:02 UTC (permalink / raw)
To: Jonathan Cameron
Cc: linux-kernel, linux-mm, dave.hansen, gourry, hannes, mgorman,
mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc,
willy, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis,
akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu,
balbirs, alok.rathore
On 03-Oct-25 4:06 PM, Jonathan Cameron wrote:
> On Wed, 10 Sep 2025 20:16:47 +0530
> Bharata B Rao <bharata@amd.com> wrote:
>
>> From: Gregory Price <gourry@gourry.net>
>>
>> A common operation in tiering is to migrate multiple pages at once.
>> The migrate_misplaced_folio function requires one call for each
>> individual folio. Expose a batch-variant of the same call for use
>> when doing batch migrations.
>>
> I probably missed an earlier discussion of this but what does the
> _batch postfix add over the plural (folios)?
https://lore.kernel.org/linux-mm/15744682-72ea-472f-9af1-50c3494c0b78@redhat.com/
>
>> Signed-off-by: Gregory Price <gourry@gourry.net>
>> Signed-off-by: Bharata B Rao <bharata@amd.com>
>> ---
>> include/linux/migrate.h | 6 ++++++
>> mm/migrate.c | 31 +++++++++++++++++++++++++++++++
>> 2 files changed, 37 insertions(+)
>>
>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>> index acadd41e0b5c..0593f5869be8 100644
>> --- a/include/linux/migrate.h
>> +++ b/include/linux/migrate.h
>> @@ -107,6 +107,7 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
>> int migrate_misplaced_folio_prepare(struct folio *folio,
>> struct vm_area_struct *vma, int node);
>> int migrate_misplaced_folio(struct folio *folio, int node);
>> +int migrate_misplaced_folios_batch(struct list_head *foliolist, int node);
>> #else
>> static inline int migrate_misplaced_folio_prepare(struct folio *folio,
>> struct vm_area_struct *vma, int node)
>> @@ -117,6 +118,11 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
>> {
>> return -EAGAIN; /* can't migrate now */
>> }
>> +static inline int migrate_misplaced_folios_batch(struct list_head *foliolist,
>> + int node)
>> +{
>> + return -EAGAIN; /* can't migrate now */
>> +}
>> #endif /* CONFIG_NUMA_BALANCING */
>>
>> #ifdef CONFIG_MIGRATION
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 7e356c0b1b5a..1268a95eda0e 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -2714,5 +2714,36 @@ int migrate_misplaced_folio(struct folio *folio, int node)
>> BUG_ON(!list_empty(&migratepages));
>> return nr_remaining ? -EAGAIN : 0;
>> }
>> +
>> +/*
>
> Kernel-doc perhaps appropriate?
Probably yes, will take care in next iteration.
Thanks for looking into this patchset.
Regards,
Bharata.
^ permalink raw reply [flat|nested] 53+ messages in thread
* [RFC PATCH v2 3/8] mm: Hot page tracking and promotion
2025-09-10 14:46 [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 1/8] mm: migrate: Allow misplaced migration without VMA too Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 2/8] migrate: implement migrate_misplaced_folios_batch Bharata B Rao
@ 2025-09-10 14:46 ` Bharata B Rao
2025-10-03 11:17 ` Jonathan Cameron
2025-09-10 14:46 ` [RFC PATCH v2 4/8] x86: ibs: In-kernel IBS driver for memory access profiling Bharata B Rao
` (5 subsequent siblings)
8 siblings, 1 reply; 53+ messages in thread
From: Bharata B Rao @ 2025-09-10 14:46 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Jonathan.Cameron, dave.hansen, gourry, hannes, mgorman, mingo,
peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy,
ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm,
david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore, Bharata B Rao
This introduces a sub-system for collecting memory access
information from different sources. It maintains the hotness
information based on the access history and time of access.
Additionally, it provides per-toptier-node kernel threads
(named kpromoted) that periodically promote the pages that
are eligible for promotion.
Sub-systems that generate hot page access info can report that
using this API:
int pghot_record_access(u64 pfn, int nid, int src,
unsigned long time)
@pfn: The PFN of the memory accessed
@nid: The accessing NUMA node ID
@src: The temperature source (sub-system) that generated the
access info
@time: The access time in jiffies
Some temperature sources may not provide the nid from which
the page was accessed. This is true for sources that use
page table scanning for PTE Accessed bit. For such sources,
the default toptier node to which such pages should be promoted
is hard coded.
Also, the access time provided by some sources may at best be
considered approximate. This is especially true for hot pages
detected by PTE A bit scanning.
The hot PFN records are stored in hash lists hashed by PFN value.
The PFN records that are categorized as hot enough to be promoted
are maintained in a per-toptier-node max heap from which
kpromoted extracts and promotes them.
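For illustration, a reporting call from a temperature source could look
like this (paddr is a placeholder for the sampled physical address; the
IBS driver in patch 4/8 is the actual caller):
	pghot_record_access(PHYS_PFN(paddr), numa_node_id(),
			    PGHOT_HW_HINTS, jiffies);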
Signed-off-by: Bharata B Rao <bharata@amd.com>
---
include/linux/mmzone.h | 11 +
include/linux/pghot.h | 96 +++++++
include/linux/vm_event_item.h | 9 +
mm/Kconfig | 11 +
mm/Makefile | 1 +
mm/mm_init.c | 10 +
mm/pghot.c | 524 ++++++++++++++++++++++++++++++++++
mm/vmstat.c | 9 +
8 files changed, 671 insertions(+)
create mode 100644 include/linux/pghot.h
create mode 100644 mm/pghot.c
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0c5da9141983..f7094babed10 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1349,6 +1349,10 @@ struct memory_failure_stats {
};
#endif
+#ifdef CONFIG_PGHOT
+#include <linux/pghot.h>
+#endif
+
/*
* On NUMA machines, each NUMA node would have a pg_data_t to describe
* it's memory layout. On UMA machines there is a single pglist_data which
@@ -1497,6 +1501,13 @@ typedef struct pglist_data {
#ifdef CONFIG_MEMORY_FAILURE
struct memory_failure_stats mf_stats;
#endif
+#ifdef CONFIG_PGHOT
+ struct task_struct *kpromoted;
+ wait_queue_head_t kpromoted_wait;
+ struct pghot_info **phi_buf;
+ struct max_heap heap;
+ spinlock_t heap_lock;
+#endif
} pg_data_t;
#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
diff --git a/include/linux/pghot.h b/include/linux/pghot.h
new file mode 100644
index 000000000000..1443643aab13
--- /dev/null
+++ b/include/linux/pghot.h
@@ -0,0 +1,96 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_KPROMOTED_H
+#define _LINUX_KPROMOTED_H
+
+#include <linux/types.h>
+#include <linux/init.h>
+#include <linux/workqueue_types.h>
+
+/* Page hotness temperature sources */
+enum pghot_src {
+ PGHOT_HW_HINTS,
+ PGHOT_PGTABLE_SCAN,
+ PGHOT_HINT_FAULT,
+};
+
+#ifdef CONFIG_PGHOT
+
+#define KPROMOTED_FREQ_WINDOW (5 * MSEC_PER_SEC)
+
+/* 2 accesses within a window will make the page a promotion candidate */
+#define KPROMOTED_FREQ_THRESHOLD 2
+
+#define PGHOT_FREQ_BITS 3
+#define PGHOT_NID_BITS 10
+#define PGHOT_TIME_BITS 19
+
+#define PGHOT_FREQ_MAX (1 << PGHOT_FREQ_BITS)
+#define PGHOT_NID_MAX (1 << PGHOT_NID_BITS)
+
+/*
+ * last_update is stored in 19 bits which can represent up to
+ * 8.73 min with HZ=1000
+ */
+#define PGHOT_TIME_MASK GENMASK_U32(PGHOT_TIME_BITS - 1, 0)
+
+/*
+ * The following two defines control the number of hash lists
+ * that are maintained for tracking PFN accesses.
+ */
+#define PGHOT_HASH_PCT 50 /* % of lower tier memory pages to track */
+#define PGHOT_HASH_ENTRIES 1024 /* Number of entries per list, ideal case */
+
+/*
+ * Percentage of hash entries that can reside in heap as migrate-ready
+ * candidates
+ */
+#define PGHOT_HEAP_PCT 25
+
+#define KPROMOTED_MIGRATE_BATCH 1024
+
+/*
+ * If target NID isn't available, kpromoted promotes to node 0
+ * by default.
+ *
+ * TODO: Need checks to validate that default node is indeed
+ * present and is a toptier node.
+ */
+#define KPROMOTED_DEFAULT_NODE 0
+
+struct pghot_info {
+ unsigned long pfn;
+
+ /*
+ * The following three fundamental parameters
+ * required to track the hotness of page/PFN are
+ * packed within a single u32.
+ */
+ u32 frequency:PGHOT_FREQ_BITS; /* Number of accesses within current window */
+ u32 nid:PGHOT_NID_BITS; /* Most recent access from this node */
+ u32 last_update:PGHOT_TIME_BITS; /* Most recent access time */
+
+ struct hlist_node hnode;
+ size_t heap_idx; /* Position in max heap for quick retrieval */
+};
+
+struct max_heap {
+ size_t nr;
+ size_t size;
+ struct pghot_info **data;
+ DECLARE_FLEX_ARRAY(struct pghot_info *, preallocated);
+};
+
+/*
+ * The wakeup interval of kpromoted threads
+ */
+#define KPROMOTE_DELAY 20 /* 20ms */
+
+int pghot_record_access(u64 pfn, int nid, int src, unsigned long now);
+#else
+static inline int pghot_record_access(u64 pfn, int nid, int src,
+ unsigned long now)
+{
+ return 0;
+}
+#endif /* CONFIG_PGHOT */
+#endif /* _LINUX_KPROMOTED_H */
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 9e15a088ba38..a996fa9df785 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -186,6 +186,15 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
KSTACK_REST,
#endif
#endif /* CONFIG_DEBUG_STACK_USAGE */
+ PGHOT_RECORDED_ACCESSES,
+ PGHOT_RECORD_HWHINTS,
+ PGHOT_RECORD_PGTSCANS,
+ PGHOT_RECORD_HINTFAULTS,
+ PGHOT_RECORDS_HASH,
+ PGHOT_RECORDS_HEAP,
+ KPROMOTED_RIGHT_NODE,
+ KPROMOTED_NON_LRU,
+ KPROMOTED_DROPPED,
NR_VM_EVENT_ITEMS
};
diff --git a/mm/Kconfig b/mm/Kconfig
index e443fe8cd6cf..8b236eb874cf 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1381,6 +1381,17 @@ config PT_RECLAIM
Note: now only empty user PTE page table pages will be reclaimed.
+config PGHOT
+ bool "Hot page tracking and promotion"
+ def_bool y
+ depends on NUMA && MIGRATION && MMU
+ select MIN_HEAP
+ help
+ A sub-system to track page accesses in lower tier memory and
+ maintain hot page information. Promotes hot pages from lower
+ tiers to top tier by using the memory access information provided
+ by various sources. Asynchronous promotion is done by per-node
+ kernel threads.
source "mm/damon/Kconfig"
diff --git a/mm/Makefile b/mm/Makefile
index ef54aa615d9d..ecdd5241bea8 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -147,3 +147,4 @@ obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
obj-$(CONFIG_EXECMEM) += execmem.o
obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
obj-$(CONFIG_PT_RECLAIM) += pt_reclaim.o
+obj-$(CONFIG_PGHOT) += pghot.o
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 5c21b3af216b..f7992be3ff7f 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1402,6 +1402,15 @@ static void pgdat_init_kcompactd(struct pglist_data *pgdat)
static void pgdat_init_kcompactd(struct pglist_data *pgdat) {}
#endif
+#ifdef CONFIG_PGHOT
+static void pgdat_init_kpromoted(struct pglist_data *pgdat)
+{
+ init_waitqueue_head(&pgdat->kpromoted_wait);
+}
+#else
+static void pgdat_init_kpromoted(struct pglist_data *pgdat) {}
+#endif
+
static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
{
int i;
@@ -1411,6 +1420,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
pgdat_init_split_queue(pgdat);
pgdat_init_kcompactd(pgdat);
+ pgdat_init_kpromoted(pgdat);
init_waitqueue_head(&pgdat->kswapd_wait);
init_waitqueue_head(&pgdat->pfmemalloc_wait);
diff --git a/mm/pghot.c b/mm/pghot.c
new file mode 100644
index 000000000000..9f7581818b8f
--- /dev/null
+++ b/mm/pghot.c
@@ -0,0 +1,524 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Maintains information about hot pages from slower tier nodes and
+ * promotes them.
+ *
+ * Info about accessed pages are stored in hash lists indexed by PFN.
+ * Info about pages that are hot enough to be promoted are stored in
+ * a per-toptier-node max_heap.
+ *
+ * kpromoted is a kernel thread that runs on each toptier node and
+ * promotes pages from max_heap.
+ */
+#include <linux/pghot.h>
+#include <linux/kthread.h>
+#include <linux/mmzone.h>
+#include <linux/migrate.h>
+#include <linux/memory-tiers.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include <linux/vmalloc.h>
+#include <linux/hashtable.h>
+#include <linux/min_heap.h>
+
+struct pghot_hash {
+ struct hlist_head hash;
+ spinlock_t lock;
+};
+
+static struct pghot_hash *phi_hash;
+static int phi_hash_order;
+static int phi_heap_entries;
+static struct kmem_cache *phi_cache __ro_after_init;
+static bool kpromoted_started __ro_after_init;
+
+static unsigned int sysctl_pghot_freq_window = KPROMOTED_FREQ_WINDOW;
+
+#ifdef CONFIG_SYSCTL
+static const struct ctl_table pghot_sysctls[] = {
+ {
+ .procname = "pghot_promote_freq_window_ms",
+ .data = &sysctl_pghot_freq_window,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ },
+};
+#endif
+static bool phi_heap_less(const void *lhs, const void *rhs, void *args)
+{
+ return (*(struct pghot_info **)lhs)->frequency >
+ (*(struct pghot_info **)rhs)->frequency;
+}
+
+static void phi_heap_swp(void *lhs, void *rhs, void *args)
+{
+ struct pghot_info **l = (struct pghot_info **)lhs;
+ struct pghot_info **r = (struct pghot_info **)rhs;
+ int lindex = l - (struct pghot_info **)args;
+ int rindex = r - (struct pghot_info **)args;
+ struct pghot_info *tmp = *l;
+
+ *l = *r;
+ *r = tmp;
+
+ (*l)->heap_idx = lindex;
+ (*r)->heap_idx = rindex;
+}
+
+static const struct min_heap_callbacks phi_heap_cb = {
+ .less = phi_heap_less,
+ .swp = phi_heap_swp,
+};
+
+static void phi_heap_update_entry(struct max_heap *phi_heap, struct pghot_info *phi)
+{
+ int orig_idx = phi->heap_idx;
+
+ min_heap_sift_up(phi_heap, phi->heap_idx, &phi_heap_cb,
+ phi_heap->data);
+ if (phi_heap->data[phi->heap_idx]->heap_idx == orig_idx)
+ min_heap_sift_down(phi_heap, phi->heap_idx,
+ &phi_heap_cb, phi_heap->data);
+}
+
+static bool phi_heap_insert(struct max_heap *phi_heap, struct pghot_info *phi)
+{
+ if (phi_heap->nr >= phi_heap_entries)
+ return false;
+
+ phi->heap_idx = phi_heap->nr;
+ min_heap_push(phi_heap, &phi, &phi_heap_cb, phi_heap->data);
+
+ return true;
+}
+
+static bool phi_is_pfn_hot(struct pghot_info *phi)
+{
+ struct page *page = pfn_to_online_page(phi->pfn);
+ unsigned long now = jiffies;
+ struct folio *folio;
+
+ if (!page || is_zone_device_page(page))
+ return false;
+
+ folio = page_folio(page);
+ if (!folio_test_lru(folio)) {
+ count_vm_event(KPROMOTED_NON_LRU);
+ return false;
+ }
+ if (folio_nid(folio) == phi->nid) {
+ count_vm_event(KPROMOTED_RIGHT_NODE);
+ return false;
+ }
+
+ return true;
+}
+
+static struct folio *kpromoted_isolate_folio(struct pghot_info *phi)
+{
+ struct page *page = pfn_to_page(phi->pfn);
+ struct folio *folio;
+
+ if (!page)
+ return NULL;
+
+ folio = page_folio(page);
+ if (migrate_misplaced_folio_prepare(folio, NULL, phi->nid))
+ return NULL;
+ else
+ return folio;
+}
+
+static struct pghot_info *phi_alloc(unsigned long pfn)
+{
+ struct pghot_info *phi;
+
+ phi = kmem_cache_zalloc(phi_cache, GFP_NOWAIT);
+ if (!phi)
+ return NULL;
+
+ phi->pfn = pfn;
+ phi->heap_idx = -1;
+ return phi;
+}
+
+static inline void phi_free(struct pghot_info *phi)
+{
+ kmem_cache_free(phi_cache, phi);
+}
+
+static int phi_heap_extract(pg_data_t *pgdat, int batch_count, int freq_th,
+ struct list_head *migrate_list, int *count)
+{
+ spinlock_t *phi_heap_lock = &pgdat->heap_lock;
+ struct max_heap *phi_heap = &pgdat->heap;
+ int max_retries = 10;
+ int bkt, i = 0;
+
+ if (batch_count < 0 || !migrate_list || !count || freq_th < 1 ||
+ freq_th > KPROMOTED_FREQ_THRESHOLD)
+ return -EINVAL;
+
+ *count = 0;
+ for (i = 0; i < batch_count; i++) {
+ struct pghot_info *top = NULL;
+ bool should_continue = false;
+ struct folio *folio;
+ int retries = 0;
+
+ while (retries < max_retries) {
+ spin_lock(phi_heap_lock);
+ if (phi_heap->nr > 0 && phi_heap->data[0]->frequency >= freq_th) {
+ should_continue = true;
+ bkt = hash_min(phi_heap->data[0]->pfn, phi_hash_order);
+ top = phi_heap->data[0];
+ }
+ spin_unlock(phi_heap_lock);
+
+ if (!should_continue)
+ goto done;
+
+ spin_lock(&phi_hash[bkt].lock);
+ spin_lock(phi_heap_lock);
+ if (phi_heap->nr == 0 || phi_heap->data[0] != top ||
+ phi_heap->data[0]->frequency < freq_th) {
+ spin_unlock(phi_heap_lock);
+ spin_unlock(&phi_hash[bkt].lock);
+ retries++;
+ continue;
+ }
+
+ top = phi_heap->data[0];
+ hlist_del_init(&top->hnode);
+
+ phi_heap->nr--;
+ if (phi_heap->nr > 0) {
+ phi_heap->data[0] = phi_heap->data[phi_heap->nr];
+ phi_heap->data[0]->heap_idx = 0;
+ min_heap_sift_down(phi_heap, 0, &phi_heap_cb,
+ phi_heap->data);
+ }
+
+ spin_unlock(phi_heap_lock);
+ spin_unlock(&phi_hash[bkt].lock);
+
+ if (!phi_is_pfn_hot(top)) {
+ count_vm_event(KPROMOTED_DROPPED);
+ goto skip;
+ }
+
+ folio = kpromoted_isolate_folio(top);
+ if (folio) {
+ list_add(&folio->lru, migrate_list);
+ (*count)++;
+ }
+skip:
+ phi_free(top);
+ break;
+ }
+ if (retries >= max_retries) {
+ pr_warn("%s: Too many retries\n", __func__);
+ break;
+ }
+
+ }
+done:
+ return 0;
+}
+
+static void phi_heap_add_or_adjust(struct pghot_info *phi)
+{
+ pg_data_t *pgdat = NODE_DATA(phi->nid);
+ struct max_heap *phi_heap = &pgdat->heap;
+
+ spin_lock(&pgdat->heap_lock);
+ if (phi->heap_idx >= 0 && phi->heap_idx < phi_heap->nr &&
+ phi_heap->data[phi->heap_idx] == phi) {
+ /* Entry exists in heap */
+ if (phi->frequency < KPROMOTED_FREQ_THRESHOLD) {
+ /* Below threshold, remove from the heap */
+ phi_heap->nr--;
+ if (phi->heap_idx < phi_heap->nr) {
+ phi_heap->data[phi->heap_idx] =
+ phi_heap->data[phi_heap->nr];
+ phi_heap->data[phi->heap_idx]->heap_idx =
+ phi->heap_idx;
+ min_heap_sift_down(phi_heap, phi->heap_idx,
+ &phi_heap_cb, phi_heap->data);
+ }
+ phi->heap_idx = -1;
+
+ } else {
+ /* Update position in heap */
+ phi_heap_update_entry(phi_heap, phi);
+ }
+ } else if (phi->frequency >= KPROMOTED_FREQ_THRESHOLD) {
+ /*
+ * Add to the heap. If heap is full we will have
+ * to wait for the next access reporting to elevate
+ * it to heap.
+ */
+ if (phi_heap_insert(phi_heap, phi))
+ count_vm_event(PGHOT_RECORDS_HEAP);
+ }
+ spin_unlock(&pgdat->heap_lock);
+}
+
+static struct pghot_info *phi_lookup(unsigned long pfn, int bkt)
+{
+ struct pghot_info *phi;
+
+ hlist_for_each_entry(phi, &phi_hash[bkt].hash, hnode) {
+ if (phi->pfn == pfn)
+ return phi;
+ }
+ return NULL;
+}
+
+/*
+ * Called by subsystems that generate page hotness/access information.
+ *
+ * @pfn: The PFN of the memory accessed
+ * @nid: The accessing NUMA node ID
+ * @src: The temperature source (sub-system) that generated the
+ * access info
+ * @time: The access time in jiffies
+ *
+ * Maintains the access records per PFN, classifies them as
+ * hot based on subsequent accesses and finally hands over
+ * them to kpromoted for migration.
+ */
+int pghot_record_access(u64 pfn, int nid, int src, unsigned long now)
+{
+ struct pghot_info *phi;
+ struct page *page;
+ struct folio *folio;
+ int bkt;
+ bool new_entry = false, new_window = false;
+ u32 cur_time = now & PGHOT_TIME_MASK;
+
+ if (!kpromoted_started)
+ return -EINVAL;
+
+ if (nid >= PGHOT_NID_MAX)
+ return -EINVAL;
+
+ count_vm_event(PGHOT_RECORDED_ACCESSES);
+
+ switch (src) {
+ case PGHOT_HW_HINTS:
+ count_vm_event(PGHOT_RECORD_HWHINTS);
+ break;
+ case PGHOT_PGTABLE_SCAN:
+ count_vm_event(PGHOT_RECORD_PGTSCANS);
+ break;
+ case PGHOT_HINT_FAULT:
+ count_vm_event(PGHOT_RECORD_HINTFAULTS);
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ /*
+ * Record only accesses from lower tiers.
+ */
+ if (node_is_toptier(pfn_to_nid(pfn)))
+ return 0;
+
+ /*
+ * Reject the non-migratable pages right away.
+ */
+ page = pfn_to_online_page(pfn);
+ if (!page || is_zone_device_page(page))
+ return 0;
+
+ folio = page_folio(page);
+ if (!folio_test_lru(folio))
+ return 0;
+
+ bkt = hash_min(pfn, phi_hash_order);
+ spin_lock(&phi_hash[bkt].lock);
+ phi = phi_lookup(pfn, bkt);
+ if (!phi) {
+ phi = phi_alloc(pfn);
+ if (!phi)
+ goto out;
+ new_entry = true;
+ }
+
+ /*
+ * If the previous access was beyond the threshold window
+ * start frequency tracking afresh.
+ */
+ if (((cur_time - phi->last_update) > msecs_to_jiffies(sysctl_pghot_freq_window)) ||
+ (nid != NUMA_NO_NODE && phi->nid != nid))
+ new_window = true;
+
+ if (new_entry || new_window) {
+ /* New window */
+ phi->frequency = 1; /* TODO: Factor in the history */
+ } else if (phi->frequency < PGHOT_FREQ_MAX)
+ phi->frequency++;
+ phi->last_update = cur_time;
+ phi->nid = (nid == NUMA_NO_NODE) ? KPROMOTED_DEFAULT_NODE : nid;
+
+ if (new_entry) {
+ /* Insert the new entry into hash table */
+ hlist_add_head(&phi->hnode, &phi_hash[bkt].hash);
+ count_vm_event(PGHOT_RECORDS_HASH);
+ } else {
+ /* Add/update the position in heap */
+ phi_heap_add_or_adjust(phi);
+ }
+out:
+ spin_unlock(&phi_hash[bkt].lock);
+ return 0;
+}
+
+/*
+ * Extract the hot page records and batch-migrate the
+ * hot pages.
+ */
+static void kpromoted_migrate(pg_data_t *pgdat)
+{
+ int count, ret;
+ LIST_HEAD(migrate_list);
+
+ /*
+ * Extract the top N elements from the heap that match
+ * the requested hotness threshold.
+ *
+ * PFNs ineligible from migration standpoint are removed
+ * from the heap and hash.
+ *
+ * Folios eligible for migration are isolated and returned
+ * in @migrate_list.
+ */
+ ret = phi_heap_extract(pgdat, KPROMOTED_MIGRATE_BATCH,
+ KPROMOTED_FREQ_THRESHOLD, &migrate_list, &count);
+ if (ret)
+ return;
+
+ if (!list_empty(&migrate_list))
+ migrate_misplaced_folios_batch(&migrate_list, pgdat->node_id);
+}
+
+static int kpromoted(void *p)
+{
+ pg_data_t *pgdat = (pg_data_t *)p;
+
+ while (!kthread_should_stop()) {
+ wait_event_timeout(pgdat->kpromoted_wait, false,
+ msecs_to_jiffies(KPROMOTE_DELAY));
+ kpromoted_migrate(pgdat);
+ }
+ return 0;
+}
+
+static int kpromoted_run(int nid)
+{
+ pg_data_t *pgdat = NODE_DATA(nid);
+ int ret = 0;
+
+ if (!node_is_toptier(nid))
+ return 0;
+
+ if (!pgdat->phi_buf) {
+ pgdat->phi_buf = vzalloc_node(phi_heap_entries * sizeof(struct pghot_info *),
+ nid);
+ if (!pgdat->phi_buf)
+ return -ENOMEM;
+
+ min_heap_init(&pgdat->heap, pgdat->phi_buf, phi_heap_entries);
+ spin_lock_init(&pgdat->heap_lock);
+ }
+
+ if (!pgdat->kpromoted)
+ pgdat->kpromoted = kthread_create_on_node(kpromoted, pgdat, nid,
+ "kpromoted%d", nid);
+ if (IS_ERR(pgdat->kpromoted)) {
+ ret = PTR_ERR(pgdat->kpromoted);
+ pgdat->kpromoted = NULL;
+ pr_info("Failed to start kpromoted%d, ret %d\n", nid, ret);
+ } else {
+ wake_up_process(pgdat->kpromoted);
+ }
+ return ret;
+}
+
+/*
+ * TODO: Handle cleanup during node offline.
+ */
+static int __init pghot_init(void)
+{
+ unsigned int hash_size;
+ size_t hash_entries;
+ size_t nr_pages = 0;
+ pg_data_t *pgdat;
+ int i, nid, ret;
+
+ /*
+ * Arrive at the hash and heap sizes based on the
+ * number of pages present in the lower tier nodes.
+ */
+ for_each_node_state(nid, N_MEMORY) {
+ if (!node_is_toptier(nid))
+ nr_pages += NODE_DATA(nid)->node_present_pages;
+ }
+
+ if (!nr_pages)
+ return 0;
+
+ hash_entries = nr_pages * PGHOT_HASH_PCT / 100;
+ hash_size = hash_entries / PGHOT_HASH_ENTRIES;
+ phi_hash_order = ilog2(hash_size);
+
+ phi_hash = vmalloc(sizeof(struct pghot_hash) * hash_size);
+ if (!phi_hash) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ for (i = 0; i < hash_size; i++) {
+ INIT_HLIST_HEAD(&phi_hash[i].hash);
+ spin_lock_init(&phi_hash[i].lock);
+ }
+
+ phi_cache = KMEM_CACHE(pghot_info, 0);
+ if (unlikely(!phi_cache)) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ phi_heap_entries = hash_entries * PGHOT_HEAP_PCT / 100;
+ for_each_node_state(nid, N_CPU) {
+ ret = kpromoted_run(nid);
+ if (ret)
+ goto out_stop_kthread;
+ }
+
+ register_sysctl_init("vm", pghot_sysctls);
+ kpromoted_started = true;
+ pr_info("pghot: Started page hotness monitoring and promotion thread\n");
+ pr_info("pghot: nr_pages %ld hash_size %d hash_entries %ld hash_order %d heap_entries %d\n",
+ nr_pages, hash_size, hash_entries, phi_hash_order, phi_heap_entries);
+ return 0;
+
+out_stop_kthread:
+ for_each_node_state(nid, N_CPU) {
+ pgdat = NODE_DATA(nid);
+ if (pgdat->kpromoted) {
+ kthread_stop(pgdat->kpromoted);
+ pgdat->kpromoted = NULL;
+ vfree(pgdat->phi_buf);
+ }
+ }
+out:
+ kmem_cache_destroy(phi_cache);
+ vfree(phi_hash);
+ return ret;
+}
+
+late_initcall(pghot_init)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 71cd1ceba191..ee122c2cd137 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1494,6 +1494,15 @@ const char * const vmstat_text[] = {
[I(KSTACK_REST)] = "kstack_rest",
#endif
#endif
+ [I(PGHOT_RECORDED_ACCESSES)] = "pghot_recorded_accesses",
+ [I(PGHOT_RECORD_HWHINTS)] = "pghot_recorded_hwhints",
+ [I(PGHOT_RECORD_PGTSCANS)] = "pghot_recorded_pgtscans",
+ [I(PGHOT_RECORD_HINTFAULTS)] = "pghot_recorded_hintfaults",
+ [I(PGHOT_RECORDS_HASH)] = "pghot_records_hash",
+ [I(PGHOT_RECORDS_HEAP)] = "pghot_records_heap",
+ [I(KPROMOTED_RIGHT_NODE)] = "kpromoted_right_node",
+ [I(KPROMOTED_NON_LRU)] = "kpromoted_non_lru",
+ [I(KPROMOTED_DROPPED)] = "kpromoted_dropped",
#undef I
#endif /* CONFIG_VM_EVENT_COUNTERS */
};
--
2.34.1
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 3/8] mm: Hot page tracking and promotion
2025-09-10 14:46 ` [RFC PATCH v2 3/8] mm: Hot page tracking and promotion Bharata B Rao
@ 2025-10-03 11:17 ` Jonathan Cameron
2025-10-06 4:13 ` Bharata B Rao
0 siblings, 1 reply; 53+ messages in thread
From: Jonathan Cameron @ 2025-10-03 11:17 UTC (permalink / raw)
To: Bharata B Rao
Cc: linux-kernel, linux-mm, dave.hansen, gourry, hannes, mingo,
peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy,
ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm,
david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore
On Wed, 10 Sep 2025 20:16:48 +0530
Bharata B Rao <bharata@amd.com> wrote:
> This introduces a sub-system for collecting memory access
> information from different sources. It maintains the hotness
> information based on the access history and time of access.
>
> Additionally, it provides per-toptier-node kernel threads
> (named kpromoted) that periodically promote the pages that
> are eligible for promotion.
>
> Sub-systems that generate hot page access info can report that
> using this API:
>
> int pghot_record_access(u64 pfn, int nid, int src,
> unsigned long time)
>
> @pfn: The PFN of the memory accessed
> @nid: The accessing NUMA node ID
> @src: The temperature source (sub-system) that generated the
> access info
> @time: The access time in jiffies
>
> Some temperature sources may not provide the nid from which
> the page was accessed. This is true for sources that use
> page table scanning for PTE Accessed bit. For such sources,
> the default toptier node to which such pages should be promoted
> is hard coded.
>
> Also, the access time provided by some sources may at best be
> considered approximate. This is especially true for hot pages
> detected by PTE A bit scanning.
>
> The hot PFN records are stored in hash lists hashed by PFN value.
> The PFN records that are categorized as hot enough to be promoted
> are maintained in a per-toptier-node max heap from which
> kpromoted extracts and promotes them.
>
> Signed-off-by: Bharata B Rao <bharata@amd.com>
A fairly superficial review only of this. At some point I'll aim to take a closer
look at the heap bit.
> ---
> include/linux/mmzone.h | 11 +
> include/linux/pghot.h | 96 +++++++
> include/linux/vm_event_item.h | 9 +
> mm/Kconfig | 11 +
> mm/Makefile | 1 +
> mm/mm_init.c | 10 +
> mm/pghot.c | 524 ++++++++++++++++++++++++++++++++++
> mm/vmstat.c | 9 +
> 8 files changed, 671 insertions(+)
> create mode 100644 include/linux/pghot.h
> create mode 100644 mm/pghot.c
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 0c5da9141983..f7094babed10 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> diff --git a/include/linux/pghot.h b/include/linux/pghot.h
> new file mode 100644
> index 000000000000..1443643aab13
> --- /dev/null
> +++ b/include/linux/pghot.h
> +
> +struct pghot_info {
> + unsigned long pfn;
> +
> + /*
> + * The following three fundamental parameters
> + * required to track the hotness of page/PFN are
> + * packed within a single u32.
> + */
> + u32 frequency:PGHOT_FREQ_BITS; /* Number of accesses within current window */
> + u32 nid:PGHOT_NID_BITS; /* Most recent access from this node */
> + u32 last_update:PGHOT_TIME_BITS; /* Most recent access time */
Add spaces around : I think to help the eye parse those.
> +
> + struct hlist_node hnode;
> + size_t heap_idx; /* Position in max heap for quick retrieval */
> +};
> +
> +struct max_heap {
> + size_t nr;
> + size_t size;
> + struct pghot_info **data;
> + DECLARE_FLEX_ARRAY(struct pghot_info *, preallocated);
That macro is all about use in unions rather than generally being needed.
Do you need that here rather than
struct pg_hot_info *preallocated[];
Can you add a __counted_by() marking?
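i.e. something along these lines (assuming 'size' tracks the number of
preallocated entries):
	struct max_heap {
		size_t nr;
		size_t size;
		struct pghot_info **data;
		struct pghot_info *preallocated[] __counted_by(size);
	};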
> +};
> diff --git a/mm/pghot.c b/mm/pghot.c
> new file mode 100644
> index 000000000000..9f7581818b8f
> --- /dev/null
> +++ b/mm/pghot.c
> @@ -0,0 +1,524 @@
> +
> +static struct folio *kpromoted_isolate_folio(struct pghot_info *phi)
> +{
> + struct page *page = pfn_to_page(phi->pfn);
> + struct folio *folio;
> +
> + if (!page)
> + return NULL;
> +
> + folio = page_folio(page);
> + if (migrate_misplaced_folio_prepare(folio, NULL, phi->nid))
> + return NULL;
> + else
else not needed.
> + return folio;
> +}
> +static int phi_heap_extract(pg_data_t *pgdat, int batch_count, int freq_th,
> + struct list_head *migrate_list, int *count)
> +{
> + spinlock_t *phi_heap_lock = &pgdat->heap_lock;
> + struct max_heap *phi_heap = &pgdat->heap;
> + int max_retries = 10;
> + int bkt, i = 0;
> +
> + if (batch_count < 0 || !migrate_list || !count || freq_th < 1 ||
> + freq_th > KPROMOTED_FREQ_THRESHOLD)
> + return -EINVAL;
> +
> + *count = 0;
> + for (i = 0; i < batch_count; i++) {
> + struct pghot_info *top = NULL;
> + bool should_continue = false;
> + struct folio *folio;
> + int retries = 0;
> +
> + while (retries < max_retries) {
> + spin_lock(phi_heap_lock);
> + if (phi_heap->nr > 0 && phi_heap->data[0]->frequency >= freq_th) {
> + should_continue = true;
> + bkt = hash_min(phi_heap->data[0]->pfn, phi_hash_order);
> + top = phi_heap->data[0];
> + }
> + spin_unlock(phi_heap_lock);
> +
> + if (!should_continue)
> + goto done;
> +
> + spin_lock(&phi_hash[bkt].lock);
> + spin_lock(phi_heap_lock);
> + if (phi_heap->nr == 0 || phi_heap->data[0] != top ||
> + phi_heap->data[0]->frequency < freq_th) {
> + spin_unlock(phi_heap_lock);
> + spin_unlock(&phi_hash[bkt].lock);
> + retries++;
> + continue;
> + }
> +
> + top = phi_heap->data[0];
> + hlist_del_init(&top->hnode);
> +
> + phi_heap->nr--;
> + if (phi_heap->nr > 0) {
> + phi_heap->data[0] = phi_heap->data[phi_heap->nr];
> + phi_heap->data[0]->heap_idx = 0;
> + min_heap_sift_down(phi_heap, 0, &phi_heap_cb,
> + phi_heap->data);
> + }
> +
> + spin_unlock(phi_heap_lock);
> + spin_unlock(&phi_hash[bkt].lock);
> +
> + if (!phi_is_pfn_hot(top)) {
> + count_vm_event(KPROMOTED_DROPPED);
> + goto skip;
> + }
> +
> + folio = kpromoted_isolate_folio(top);
> + if (folio) {
> + list_add(&folio->lru, migrate_list);
> + (*count)++;
> + }
> +skip:
> + phi_free(top);
> + break;
> + }
> + if (retries >= max_retries) {
> + pr_warn("%s: Too many retries\n", __func__);
> + break;
> + }
> +
> + }
> +done:
If that is all there is, I'd use an early return as tends to give
simpler code.
> + return 0;
> +}
> +
> +static void phi_heap_add_or_adjust(struct pghot_info *phi)
> +{
> + pg_data_t *pgdat = NODE_DATA(phi->nid);
> + struct max_heap *phi_heap = &pgdat->heap;
> +
> + spin_lock(&pgdat->heap_lock);
guard() perhaps.
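For example, with the guard() helper from <linux/cleanup.h> the lock is
dropped automatically on every return path:
	guard(spinlock)(&pgdat->heap_lock);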
> + if (phi->heap_idx >= 0 && phi->heap_idx < phi_heap->nr &&
> + phi_heap->data[phi->heap_idx] == phi) {
> + /* Entry exists in heap */
> + if (phi->frequency < KPROMOTED_FREQ_THRESHOLD) {
> + /* Below threshold, remove from the heap */
> + phi_heap->nr--;
> + if (phi->heap_idx < phi_heap->nr) {
> + phi_heap->data[phi->heap_idx] =
> + phi_heap->data[phi_heap->nr];
> + phi_heap->data[phi->heap_idx]->heap_idx =
> + phi->heap_idx;
> + min_heap_sift_down(phi_heap, phi->heap_idx,
> + &phi_heap_cb, phi_heap->data);
> + }
> + phi->heap_idx = -1;
> +
> + } else {
> + /* Update position in heap */
> + phi_heap_update_entry(phi_heap, phi);
> + }
> + } else if (phi->frequency >= KPROMOTED_FREQ_THRESHOLD) {
> + /*
> + * Add to the heap. If heap is full we will have
> + * to wait for the next access reporting to elevate
> + * it to heap.
> + */
> + if (phi_heap_insert(phi_heap, phi))
> + count_vm_event(PGHOT_RECORDS_HEAP);
> + }
> + spin_unlock(&pgdat->heap_lock);
> +}
> +
> +/*
> + * Called by subsystems that generate page hotness/access information.
> + *
> + * @pfn: The PFN of the memory accessed
> + * @nid: The accessing NUMA node ID
> + * @src: The temperature source (sub-system) that generated the
> + * access info
> + * @time: The access time in jiffies
> + *
> + * Maintains the access records per PFN, classifies them as
> + * hot based on subsequent accesses and finally hands over
> + * them to kpromoted for migration.
> + */
> +int pghot_record_access(u64 pfn, int nid, int src, unsigned long now)
> +{
> + struct pghot_info *phi;
> + struct page *page;
> + struct folio *folio;
> + int bkt;
> + bool new_entry = false, new_window = false;
> + u32 cur_time = now & PGHOT_TIME_MASK;
> +
> + if (!kpromoted_started)
> + return -EINVAL;
> +
> + if (nid >= PGHOT_NID_MAX)
> + return -EINVAL;
> +
> + count_vm_event(PGHOT_RECORDED_ACCESSES);
> +
> + switch (src) {
> + case PGHOT_HW_HINTS:
> + count_vm_event(PGHOT_RECORD_HWHINTS);
> + break;
> + case PGHOT_PGTABLE_SCAN:
> + count_vm_event(PGHOT_RECORD_PGTSCANS);
> + break;
> + case PGHOT_HINT_FAULT:
> + count_vm_event(PGHOT_RECORD_HINTFAULTS);
> + break;
> + default:
> + return -EINVAL;
> + }
> +
> + /*
> + * Record only accesses from lower tiers.
> + */
> + if (node_is_toptier(pfn_to_nid(pfn)))
> + return 0;
> +
> + /*
> + * Reject the non-migratable pages right away.
> + */
> + page = pfn_to_online_page(pfn);
> + if (!page || is_zone_device_page(page))
> + return 0;
> +
> + folio = page_folio(page);
> + if (!folio_test_lru(folio))
> + return 0;
> +
> + bkt = hash_min(pfn, phi_hash_order);
> + spin_lock(&phi_hash[bkt].lock);
If this doesn't get more complex later, use guard() and early returns on error.
> + phi = phi_lookup(pfn, bkt);
> + if (!phi) {
> + phi = phi_alloc(pfn);
> + if (!phi)
> + goto out;
Not an error? Add a comment on why not perhaps.
> + new_entry = true;
> + }
> +
> + /*
> + * If the previous access was beyond the threshold window
> + * start frequency tracking afresh.
> + */
> + if (((cur_time - phi->last_update) > msecs_to_jiffies(sysctl_pghot_freq_window)) ||
> + (nid != NUMA_NO_NODE && phi->nid != nid))
> + new_window = true;
> +
> + if (new_entry || new_window) {
> + /* New window */
> + phi->frequency = 1; /* TODO: Factor in the history */
> + } else if (phi->frequency < PGHOT_FREQ_MAX)
> + phi->frequency++;
> + phi->last_update = cur_time;
> + phi->nid = (nid == NUMA_NO_NODE) ? KPROMOTED_DEFAULT_NODE : nid;
> +
> + if (new_entry) {
> + /* Insert the new entry into hash table */
> + hlist_add_head(&phi->hnode, &phi_hash[bkt].hash);
> + count_vm_event(PGHOT_RECORDS_HASH);
> + } else {
> + /* Add/update the position in heap */
> + phi_heap_add_or_adjust(phi);
> + }
> +out:
> + spin_unlock(&phi_hash[bkt].lock);
> + return 0;
> +}
> +
> +/*
> + * Extract the hot page records and batch-migrate the
> + * hot pages.
Wrap comments to 80 chars.
> + */
> +static void kpromoted_migrate(pg_data_t *pgdat)
> +{
> + int count, ret;
> + LIST_HEAD(migrate_list);
> +
> + /*
> + * Extract the top N elements from the heap that match
> + * the requested hotness threshold.
> + *
> + * PFNs ineligible from migration standpoint are removed
> + * from the heap and hash.
> + *
> + * Folios eligible for migration are isolated and returned
> + * in @migrate_list.
> + */
> + ret = phi_heap_extract(pgdat, KPROMOTED_MIGRATE_BATCH,
> + KPROMOTED_FREQ_THRESHOLD, &migrate_list, &count);
> + if (ret)
> + return;
> +
> + if (!list_empty(&migrate_list))
> + migrate_misplaced_folios_batch(&migrate_list, pgdat->node_id);
> +}
> +
> +static int kpromoted(void *p)
> +{
> + pg_data_t *pgdat = (pg_data_t *)p;
Cast not needed.
pg_data_t *pgdat = p;
> +
> + while (!kthread_should_stop()) {
> + wait_event_timeout(pgdat->kpromoted_wait, false,
> + msecs_to_jiffies(KPROMOTE_DELAY));
> + kpromoted_migrate(pgdat);
> + }
> + return 0;
> +}
> +
> +static int kpromoted_run(int nid)
> +{
> + pg_data_t *pgdat = NODE_DATA(nid);
> + int ret = 0;
> +
> + if (!node_is_toptier(nid))
> + return 0;
> +
> + if (!pgdat->phi_buf) {
> + pgdat->phi_buf = vzalloc_node(phi_heap_entries * sizeof(struct pghot_info *),
> + nid);
I'd use sizeof(*pgdat->phi_buf) here to avoid need to check types match when reading the
code. Sadly there isn't a vcalloc_node().
> + if (!pgdat->phi_buf)
> + return -ENOMEM;
> +
> + min_heap_init(&pgdat->heap, pgdat->phi_buf, phi_heap_entries);
> + spin_lock_init(&pgdat->heap_lock);
> + }
> +
> + if (!pgdat->kpromoted)
> + pgdat->kpromoted = kthread_create_on_node(kpromoted, pgdat, nid,
> + "kpromoted%d", nid);
> + if (IS_ERR(pgdat->kpromoted)) {
> + ret = PTR_ERR(pgdat->kpromoted);
> + pgdat->kpromoted = NULL;
> + pr_info("Failed to start kpromoted%d, ret %d\n", nid, ret);
Unless there is going to be more in later patches that prevents it. Just
return here.
> + } else {
> + wake_up_process(pgdat->kpromoted);
> + }
> + return ret;
return 0; //after change suggested above.
> +}
> +
> +/*
> + * TODO: Handle cleanup during node offline.
> + */
> +static int __init pghot_init(void)
> +{
> + unsigned int hash_size;
> + size_t hash_entries;
> + size_t nr_pages = 0;
> + pg_data_t *pgdat;
> + int i, nid, ret;
> +
> + /*
> + * Arrive at the hash and heap sizes based on the
> + * number of pages present in the lower tier nodes.
Trivial: Wrap closer to 80 chars.
> + */
> + for_each_node_state(nid, N_MEMORY) {
> + if (!node_is_toptier(nid))
> + nr_pages += NODE_DATA(nid)->node_present_pages;
> + }
> +
> + if (!nr_pages)
> + return 0;
> +
> + hash_entries = nr_pages * PGHOT_HASH_PCT / 100;
> + hash_size = hash_entries / PGHOT_HASH_ENTRIES;
> + phi_hash_order = ilog2(hash_size);
> +
> + phi_hash = vmalloc(sizeof(struct pghot_hash) * hash_size);
Prefer sizeof(*phi_hash) so I don't need to check types match :)
vcalloc() probably appropriate here.
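e.g. a sketch of the suggested form:
	phi_hash = vcalloc(hash_size, sizeof(*phi_hash));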
> + if (!phi_hash) {
> + ret = -ENOMEM;
> + goto out;
Out label isn't clearly an 'error' which is a little confusing.
> + }
> +
> + for (i = 0; i < hash_size; i++) {
> + INIT_HLIST_HEAD(&phi_hash[i].hash);
> + spin_lock_init(&phi_hash[i].lock);
> + }
> +
> + phi_cache = KMEM_CACHE(pghot_info, 0);
> + if (unlikely(!phi_cache)) {
> + ret = -ENOMEM;
> + goto out;
Whilst not strictly necessary I'd add multiple labels so only things
that have been allocated are freed rather than relying on them being
NULL otherwise. Whilst not a correctness thing it makes it slightly
easier to check tear down paths are correct.
> + }
> +
> + phi_heap_entries = hash_entries * PGHOT_HEAP_PCT / 100;
> + for_each_node_state(nid, N_CPU) {
> + ret = kpromoted_run(nid);
> + if (ret)
> + goto out_stop_kthread;
> + }
> +
> + register_sysctl_init("vm", pghot_sysctls);
> + kpromoted_started = true;
> + pr_info("pghot: Started page hotness monitoring and promotion thread\n");
> + pr_info("pghot: nr_pages %ld hash_size %d hash_entries %ld hash_order %d heap_entries %d\n",
> + nr_pages, hash_size, hash_entries, phi_hash_order, phi_heap_entries);
> + return 0;
> +
> +out_stop_kthread:
> + for_each_node_state(nid, N_CPU) {
> + pgdat = NODE_DATA(nid);
> + if (pgdat->kpromoted) {
> + kthread_stop(pgdat->kpromoted);
> + pgdat->kpromoted = NULL;
> + vfree(pgdat->phi_buf);
> + }
> + }
> +out:
> + kmem_cache_destroy(phi_cache);
> + vfree(phi_hash);
> + return ret;
> +}
> +
> +late_initcall(pghot_init)
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 3/8] mm: Hot page tracking and promotion
2025-10-03 11:17 ` Jonathan Cameron
@ 2025-10-06 4:13 ` Bharata B Rao
0 siblings, 0 replies; 53+ messages in thread
From: Bharata B Rao @ 2025-10-06 4:13 UTC (permalink / raw)
To: mgorman
Cc: linux-kernel, linux-mm, dave.hansen, gourry, hannes, mingo,
peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy,
ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm,
david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore
On 03-Oct-25 4:47 PM, Jonathan Cameron wrote:
> On Wed, 10 Sep 2025 20:16:48 +0530
> Bharata B Rao <bharata@amd.com> wrote:
>> +
>> +struct max_heap {
>> + size_t nr;
>> + size_t size;
>> + struct pghot_info **data;
>> + DECLARE_FLEX_ARRAY(struct pghot_info *, preallocated);
>
> That macro is all about use in unions rather than generally being needed.
> Do you need that here rather than
> struct pg_hot_info *preallocated[];
>
> Can you add a __counted_by() marking?
I was using the DEFINE_MIN_HEAP macro earlier, which gave problems
when I had to define a per-node instance of the heap with the
same name. The workaround for that resulted in the use of the above
flex array.
It isn't needed; I will revert to using an array of pointers with
the __counted_by() marking.
>> +
>> + for (i = 0; i < hash_size; i++) {
>> + INIT_HLIST_HEAD(&phi_hash[i].hash);
>> + spin_lock_init(&phi_hash[i].lock);
>> + }
>> +
>> + phi_cache = KMEM_CACHE(pghot_info, 0);
>> + if (unlikely(!phi_cache)) {
>> + ret = -ENOMEM;
>> + goto out;
> Whilst not strictly necessary I'd add multiple labels so only things
> that have been allocated are freed rather than relying on them being
> NULL otherwise. Whilst not a correctness thing it makes it slightly
> easier to check tear down paths are correct.
In general I agree but for freeing with a loop exit, the current
method appeared much simpler.
I will take care of rest of the review comments.
Regards,
Bharata.
^ permalink raw reply [flat|nested] 53+ messages in thread
* [RFC PATCH v2 4/8] x86: ibs: In-kernel IBS driver for memory access profiling
2025-09-10 14:46 [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure Bharata B Rao
` (2 preceding siblings ...)
2025-09-10 14:46 ` [RFC PATCH v2 3/8] mm: Hot page tracking and promotion Bharata B Rao
@ 2025-09-10 14:46 ` Bharata B Rao
2025-10-03 12:19 ` Jonathan Cameron
2025-09-10 14:46 ` [RFC PATCH v2 5/8] x86: ibs: Enable IBS profiling for memory accesses Bharata B Rao
` (4 subsequent siblings)
8 siblings, 1 reply; 53+ messages in thread
From: Bharata B Rao @ 2025-09-10 14:46 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Jonathan.Cameron, dave.hansen, gourry, hannes, mgorman, mingo,
peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy,
ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm,
david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore, Bharata B Rao
Use the IBS (Instruction Based Sampling) feature present
in AMD processors for memory access tracking. The access
information obtained from IBS via NMI is fed to the kpromoted
daemon for further action.
In addition to a lot of other information related to the memory
access, IBS provides the physical (and virtual) address of the
access and indicates if the access came from a slower tier. Only
memory accesses originating from slower tiers are further acted
upon by this driver.
The samples are initially accumulated in percpu buffers which
are flushed to the pghot hot page tracking mechanism using irq_work.
TODO: Many counters are added to vmstat just as a debugging aid
for now.
About IBS
---------
IBS can be programmed to provide data about instruction
execution periodically. This is done by programming a desired
sample count (number of ops) in a control register. When the
programmed number of ops are dispatched, a micro-op gets tagged,
various information about the tagged micro-op's execution is
populated in IBS execution MSRs and an interrupt is raised.
While IBS provides a lot of data for each sample, for the
purpose of memory access profiling we are interested in the
linear and physical address of the memory access that reached
DRAM. Recent AMD processors provide further filtering where
it is possible to limit the sampling to those ops that had
an L3 miss, which greatly reduces the non-useful samples.
While IBS provides the capability to sample both instruction
fetch and execution, only IBS execution sampling is used here
to collect data about memory accesses that occur during
instruction execution.
More information about IBS is available in Sec 13.3 of
AMD64 Architecture Programmer's Manual, Volume 2:System
Programming which is present at:
https://bugzilla.kernel.org/attachment.cgi?id=288923
Information about MSRs used for programming IBS can be
found in Sec 2.1.14.4 of PPR Vol 1 for AMD Family 19h
Model 11h B1 which is currently present at:
https://www.amd.com/system/files/TechDocs/55901_0.25.zip
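For orientation only, a sketch of how the execution-sampling period is
typically programmed (IBS_OP_MAX_CNT, IBS_OP_ENABLE and
MSR_AMD64_IBSOPCTL are existing IBS definitions; the MSR write helper
and the 'period' variable are assumptions, not taken from this patch):
	/* MaxCnt is in units of 16 ops; enable IBS op sampling */
	u64 op_ctl = ((period >> 4) & IBS_OP_MAX_CNT) | IBS_OP_ENABLE;

	wrmsrl(MSR_AMD64_IBSOPCTL, op_ctl);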
Signed-off-by: Bharata B Rao <bharata@amd.com>
---
arch/x86/events/amd/ibs.c | 11 ++
arch/x86/include/asm/ibs.h | 7 +
arch/x86/include/asm/msr-index.h | 16 ++
arch/x86/mm/Makefile | 3 +-
arch/x86/mm/ibs.c | 311 +++++++++++++++++++++++++++++++
include/linux/vm_event_item.h | 17 ++
mm/vmstat.c | 17 ++
7 files changed, 381 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/include/asm/ibs.h
create mode 100644 arch/x86/mm/ibs.c
diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c
index 112f43b23ebf..1498dc9caeb2 100644
--- a/arch/x86/events/amd/ibs.c
+++ b/arch/x86/events/amd/ibs.c
@@ -13,9 +13,11 @@
#include <linux/ptrace.h>
#include <linux/syscore_ops.h>
#include <linux/sched/clock.h>
+#include <linux/pghot.h>
#include <asm/apic.h>
#include <asm/msr.h>
+#include <asm/ibs.h>
#include "../perf_event.h"
@@ -1756,6 +1758,15 @@ static __init int amd_ibs_init(void)
{
u32 caps;
+ /*
+ * TODO: Find a clean way to disable perf IBS so that IBS
+ * can be used for memory access profiling.
+ */
+ if (arch_hw_access_profiling) {
+ pr_info("IBS isn't available for perf use\n");
+ return 0;
+ }
+
caps = __get_ibs_caps();
if (!caps)
return -ENODEV; /* ibs not supported by the cpu */
diff --git a/arch/x86/include/asm/ibs.h b/arch/x86/include/asm/ibs.h
new file mode 100644
index 000000000000..b5a4f2ca6330
--- /dev/null
+++ b/arch/x86/include/asm/ibs.h
@@ -0,0 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_IBS_H
+#define _ASM_X86_IBS_H
+
+extern bool arch_hw_access_profiling;
+
+#endif /* _ASM_X86_IBS_H */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index b65c3ba5fa14..55d26380550c 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -742,6 +742,22 @@
/* AMD Last Branch Record MSRs */
#define MSR_AMD64_LBR_SELECT 0xc000010e
+/* AMD IBS MSR bits */
+#define MSR_AMD64_IBSOPDATA2_DATASRC 0x7
+#define MSR_AMD64_IBSOPDATA2_DATASRC_LCL_CACHE 0x1
+#define MSR_AMD64_IBSOPDATA2_DATASRC_PEER_CACHE_NEAR 0x2
+#define MSR_AMD64_IBSOPDATA2_DATASRC_DRAM 0x3
+#define MSR_AMD64_IBSOPDATA2_DATASRC_FAR_CCX_CACHE 0x5
+#define MSR_AMD64_IBSOPDATA2_DATASRC_EXT_MEM 0x8
+#define MSR_AMD64_IBSOPDATA2_RMTNODE 0x10
+
+#define MSR_AMD64_IBSOPDATA3_LDOP BIT_ULL(0)
+#define MSR_AMD64_IBSOPDATA3_STOP BIT_ULL(1)
+#define MSR_AMD64_IBSOPDATA3_DCMISS BIT_ULL(7)
+#define MSR_AMD64_IBSOPDATA3_LADDR_VALID BIT_ULL(17)
+#define MSR_AMD64_IBSOPDATA3_PADDR_VALID BIT_ULL(18)
+#define MSR_AMD64_IBSOPDATA3_L2MISS BIT_ULL(20)
+
/* Zen4 */
#define MSR_ZEN4_BP_CFG 0xc001102e
#define MSR_ZEN4_BP_CFG_BP_SPEC_REDUCE_BIT 4
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5b9908f13dcf..967e5af9eba9 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -22,7 +22,8 @@ CFLAGS_REMOVE_pgprot.o = -pg
endif
obj-y := init.o init_$(BITS).o fault.o ioremap.o extable.o mmap.o \
- pgtable.o physaddr.o tlb.o cpu_entry_area.o maccess.o pgprot.o
+ pgtable.o physaddr.o tlb.o cpu_entry_area.o maccess.o pgprot.o \
+ ibs.o
obj-y += pat/
diff --git a/arch/x86/mm/ibs.c b/arch/x86/mm/ibs.c
new file mode 100644
index 000000000000..6669710dd35b
--- /dev/null
+++ b/arch/x86/mm/ibs.c
@@ -0,0 +1,311 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/init.h>
+#include <linux/pghot.h>
+#include <linux/percpu.h>
+#include <linux/workqueue.h>
+#include <linux/irq_work.h>
+
+#include <asm/nmi.h>
+#include <asm/perf_event.h> /* TODO: Move defns like IBS_OP_ENABLE into non-perf header */
+#include <asm/apic.h>
+#include <asm/ibs.h>
+
+bool arch_hw_access_profiling;
+static u64 ibs_config __read_mostly;
+static u32 ibs_caps;
+
+#define IBS_NR_SAMPLES 150
+
+/*
+ * Basic access info captured for each memory access.
+ */
+struct ibs_sample {
+ unsigned long pfn;
+ unsigned long time; /* jiffies when accessed */
+ int nid; /* Accessing node ID, if known */
+};
+
+/*
+ * Percpu buffer of access samples. Samples are accumulated here
+ * before pushing them to kpromoted for further action.
+ */
+struct ibs_sample_pcpu {
+ struct ibs_sample samples[IBS_NR_SAMPLES];
+ int head, tail;
+};
+
+struct ibs_sample_pcpu __percpu *ibs_s;
+
+/*
+ * The workqueue for pushing the percpu access samples to kpromoted.
+ */
+static struct work_struct ibs_work;
+static struct irq_work ibs_irq_work;
+
+/*
+ * Record the IBS-reported access sample in percpu buffer.
+ * Called from IBS NMI handler.
+ */
+static int ibs_push_sample(unsigned long pfn, int nid, unsigned long time)
+{
+ struct ibs_sample_pcpu *ibs_pcpu = raw_cpu_ptr(ibs_s);
+ int next = ibs_pcpu->head + 1;
+
+ if (next >= IBS_NR_SAMPLES)
+ next = 0;
+
+ if (next == ibs_pcpu->tail)
+ return 0;
+
+ ibs_pcpu->samples[ibs_pcpu->head].pfn = pfn;
+ ibs_pcpu->samples[ibs_pcpu->head].time = time;
+ ibs_pcpu->head = next;
+ return 1;
+}
+
+static int ibs_pop_sample(struct ibs_sample *s)
+{
+ struct ibs_sample_pcpu *ibs_pcpu = raw_cpu_ptr(ibs_s);
+
+ int next = ibs_pcpu->tail + 1;
+
+ if (ibs_pcpu->head == ibs_pcpu->tail)
+ return 0;
+
+ if (next >= IBS_NR_SAMPLES)
+ next = 0;
+
+ *s = ibs_pcpu->samples[ibs_pcpu->tail];
+ ibs_pcpu->tail = next;
+ return 1;
+}
+
+/*
+ * Remove access samples from percpu buffer and send them
+ * to kpromoted for further action.
+ */
+static void ibs_work_handler(struct work_struct *work)
+{
+ struct ibs_sample s;
+
+ while (ibs_pop_sample(&s))
+ pghot_record_access(s.pfn, s.nid, PGHOT_HW_HINTS, s.time);
+}
+
+static void ibs_irq_handler(struct irq_work *i)
+{
+ schedule_work_on(smp_processor_id(), &ibs_work);
+}
+
+/*
+ * IBS NMI handler: Process the memory access info reported by IBS.
+ *
+ * Reads the MSRs to collect all the information about the reported
+ * memory access, validates the access, stores the valid sample and
+ * schedules the work on this CPU to further process the sample.
+ */
+static int ibs_overflow_handler(unsigned int cmd, struct pt_regs *regs)
+{
+ struct mm_struct *mm = current->mm;
+ u64 ops_ctl, ops_data3, ops_data2;
+ u64 laddr = -1, paddr = -1;
+ u64 data_src, rmt_node;
+ struct page *page;
+ unsigned long pfn;
+
+ rdmsrl(MSR_AMD64_IBSOPCTL, ops_ctl);
+
+ /*
+ * When IBS sampling period is reprogrammed via read-modify-update
+ * of MSR_AMD64_IBSOPCTL, overflow NMIs could be generated with
+ * IBS_OP_ENABLE not set. For such cases, return as HANDLED.
+ *
+ * With this, the handler will say "handled" for all NMIs that
+ * aren't related to this NMI. This stems from the limitation of
+ * having both status and control bits in one MSR.
+ */
+ if (!(ops_ctl & IBS_OP_VAL))
+ goto handled;
+
+ wrmsrl(MSR_AMD64_IBSOPCTL, ops_ctl & ~IBS_OP_VAL);
+
+ count_vm_event(HWHINT_NR_EVENTS);
+
+ if (!user_mode(regs)) {
+ count_vm_event(HWHINT_KERNEL);
+ goto handled;
+ }
+
+ if (!mm) {
+ count_vm_event(HWHINT_KTHREAD);
+ goto handled;
+ }
+
+ rdmsrl(MSR_AMD64_IBSOPDATA3, ops_data3);
+
+ /* Load/Store ops only */
+ /* TODO: DataSrc isn't valid for stores, so filter out stores? */
+ if (!(ops_data3 & (MSR_AMD64_IBSOPDATA3_LDOP |
+ MSR_AMD64_IBSOPDATA3_STOP))) {
+ count_vm_event(HWHINT_NON_LOAD_STORES);
+ goto handled;
+ }
+
+ /* Discard the sample if it was L1 or L2 hit */
+ if (!(ops_data3 & (MSR_AMD64_IBSOPDATA3_DCMISS |
+ MSR_AMD64_IBSOPDATA3_L2MISS))) {
+ count_vm_event(HWHINT_DC_L2_HITS);
+ goto handled;
+ }
+
+ rdmsrl(MSR_AMD64_IBSOPDATA2, ops_data2);
+ data_src = ops_data2 & MSR_AMD64_IBSOPDATA2_DATASRC;
+ if (ibs_caps & IBS_CAPS_ZEN4)
+ data_src |= ((ops_data2 & 0xC0) >> 3);
+
+ switch (data_src) {
+ case MSR_AMD64_IBSOPDATA2_DATASRC_LCL_CACHE:
+ count_vm_event(HWHINT_LOCAL_L3L1L2);
+ break;
+ case MSR_AMD64_IBSOPDATA2_DATASRC_PEER_CACHE_NEAR:
+ count_vm_event(HWHINT_LOCAL_PEER_CACHE_NEAR);
+ break;
+ case MSR_AMD64_IBSOPDATA2_DATASRC_DRAM:
+ count_vm_event(HWHINT_DRAM_ACCESSES);
+ break;
+ case MSR_AMD64_IBSOPDATA2_DATASRC_EXT_MEM:
+ count_vm_event(HWHINT_CXL_ACCESSES);
+ break;
+ case MSR_AMD64_IBSOPDATA2_DATASRC_FAR_CCX_CACHE:
+ count_vm_event(HWHINT_FAR_CACHE_HITS);
+ break;
+ }
+
+ rmt_node = ops_data2 & MSR_AMD64_IBSOPDATA2_RMTNODE;
+ if (rmt_node)
+ count_vm_event(HWHINT_REMOTE_NODE);
+
+ /* Is linear addr valid? */
+ if (ops_data3 & MSR_AMD64_IBSOPDATA3_LADDR_VALID)
+ rdmsrl(MSR_AMD64_IBSDCLINAD, laddr);
+ else {
+ count_vm_event(HWHINT_LADDR_INVALID);
+ goto handled;
+ }
+
+ /* Discard kernel address accesses */
+ if (laddr & (1UL << 63)) {
+ count_vm_event(HWHINT_KERNEL_ADDR);
+ goto handled;
+ }
+
+ /* Is phys addr valid? */
+ if (ops_data3 & MSR_AMD64_IBSOPDATA3_PADDR_VALID)
+ rdmsrl(MSR_AMD64_IBSDCPHYSAD, paddr);
+ else {
+ count_vm_event(HWHINT_PADDR_INVALID);
+ goto handled;
+ }
+
+ pfn = PHYS_PFN(paddr);
+ page = pfn_to_online_page(pfn);
+ if (!page)
+ goto handled;
+
+ if (!PageLRU(page)) {
+ count_vm_event(HWHINT_NON_LRU);
+ goto handled;
+ }
+
+ if (!ibs_push_sample(pfn, numa_node_id(), jiffies)) {
+ count_vm_event(HWHINT_BUFFER_FULL);
+ goto handled;
+ }
+
+ irq_work_queue(&ibs_irq_work);
+ count_vm_event(HWHINT_USEFUL_SAMPLES);
+
+handled:
+ return NMI_HANDLED;
+}
+
+static inline int get_ibs_lvt_offset(void)
+{
+ u64 val;
+
+ rdmsrl(MSR_AMD64_IBSCTL, val);
+ if (!(val & IBSCTL_LVT_OFFSET_VALID))
+ return -EINVAL;
+
+ return val & IBSCTL_LVT_OFFSET_MASK;
+}
+
+static void setup_APIC_ibs(void)
+{
+ int offset;
+
+ offset = get_ibs_lvt_offset();
+ if (offset < 0)
+ goto failed;
+
+ if (!setup_APIC_eilvt(offset, 0, APIC_EILVT_MSG_NMI, 0))
+ return;
+failed:
+ pr_warn("IBS APIC setup failed on cpu #%d\n",
+ smp_processor_id());
+}
+
+static void clear_APIC_ibs(void)
+{
+ int offset;
+
+ offset = get_ibs_lvt_offset();
+ if (offset >= 0)
+ setup_APIC_eilvt(offset, 0, APIC_EILVT_MSG_FIX, 1);
+}
+
+static int x86_amd_ibs_access_profile_startup(unsigned int cpu)
+{
+ setup_APIC_ibs();
+ return 0;
+}
+
+static int x86_amd_ibs_access_profile_teardown(unsigned int cpu)
+{
+ clear_APIC_ibs();
+ return 0;
+}
+
+static int __init ibs_access_profiling_init(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_IBS)) {
+ pr_info("IBS capability is unavailable for access profiling\n");
+ return 0;
+ }
+
+ ibs_s = alloc_percpu_gfp(struct ibs_sample_pcpu, GFP_KERNEL | __GFP_ZERO);
+ if (!ibs_s)
+ return 0;
+
+ INIT_WORK(&ibs_work, ibs_work_handler);
+ init_irq_work(&ibs_irq_work, ibs_irq_handler);
+
+ /* Uses IBS Op sampling */
+ ibs_config = IBS_OP_CNT_CTL | IBS_OP_ENABLE;
+ ibs_caps = cpuid_eax(IBS_CPUID_FEATURES);
+ if (ibs_caps & IBS_CAPS_ZEN4)
+ ibs_config |= IBS_OP_L3MISSONLY;
+
+ register_nmi_handler(NMI_LOCAL, ibs_overflow_handler, 0, "ibs");
+
+ cpuhp_setup_state(CPUHP_AP_PERF_X86_AMD_IBS_STARTING,
+ "x86/amd/ibs_access_profile:starting",
+ x86_amd_ibs_access_profile_startup,
+ x86_amd_ibs_access_profile_teardown);
+
+ pr_info("IBS setup for memory access profiling\n");
+ return 0;
+}
+
+arch_initcall(ibs_access_profiling_init);
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index a996fa9df785..bca57b05766d 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -195,6 +195,23 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
KPROMOTED_RIGHT_NODE,
KPROMOTED_NON_LRU,
KPROMOTED_DROPPED,
+ HWHINT_NR_EVENTS,
+ HWHINT_KERNEL,
+ HWHINT_KTHREAD,
+ HWHINT_NON_LOAD_STORES,
+ HWHINT_DC_L2_HITS,
+ HWHINT_LOCAL_L3L1L2,
+ HWHINT_LOCAL_PEER_CACHE_NEAR,
+ HWHINT_FAR_CACHE_HITS,
+ HWHINT_DRAM_ACCESSES,
+ HWHINT_CXL_ACCESSES,
+ HWHINT_REMOTE_NODE,
+ HWHINT_LADDR_INVALID,
+ HWHINT_KERNEL_ADDR,
+ HWHINT_PADDR_INVALID,
+ HWHINT_NON_LRU,
+ HWHINT_BUFFER_FULL,
+ HWHINT_USEFUL_SAMPLES,
NR_VM_EVENT_ITEMS
};
diff --git a/mm/vmstat.c b/mm/vmstat.c
index ee122c2cd137..aa743708c79b 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1503,6 +1503,23 @@ const char * const vmstat_text[] = {
[I(KPROMOTED_RIGHT_NODE)] = "kpromoted_right_node",
[I(KPROMOTED_NON_LRU)] = "kpromoted_non_lru",
[I(KPROMOTED_DROPPED)] = "kpromoted_dropped",
+ [I(HWHINT_NR_EVENTS)] = "hwhint_nr_events",
+ [I(HWHINT_KERNEL)] = "hwhint_kernel",
+ [I(HWHINT_KTHREAD)] = "hwhint_kthread",
+ [I(HWHINT_NON_LOAD_STORES)] = "hwhint_non_load_stores",
+ [I(HWHINT_DC_L2_HITS)] = "hwhint_dc_l2_hits",
+ [I(HWHINT_LOCAL_L3L1L2)] = "hwhint_local_l3l1l2",
+ [I(HWHINT_LOCAL_PEER_CACHE_NEAR)] = "hwhint_local_peer_cache_near",
+ [I(HWHINT_FAR_CACHE_HITS)] = "hwhint_far_cache_hits",
+ [I(HWHINT_DRAM_ACCESSES)] = "hwhint_dram_accesses",
+ [I(HWHINT_CXL_ACCESSES)] = "hwhint_cxl_accesses",
+ [I(HWHINT_REMOTE_NODE)] = "hwhint_remote_node",
+ [I(HWHINT_LADDR_INVALID)] = "hwhint_invalid_laddr",
+ [I(HWHINT_KERNEL_ADDR)] = "hwhint_kernel_addr",
+ [I(HWHINT_PADDR_INVALID)] = "hwhint_invalid_paddr",
+ [I(HWHINT_NON_LRU)] = "hwhint_non_lru",
+ [I(HWHINT_BUFFER_FULL)] = "hwhint_buffer_full",
+ [I(HWHINT_USEFUL_SAMPLES)] = "hwhint_useful_samples",
#undef I
#endif /* CONFIG_VM_EVENT_COUNTERS */
};
--
2.34.1
^ permalink raw reply [flat|nested] 53+ messages in thread* Re: [RFC PATCH v2 4/8] x86: ibs: In-kernel IBS driver for memory access profiling
2025-09-10 14:46 ` [RFC PATCH v2 4/8] x86: ibs: In-kernel IBS driver for memory access profiling Bharata B Rao
@ 2025-10-03 12:19 ` Jonathan Cameron
2025-10-06 4:28 ` Bharata B Rao
0 siblings, 1 reply; 53+ messages in thread
From: Jonathan Cameron @ 2025-10-03 12:19 UTC (permalink / raw)
To: Bharata B Rao
Cc: linux-kernel, linux-mm, dave.hansen, gourry, hannes, mgorman,
mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc,
willy, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis,
akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu,
balbirs, alok.rathore
On Wed, 10 Sep 2025 20:16:49 +0530
Bharata B Rao <bharata@amd.com> wrote:
> Use IBS (Instruction Based Sampling) feature present
> in AMD processors for memory access tracking. The access
> information obtained from IBS via NMI is fed to kpromoted
> daemon for further action.
>
> In addition to many other information related to the memory
> access, IBS provides physical (and virtual) address of the access
> and indicates if the access came from slower tier. Only memory
> accesses originating from slower tiers are further acted upon
> by this driver.
>
> The samples are initially accumulated in percpu buffers which
> are flushed to pghot hot page tracking mechanism using irq_work.
>
> TODO: Many counters are added to vmstat just as debugging aid
> for now.
>
> About IBS
> ---------
> IBS can be programmed to provide data about instruction
> execution periodically. This is done by programming a desired
> sample count (number of ops) in a control register. When the
> programmed number of ops are dispatched, a micro-op gets tagged,
> various information about the tagged micro-op's execution is
> populated in IBS execution MSRs and an interrupt is raised.
> While IBS provides a lot of data for each sample, for the
> purpose of memory access profiling, we are interested in
> linear and physical address of the memory access that reached
> DRAM. Recent AMD processors provide further filtering where
> it is possible to limit the sampling to those ops that had
> an L3 miss which greatly reduces the non-useful samples.
>
> While IBS provides capability to sample instruction fetch
> and execution, only IBS execution sampling is used here
> to collect data about memory accesses that occur during
> the instruction execution.
>
> More information about IBS is available in Sec 13.3 of
> AMD64 Architecture Programmer's Manual, Volume 2: System
> Programming which is present at:
> https://bugzilla.kernel.org/attachment.cgi?id=288923
>
> Information about MSRs used for programming IBS can be
> found in Sec 2.1.14.4 of PPR Vol 1 for AMD Family 19h
> Model 11h B1 which is currently present at:
> https://www.amd.com/system/files/TechDocs/55901_0.25.zip
>
> Signed-off-by: Bharata B Rao <bharata@amd.com>
> ---
> arch/x86/events/amd/ibs.c | 11 ++
> arch/x86/include/asm/ibs.h | 7 +
> arch/x86/include/asm/msr-index.h | 16 ++
> arch/x86/mm/Makefile | 3 +-
> arch/x86/mm/ibs.c | 311 +++++++++++++++++++++++++++++++
> include/linux/vm_event_item.h | 17 ++
> mm/vmstat.c | 17 ++
> 7 files changed, 381 insertions(+), 1 deletion(-)
> create mode 100644 arch/x86/include/asm/ibs.h
> create mode 100644 arch/x86/mm/ibs.c
>
> diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c
> index 112f43b23ebf..1498dc9caeb2 100644
> --- a/arch/x86/events/amd/ibs.c
> +++ b/arch/x86/events/amd/ibs.c
> @@ -13,9 +13,11 @@
> #include <linux/ptrace.h>
> #include <linux/syscore_ops.h>
> #include <linux/sched/clock.h>
> +#include <linux/pghot.h>
>
> #include <asm/apic.h>
> #include <asm/msr.h>
> +#include <asm/ibs.h>
>
> #include "../perf_event.h"
>
> @@ -1756,6 +1758,15 @@ static __init int amd_ibs_init(void)
> {
> u32 caps;
>
> + /*
> + * TODO: Find a clean way to disable perf IBS so that IBS
> + * can be used for memory access profiling.
Agreed on this being a key thing. This applies to quite a few
other sources of data so finding a generally acceptable solution to this
would be great. Davidlohr mentioned on the CXL sync that he has
something tackling this for the CHMU driver.
> + */
> + if (arch_hw_access_profiling) {
> + pr_info("IBS isn't available for perf use\n");
> + return 0;
> + }
> +
> caps = __get_ibs_caps();
> if (!caps)
> return -ENODEV; /* ibs not supported by the cpu */
> diff --git a/arch/x86/mm/ibs.c b/arch/x86/mm/ibs.c
> new file mode 100644
> index 000000000000..6669710dd35b
> --- /dev/null
> +++ b/arch/x86/mm/ibs.c
> @@ -0,0 +1,311 @@
...
> +
> +static int ibs_pop_sample(struct ibs_sample *s)
> +{
> + struct ibs_sample_pcpu *ibs_pcpu = raw_cpu_ptr(ibs_s);
> +
> + int next = ibs_pcpu->tail + 1;
> +
> + if (ibs_pcpu->head == ibs_pcpu->tail)
> + return 0;
> +
> + if (next >= IBS_NR_SAMPLES)
== seems more appropriate to me. If it's > then something went wrong
and we lost data.
> + next = 0;
> +
> + *s = ibs_pcpu->samples[ibs_pcpu->tail];
> + ibs_pcpu->tail = next;
> + return 1;
> +}
> +static void setup_APIC_ibs(void)
> +{
> + int offset;
> +
> + offset = get_ibs_lvt_offset();
> + if (offset < 0)
> + goto failed;
> +
> + if (!setup_APIC_eilvt(offset, 0, APIC_EILVT_MSG_NMI, 0))
> + return;
> +failed:
> + pr_warn("IBS APIC setup failed on cpu #%d\n",
> + smp_processor_id());
Unless this is going to get more complex, move that up to the if () block
above and return directly there.
> +}
> +static int __init ibs_access_profiling_init(void)
> +{
> + if (!boot_cpu_has(X86_FEATURE_IBS)) {
> + pr_info("IBS capability is unavailable for access profiling\n");
> + return 0;
> + }
> +
> + ibs_s = alloc_percpu_gfp(struct ibs_sample_pcpu, GFP_KERNEL | __GFP_ZERO);
sizeof(*ibs_s).
Same as in other cases. It's nice to avoid having to check types when reviewing code.
> + if (!ibs_s)
> + return 0;
> +
> + INIT_WORK(&ibs_work, ibs_work_handler);
> + init_irq_work(&ibs_irq_work, ibs_irq_handler);
> +
> + /* Uses IBS Op sampling */
> + ibs_config = IBS_OP_CNT_CTL | IBS_OP_ENABLE;
> + ibs_caps = cpuid_eax(IBS_CPUID_FEATURES);
> + if (ibs_caps & IBS_CAPS_ZEN4)
> + ibs_config |= IBS_OP_L3MISSONLY;
ibs_config seems to only be used locally so the global seems unnecessary.
You'll need to pass it in to the one user in the next patch though.
> +
> + register_nmi_handler(NMI_LOCAL, ibs_overflow_handler, 0, "ibs");
> +
> + cpuhp_setup_state(CPUHP_AP_PERF_X86_AMD_IBS_STARTING,
> + "x86/amd/ibs_access_profile:starting",
> + x86_amd_ibs_access_profile_startup,
> + x86_amd_ibs_access_profile_teardown);
> +
> + pr_info("IBS setup for memory access profiling\n");
> + return 0;
> +}
> +
> +arch_initcall(ibs_access_profiling_init);
^ permalink raw reply [flat|nested] 53+ messages in thread* Re: [RFC PATCH v2 4/8] x86: ibs: In-kernel IBS driver for memory access profiling
2025-10-03 12:19 ` Jonathan Cameron
@ 2025-10-06 4:28 ` Bharata B Rao
0 siblings, 0 replies; 53+ messages in thread
From: Bharata B Rao @ 2025-10-06 4:28 UTC (permalink / raw)
To: Jonathan Cameron
Cc: linux-kernel, linux-mm, dave.hansen, gourry, hannes, mgorman,
mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc,
willy, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis,
akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu,
balbirs, alok.rathore
On 03-Oct-25 5:49 PM, Jonathan Cameron wrote:
> On Wed, 10 Sep 2025 20:16:49 +0530
> Bharata B Rao <bharata@amd.com> wrote:
>>
>> @@ -1756,6 +1758,15 @@ static __init int amd_ibs_init(void)
>> {
>> u32 caps;
>>
>> + /*
>> + * TODO: Find a clean way to disable perf IBS so that IBS
>> + * can be used for memory access profiling.
>
> Agreed on this being a key thing. This applies to quite a few
> other sources of data so finding a generally acceptable solution to this
> would be great. Davidlohr mentioned on the CXL sync that he has
> something tackling this for the CHMU driver.
Okay, will wait to check that.
>
>
>> + */
>> + if (arch_hw_access_profiling) {
>> + pr_info("IBS isn't available for perf use\n");
>> + return 0;
>> + }
>> +
>> caps = __get_ibs_caps();
>> if (!caps)
>> return -ENODEV; /* ibs not supported by the cpu */
>
>> diff --git a/arch/x86/mm/ibs.c b/arch/x86/mm/ibs.c
>> new file mode 100644
>> index 000000000000..6669710dd35b
>> --- /dev/null
>> +++ b/arch/x86/mm/ibs.c
>> @@ -0,0 +1,311 @@
>
> ...
>
>> +
>> +static int ibs_pop_sample(struct ibs_sample *s)
>> +{
>> + struct ibs_sample_pcpu *ibs_pcpu = raw_cpu_ptr(ibs_s);
>> +
>> + int next = ibs_pcpu->tail + 1;
>> +
>> + if (ibs_pcpu->head == ibs_pcpu->tail)
>> + return 0;
>> +
>> + if (next >= IBS_NR_SAMPLES)
>
> == seems more appropriate to me. If it's > then something went wrong
> and we lost data.
Makes sense, will try that.
>
>> + next = 0;
>> +
>> + *s = ibs_pcpu->samples[ibs_pcpu->tail];
>> + ibs_pcpu->tail = next;
>> + return 1;
>> +}
>
>
>> +static void setup_APIC_ibs(void)
>> +{
>> + int offset;
>> +
>> + offset = get_ibs_lvt_offset();
>> + if (offset < 0)
>> + goto failed;
>> +
>> + if (!setup_APIC_eilvt(offset, 0, APIC_EILVT_MSG_NMI, 0))
>> + return;
>> +failed:
>> + pr_warn("IBS APIC setup failed on cpu #%d\n",
>> + smp_processor_id());
>
> Unless this is going to get more complex, move that up to the if () block
> above and return directly there.
I want to print a warning for both cases: when the LVT offset couldn't
be obtained and also when the LVT entry couldn't be set up.
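Something along these lines (untested sketch, message wording just
illustrative) should keep both warnings while still returning directly
from the first check:

static void setup_APIC_ibs(void)
{
	int offset;

	offset = get_ibs_lvt_offset();
	if (offset < 0) {
		pr_warn("IBS LVT offset invalid on cpu #%d\n",
			smp_processor_id());
		return;
	}

	if (setup_APIC_eilvt(offset, 0, APIC_EILVT_MSG_NMI, 0))
		pr_warn("IBS LVT entry setup failed on cpu #%d\n",
			smp_processor_id());
}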
Regards,
Bharata.
^ permalink raw reply [flat|nested] 53+ messages in thread
* [RFC PATCH v2 5/8] x86: ibs: Enable IBS profiling for memory accesses
2025-09-10 14:46 [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure Bharata B Rao
` (3 preceding siblings ...)
2025-09-10 14:46 ` [RFC PATCH v2 4/8] x86: ibs: In-kernel IBS driver for memory access profiling Bharata B Rao
@ 2025-09-10 14:46 ` Bharata B Rao
2025-10-03 12:22 ` Jonathan Cameron
2025-09-10 14:46 ` [RFC PATCH v2 6/8] mm: mglru: generalize page table walk Bharata B Rao
` (3 subsequent siblings)
8 siblings, 1 reply; 53+ messages in thread
From: Bharata B Rao @ 2025-09-10 14:46 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Jonathan.Cameron, dave.hansen, gourry, hannes, mgorman, mingo,
peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy,
ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm,
david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore, Bharata B Rao
Enable IBS memory access data collection for user memory
accesses by programming the required MSRs. The profiling
is turned ON only for user mode execution and turned OFF
for kernel mode execution. Profiling is explicitly disabled
for NMI handler too.
TODOs:
- IBS sampling rate is kept fixed for now.
- Arch/vendor separation/isolation of the code needs relook.
Signed-off-by: Bharata B Rao <bharata@amd.com>
---
arch/x86/include/asm/entry-common.h | 3 +++
arch/x86/include/asm/hardirq.h | 2 ++
arch/x86/include/asm/ibs.h | 2 ++
arch/x86/mm/ibs.c | 32 +++++++++++++++++++++++++++++
4 files changed, 39 insertions(+)
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index d535a97c7284..7144b57d209b 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -9,10 +9,12 @@
#include <asm/io_bitmap.h>
#include <asm/fpu/api.h>
#include <asm/fred.h>
+#include <asm/ibs.h>
/* Check that the stack and regs on entry from user mode are sane. */
static __always_inline void arch_enter_from_user_mode(struct pt_regs *regs)
{
+ hw_access_profiling_stop();
if (IS_ENABLED(CONFIG_DEBUG_ENTRY)) {
/*
* Make sure that the entry code gave us a sensible EFLAGS
@@ -99,6 +101,7 @@ static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
static __always_inline void arch_exit_to_user_mode(void)
{
amd_clear_divider();
+ hw_access_profiling_start();
}
#define arch_exit_to_user_mode arch_exit_to_user_mode
diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index f00c09ffe6a9..0752cb6ebd7a 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -91,4 +91,6 @@ static __always_inline bool kvm_get_cpu_l1tf_flush_l1d(void)
static __always_inline void kvm_set_cpu_l1tf_flush_l1d(void) { }
#endif /* IS_ENABLED(CONFIG_KVM_INTEL) */
+#define arch_nmi_enter() hw_access_profiling_stop()
+#define arch_nmi_exit() hw_access_profiling_start()
#endif /* _ASM_X86_HARDIRQ_H */
diff --git a/arch/x86/include/asm/ibs.h b/arch/x86/include/asm/ibs.h
index b5a4f2ca6330..6b480958534e 100644
--- a/arch/x86/include/asm/ibs.h
+++ b/arch/x86/include/asm/ibs.h
@@ -2,6 +2,8 @@
#ifndef _ASM_X86_IBS_H
#define _ASM_X86_IBS_H
+void hw_access_profiling_start(void);
+void hw_access_profiling_stop(void);
extern bool arch_hw_access_profiling;
#endif /* _ASM_X86_IBS_H */
diff --git a/arch/x86/mm/ibs.c b/arch/x86/mm/ibs.c
index 6669710dd35b..3128e8fa5f39 100644
--- a/arch/x86/mm/ibs.c
+++ b/arch/x86/mm/ibs.c
@@ -16,6 +16,7 @@ static u64 ibs_config __read_mostly;
static u32 ibs_caps;
#define IBS_NR_SAMPLES 150
+#define IBS_SAMPLE_PERIOD 10000
/*
* Basic access info captured for each memory access.
@@ -98,6 +99,36 @@ static void ibs_irq_handler(struct irq_work *i)
schedule_work_on(smp_processor_id(), &ibs_work);
}
+void hw_access_profiling_stop(void)
+{
+ u64 ops_ctl;
+
+ if (!arch_hw_access_profiling)
+ return;
+
+ rdmsrl(MSR_AMD64_IBSOPCTL, ops_ctl);
+ wrmsrl(MSR_AMD64_IBSOPCTL, ops_ctl & ~IBS_OP_ENABLE);
+}
+
+void hw_access_profiling_start(void)
+{
+ u64 config = 0;
+ unsigned int period = IBS_SAMPLE_PERIOD;
+
+ if (!arch_hw_access_profiling)
+ return;
+
+ /* Disable IBS for kernel thread */
+ if (!current->mm)
+ goto out;
+
+ config = (period >> 4) & IBS_OP_MAX_CNT;
+ config |= (period & IBS_OP_MAX_CNT_EXT_MASK);
+ config |= ibs_config;
+out:
+ wrmsrl(MSR_AMD64_IBSOPCTL, config);
+}
+
/*
* IBS NMI handler: Process the memory access info reported by IBS.
*
@@ -304,6 +335,7 @@ static int __init ibs_access_profiling_init(void)
x86_amd_ibs_access_profile_startup,
x86_amd_ibs_access_profile_teardown);
+ arch_hw_access_profiling = true;
pr_info("IBS setup for memory access profiling\n");
return 0;
}
--
2.34.1
^ permalink raw reply [flat|nested] 53+ messages in thread* Re: [RFC PATCH v2 5/8] x86: ibs: Enable IBS profiling for memory accesses
2025-09-10 14:46 ` [RFC PATCH v2 5/8] x86: ibs: Enable IBS profiling for memory accesses Bharata B Rao
@ 2025-10-03 12:22 ` Jonathan Cameron
0 siblings, 0 replies; 53+ messages in thread
From: Jonathan Cameron @ 2025-10-03 12:22 UTC (permalink / raw)
To: Bharata B Rao
Cc: linux-kernel, linux-mm, dave.hansen, gourry, hannes, mgorman,
mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc,
willy, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis,
akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu,
balbirs, alok.rathore
On Wed, 10 Sep 2025 20:16:50 +0530
Bharata B Rao <bharata@amd.com> wrote:
> Enable IBS memory access data collection for user memory
> accesses by programming the required MSRs. The profiling
> is turned ON only for user mode execution and turned OFF
> for kernel mode execution. Profiling is explicitly disabled
> for NMI handler too.
>
> TODOs:
>
> - IBS sampling rate is kept fixed for now.
> - Arch/vendor separation/isolation of the code needs relook.
>
> Signed-off-by: Bharata B Rao <bharata@amd.com>
One "oops I misread it" wrt to review of previous patch.
+ a really trivial thing.
J
> ---
> arch/x86/include/asm/entry-common.h | 3 +++
> arch/x86/include/asm/hardirq.h | 2 ++
> arch/x86/include/asm/ibs.h | 2 ++
> arch/x86/mm/ibs.c | 32 +++++++++++++++++++++++++++++
> 4 files changed, 39 insertions(+)
> diff --git a/arch/x86/mm/ibs.c b/arch/x86/mm/ibs.c
> index 6669710dd35b..3128e8fa5f39 100644
> --- a/arch/x86/mm/ibs.c
> +++ b/arch/x86/mm/ibs.c
> @@ -16,6 +16,7 @@ static u64 ibs_config __read_mostly;
> static u32 ibs_caps;
>
> #define IBS_NR_SAMPLES 150
> +#define IBS_SAMPLE_PERIOD 10000
>
> /*
> * Basic access info captured for each memory access.
> @@ -98,6 +99,36 @@ static void ibs_irq_handler(struct irq_work *i)
> schedule_work_on(smp_processor_id(), &ibs_work);
> }
>
> +void hw_access_profiling_stop(void)
> +{
> + u64 ops_ctl;
> +
> + if (!arch_hw_access_profiling)
> + return;
> +
> + rdmsrl(MSR_AMD64_IBSOPCTL, ops_ctl);
> + wrmsrl(MSR_AMD64_IBSOPCTL, ops_ctl & ~IBS_OP_ENABLE);
> +}
> +
> +void hw_access_profiling_start(void)
> +{
> + u64 config = 0;
> + unsigned int period = IBS_SAMPLE_PERIOD;
> +
> + if (!arch_hw_access_profiling)
> + return;
> +
> + /* Disable IBS for kernel thread */
> + if (!current->mm)
> + goto out;
> +
> + config = (period >> 4) & IBS_OP_MAX_CNT;
There's a stray extra space before the & though; that can go.
> + config |= (period & IBS_OP_MAX_CNT_EXT_MASK);
> + config |= ibs_config;
Ah. Ignore comment in previous patch on this not being global. Clearly it needs
to be. Oops I misread this earlier.
> +out:
> + wrmsrl(MSR_AMD64_IBSOPCTL, config);
> +}
> +
> /*
> * IBS NMI handler: Process the memory access info reported by IBS.
> *
> @@ -304,6 +335,7 @@ static int __init ibs_access_profiling_init(void)
> x86_amd_ibs_access_profile_startup,
> x86_amd_ibs_access_profile_teardown);
>
> + arch_hw_access_profiling = true;
> pr_info("IBS setup for memory access profiling\n");
> return 0;
> }
^ permalink raw reply [flat|nested] 53+ messages in thread
* [RFC PATCH v2 6/8] mm: mglru: generalize page table walk
2025-09-10 14:46 [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure Bharata B Rao
` (4 preceding siblings ...)
2025-09-10 14:46 ` [RFC PATCH v2 5/8] x86: ibs: Enable IBS profiling for memory accesses Bharata B Rao
@ 2025-09-10 14:46 ` Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 7/8] mm: klruscand: use mglru scanning for page promotion Bharata B Rao
` (2 subsequent siblings)
8 siblings, 0 replies; 53+ messages in thread
From: Bharata B Rao @ 2025-09-10 14:46 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Jonathan.Cameron, dave.hansen, gourry, hannes, mgorman, mingo,
peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy,
ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm,
david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore, Bharata B Rao
From: Kinsey Ho <kinseyho@google.com>
Refactor the existing MGLRU page table walking logic to make it
resumable.
Additionally, introduce two hooks into the MGLRU page table walk:
accessed callback and flush callback. The accessed callback is called
for each accessed page detected via the scanned accessed bit. The flush
callback is called when the accessed callback reports an out of space
error. This allows for processing pages in batches for efficiency.
With a generalised page table walk, introduce a new scan function which
repeatedly scans on the same young generation and does not add a new
young generation.
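As an illustration of the intended usage, a minimal sketch (the
callback and batch names here are placeholders; a later patch in this
series wires this up for real in klruscand):

static unsigned long pfns[64];
static int nr_pfns;

/* Called for each young pte/pmd; a non-zero return pauses the walk */
static int accessed_cb(unsigned long pfn)
{
	if (nr_pfns >= 64)
		return -EAGAIN;
	pfns[nr_pfns++] = pfn;
	return 0;
}

/* Called after every walk attempt so the batch can be drained */
static void flush_cb(void)
{
	/* consume pfns[0..nr_pfns) here */
	nr_pfns = 0;
}

	/* with current->reclaim_state->mm_walk set up by the caller */
	lru_gen_scan_lruvec(lruvec, READ_ONCE(lruvec->lrugen.max_seq),
			    accessed_cb, flush_cb);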
Signed-off-by: Kinsey Ho <kinseyho@google.com>
Signed-off-by: Yuanchu Xie <yuanchu@google.com>
Signed-off-by: Bharata B Rao <bharata@amd.com>
---
include/linux/mmzone.h | 5 ++
mm/internal.h | 4 +
mm/vmscan.c | 176 ++++++++++++++++++++++++++++++-----------
3 files changed, 139 insertions(+), 46 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f7094babed10..4ad15490aff6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -533,6 +533,8 @@ struct lru_gen_mm_walk {
unsigned long seq;
/* the next address within an mm to scan */
unsigned long next_addr;
+ /* called for each accessed pte/pmd */
+ int (*accessed_cb)(unsigned long pfn);
/* to batch promoted pages */
int nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
/* to batch the mm stats */
@@ -540,6 +542,9 @@ struct lru_gen_mm_walk {
/* total batched items */
int batched;
int swappiness;
+ /* for the pmd under scanning */
+ int nr_young_pte;
+ int nr_total_pte;
bool force_scan;
};
diff --git a/mm/internal.h b/mm/internal.h
index 45b725c3dc03..6c2c86abfde2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -548,6 +548,10 @@ static inline int user_proactive_reclaim(char *buf,
return 0;
}
#endif
+void set_task_reclaim_state(struct task_struct *task,
+ struct reclaim_state *rs);
+void lru_gen_scan_lruvec(struct lruvec *lruvec, unsigned long seq,
+ int (*accessed_cb)(unsigned long), void (*flush_cb)(void));
/*
* in mm/rmap.c:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7de11524a936..4146e17f90ae 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -289,7 +289,7 @@ static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
continue; \
else
-static void set_task_reclaim_state(struct task_struct *task,
+void set_task_reclaim_state(struct task_struct *task,
struct reclaim_state *rs)
{
/* Check for an overwrite */
@@ -3092,7 +3092,7 @@ static bool iterate_mm_list(struct lru_gen_mm_walk *walk, struct mm_struct **ite
VM_WARN_ON_ONCE(mm_state->seq + 1 < walk->seq);
- if (walk->seq <= mm_state->seq)
+ if (!walk->accessed_cb && walk->seq <= mm_state->seq)
goto done;
if (!mm_state->head)
@@ -3518,16 +3518,14 @@ static void walk_update_folio(struct lru_gen_mm_walk *walk, struct folio *folio,
}
}
-static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
- struct mm_walk *args)
+static int walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
+ struct mm_walk *args, bool *suitable)
{
- int i;
+ int i, err = 0;
bool dirty;
pte_t *pte;
spinlock_t *ptl;
unsigned long addr;
- int total = 0;
- int young = 0;
struct folio *last = NULL;
struct lru_gen_mm_walk *walk = args->private;
struct mem_cgroup *memcg = lruvec_memcg(walk->lruvec);
@@ -3537,17 +3535,21 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
pmd_t pmdval;
pte = pte_offset_map_rw_nolock(args->mm, pmd, start & PMD_MASK, &pmdval, &ptl);
- if (!pte)
- return false;
+ if (!pte) {
+ *suitable = false;
+ return 0;
+ }
if (!spin_trylock(ptl)) {
pte_unmap(pte);
- return true;
+ *suitable = true;
+ return 0;
}
if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
pte_unmap_unlock(pte, ptl);
- return false;
+ *suitable = false;
+ return 0;
}
arch_enter_lazy_mmu_mode();
@@ -3557,7 +3559,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
struct folio *folio;
pte_t ptent = ptep_get(pte + i);
- total++;
+ walk->nr_total_pte++;
walk->mm_stats[MM_LEAF_TOTAL]++;
pfn = get_pte_pfn(ptent, args->vma, addr, pgdat);
@@ -3581,23 +3583,34 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
if (pte_dirty(ptent))
dirty = true;
- young++;
+ walk->nr_young_pte++;
walk->mm_stats[MM_LEAF_YOUNG]++;
+
+ if (!walk->accessed_cb)
+ continue;
+
+ err = walk->accessed_cb(pfn);
+ if (err) {
+ walk->next_addr = addr + PAGE_SIZE;
+ break;
+ }
}
walk_update_folio(walk, last, gen, dirty);
last = NULL;
- if (i < PTRS_PER_PTE && get_next_vma(PMD_MASK, PAGE_SIZE, args, &start, &end))
+ if (!err && i < PTRS_PER_PTE &&
+ get_next_vma(PMD_MASK, PAGE_SIZE, args, &start, &end))
goto restart;
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte, ptl);
- return suitable_to_scan(total, young);
+ *suitable = suitable_to_scan(walk->nr_total_pte, walk->nr_young_pte);
+ return err;
}
-static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area_struct *vma,
+static int walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area_struct *vma,
struct mm_walk *args, unsigned long *bitmap, unsigned long *first)
{
int i;
@@ -3610,6 +3623,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
DEFINE_MAX_SEQ(walk->lruvec);
int gen = lru_gen_from_seq(max_seq);
+ int err = 0;
VM_WARN_ON_ONCE(pud_leaf(*pud));
@@ -3617,13 +3631,13 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
if (*first == -1) {
*first = addr;
bitmap_zero(bitmap, MIN_LRU_BATCH);
- return;
+ return 0;
}
i = addr == -1 ? 0 : pmd_index(addr) - pmd_index(*first);
if (i && i <= MIN_LRU_BATCH) {
__set_bit(i - 1, bitmap);
- return;
+ return 0;
}
pmd = pmd_offset(pud, *first);
@@ -3673,6 +3687,16 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
dirty = true;
walk->mm_stats[MM_LEAF_YOUNG]++;
+ if (!walk->accessed_cb)
+ goto next;
+
+ err = walk->accessed_cb(pfn);
+ if (err) {
+ i = find_next_bit(bitmap, MIN_LRU_BATCH, i) + 1;
+
+ walk->next_addr = (*first & PMD_MASK) + i * PMD_SIZE;
+ break;
+ }
next:
i = i > MIN_LRU_BATCH ? 0 : find_next_bit(bitmap, MIN_LRU_BATCH, i) + 1;
} while (i <= MIN_LRU_BATCH);
@@ -3683,9 +3707,10 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
spin_unlock(ptl);
done:
*first = -1;
+ return err;
}
-static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
+static int walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
struct mm_walk *args)
{
int i;
@@ -3697,6 +3722,7 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
unsigned long first = -1;
struct lru_gen_mm_walk *walk = args->private;
struct lru_gen_mm_state *mm_state = get_mm_state(walk->lruvec);
+ int err = 0;
VM_WARN_ON_ONCE(pud_leaf(*pud));
@@ -3710,6 +3736,7 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
/* walk_pte_range() may call get_next_vma() */
vma = args->vma;
for (i = pmd_index(start), addr = start; addr != end; i++, addr = next) {
+ bool suitable;
pmd_t val = pmdp_get_lockless(pmd + i);
next = pmd_addr_end(addr, end);
@@ -3726,7 +3753,10 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
walk->mm_stats[MM_LEAF_TOTAL]++;
if (pfn != -1)
- walk_pmd_range_locked(pud, addr, vma, args, bitmap, &first);
+ err = walk_pmd_range_locked(pud, addr, vma, args,
+ bitmap, &first);
+ if (err)
+ return err;
continue;
}
@@ -3735,33 +3765,50 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,
if (!pmd_young(val))
continue;
- walk_pmd_range_locked(pud, addr, vma, args, bitmap, &first);
+ err = walk_pmd_range_locked(pud, addr, vma, args,
+ bitmap, &first);
+ if (err)
+ return err;
}
if (!walk->force_scan && !test_bloom_filter(mm_state, walk->seq, pmd + i))
continue;
+ err = walk_pte_range(&val, addr, next, args, &suitable);
+ if (err && walk->next_addr < next && first == -1)
+ return err;
+
+ walk->nr_total_pte = 0;
+ walk->nr_young_pte = 0;
+
walk->mm_stats[MM_NONLEAF_FOUND]++;
- if (!walk_pte_range(&val, addr, next, args))
- continue;
+ if (!suitable)
+ goto next;
walk->mm_stats[MM_NONLEAF_ADDED]++;
/* carry over to the next generation */
update_bloom_filter(mm_state, walk->seq + 1, pmd + i);
+next:
+ if (err) {
+ walk->next_addr = first;
+ return err;
+ }
}
- walk_pmd_range_locked(pud, -1, vma, args, bitmap, &first);
+ err = walk_pmd_range_locked(pud, -1, vma, args, bitmap, &first);
- if (i < PTRS_PER_PMD && get_next_vma(PUD_MASK, PMD_SIZE, args, &start, &end))
+ if (!err && i < PTRS_PER_PMD && get_next_vma(PUD_MASK, PMD_SIZE, args, &start, &end))
goto restart;
+
+ return err;
}
static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
struct mm_walk *args)
{
- int i;
+ int i, err;
pud_t *pud;
unsigned long addr;
unsigned long next;
@@ -3779,7 +3826,9 @@ static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
if (!pud_present(val) || WARN_ON_ONCE(pud_leaf(val)))
continue;
- walk_pmd_range(&val, addr, next, args);
+ err = walk_pmd_range(&val, addr, next, args);
+ if (err)
+ return err;
if (need_resched() || walk->batched >= MAX_LRU_BATCH) {
end = (addr | ~PUD_MASK) + 1;
@@ -3800,40 +3849,48 @@ static int walk_pud_range(p4d_t *p4d, unsigned long start, unsigned long end,
return -EAGAIN;
}
-static void walk_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
+static int try_walk_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
{
+ int err;
static const struct mm_walk_ops mm_walk_ops = {
.test_walk = should_skip_vma,
.p4d_entry = walk_pud_range,
.walk_lock = PGWALK_RDLOCK,
};
- int err;
struct lruvec *lruvec = walk->lruvec;
- walk->next_addr = FIRST_USER_ADDRESS;
+ DEFINE_MAX_SEQ(lruvec);
- do {
- DEFINE_MAX_SEQ(lruvec);
+ err = -EBUSY;
- err = -EBUSY;
+ /* another thread might have called inc_max_seq() */
+ if (walk->seq != max_seq)
+ return err;
- /* another thread might have called inc_max_seq() */
- if (walk->seq != max_seq)
- break;
+ /* the caller might be holding the lock for write */
+ if (mmap_read_trylock(mm)) {
+ err = walk_page_range(mm, walk->next_addr, ULONG_MAX,
+ &mm_walk_ops, walk);
- /* the caller might be holding the lock for write */
- if (mmap_read_trylock(mm)) {
- err = walk_page_range(mm, walk->next_addr, ULONG_MAX, &mm_walk_ops, walk);
+ mmap_read_unlock(mm);
+ }
- mmap_read_unlock(mm);
- }
+ if (walk->batched) {
+ spin_lock_irq(&lruvec->lru_lock);
+ reset_batch_size(walk);
+ spin_unlock_irq(&lruvec->lru_lock);
+ }
- if (walk->batched) {
- spin_lock_irq(&lruvec->lru_lock);
- reset_batch_size(walk);
- spin_unlock_irq(&lruvec->lru_lock);
- }
+ return err;
+}
+static void walk_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
+{
+ int err;
+
+ walk->next_addr = FIRST_USER_ADDRESS;
+ do {
+ err = try_walk_mm(mm, walk);
cond_resched();
} while (err == -EAGAIN);
}
@@ -4045,6 +4102,33 @@ static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq, int swappiness
return success;
}
+void lru_gen_scan_lruvec(struct lruvec *lruvec, unsigned long seq,
+ int (*accessed_cb)(unsigned long), void (*flush_cb)(void))
+{
+ struct lru_gen_mm_walk *walk = current->reclaim_state->mm_walk;
+ struct mm_struct *mm = NULL;
+
+ walk->lruvec = lruvec;
+ walk->seq = seq;
+ walk->accessed_cb = accessed_cb;
+ walk->swappiness = MAX_SWAPPINESS;
+
+ do {
+ int err = -EBUSY;
+
+ iterate_mm_list(walk, &mm);
+ if (!mm)
+ break;
+
+ walk->next_addr = FIRST_USER_ADDRESS;
+ do {
+ err = try_walk_mm(mm, walk);
+ cond_resched();
+ flush_cb();
+ } while (err == -EAGAIN);
+ } while (mm);
+}
+
static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq,
int swappiness, bool force_scan)
{
--
2.34.1
^ permalink raw reply [flat|nested] 53+ messages in thread* [RFC PATCH v2 7/8] mm: klruscand: use mglru scanning for page promotion
2025-09-10 14:46 [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure Bharata B Rao
` (5 preceding siblings ...)
2025-09-10 14:46 ` [RFC PATCH v2 6/8] mm: mglru: generalize page table walk Bharata B Rao
@ 2025-09-10 14:46 ` Bharata B Rao
2025-10-03 12:30 ` Jonathan Cameron
2025-09-10 14:46 ` [RFC PATCH v2 8/8] mm: sched: Move hot page promotion from NUMAB=2 to kpromoted Bharata B Rao
2025-09-10 15:39 ` [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure Matthew Wilcox
8 siblings, 1 reply; 53+ messages in thread
From: Bharata B Rao @ 2025-09-10 14:46 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Jonathan.Cameron, dave.hansen, gourry, hannes, mgorman, mingo,
peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy,
ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm,
david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore, Bharata B Rao
From: Kinsey Ho <kinseyho@google.com>
Introduce a new kernel daemon, klruscand, that periodically invokes the
MGLRU page table walk. It leverages the new callbacks to gather access
information and forwards it to the pghot hot page tracking sub-system
for promotion decisions.
This benefits from reusing the existing MGLRU page table walk
infrastructure, which is optimized with features such as hierarchical
scanning and bloom filters to reduce CPU overhead.
As an additional optimization to be added in the future, we can tune
the scan intervals for each memcg.
Signed-off-by: Kinsey Ho <kinseyho@google.com>
Signed-off-by: Yuanchu Xie <yuanchu@google.com>
Signed-off-by: Bharata B Rao <bharata@amd.com>
[Reduced the scan interval to 100ms, pfn_t to unsigned long]
---
mm/Kconfig | 8 ++++
mm/Makefile | 1 +
mm/klruscand.c | 118 +++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 127 insertions(+)
create mode 100644 mm/klruscand.c
diff --git a/mm/Kconfig b/mm/Kconfig
index 8b236eb874cf..6d53c1208729 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1393,6 +1393,14 @@ config PGHOT
by various sources. Asynchronous promotion is done by per-node
kernel threads.
+config KLRUSCAND
+ bool "Kernel lower tier access scan daemon"
+ default y
+ depends on PGHOT && LRU_GEN_WALKS_MMU
+ help
+ Scan for accesses from lower tiers by invoking MGLRU to perform
+ page table walks.
+
source "mm/damon/Kconfig"
endmenu
diff --git a/mm/Makefile b/mm/Makefile
index ecdd5241bea8..05a96ec35aa3 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -148,3 +148,4 @@ obj-$(CONFIG_EXECMEM) += execmem.o
obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
obj-$(CONFIG_PT_RECLAIM) += pt_reclaim.o
obj-$(CONFIG_PGHOT) += pghot.o
+obj-$(CONFIG_KLRUSCAND) += klruscand.o
diff --git a/mm/klruscand.c b/mm/klruscand.c
new file mode 100644
index 000000000000..1a51aab29bd9
--- /dev/null
+++ b/mm/klruscand.c
@@ -0,0 +1,118 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/memcontrol.h>
+#include <linux/kthread.h>
+#include <linux/module.h>
+#include <linux/vmalloc.h>
+#include <linux/random.h>
+#include <linux/migrate.h>
+#include <linux/mm_inline.h>
+#include <linux/slab.h>
+#include <linux/sched/clock.h>
+#include <linux/memory-tiers.h>
+#include <linux/sched/mm.h>
+#include <linux/sched.h>
+#include <linux/pghot.h>
+
+#include "internal.h"
+
+#define KLRUSCAND_INTERVAL_MS 100
+#define BATCH_SIZE (2 << 16)
+
+static struct task_struct *scan_thread;
+static unsigned long pfn_batch[BATCH_SIZE];
+static int batch_index;
+
+static void flush_cb(void)
+{
+ int i = 0;
+
+ for (; i < batch_index; i++) {
+ u64 pfn = pfn_batch[i];
+
+ pghot_record_access((unsigned long)pfn, NUMA_NO_NODE,
+ PGHOT_PGTABLE_SCAN, jiffies);
+
+ if (i % 16 == 0)
+ cond_resched();
+ }
+ batch_index = 0;
+}
+
+static int accessed_cb(unsigned long pfn)
+{
+ if (batch_index >= BATCH_SIZE)
+ return -EAGAIN;
+
+ pfn_batch[batch_index++] = pfn;
+ return 0;
+}
+
+static int klruscand_run(void *unused)
+{
+ struct lru_gen_mm_walk *walk;
+
+ walk = kzalloc(sizeof(*walk),
+ __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
+ if (!walk)
+ return -ENOMEM;
+
+ while (!kthread_should_stop()) {
+ unsigned long next_wake_time;
+ long sleep_time;
+ struct mem_cgroup *memcg;
+ int flags;
+ int nid;
+
+ next_wake_time = jiffies + msecs_to_jiffies(KLRUSCAND_INTERVAL_MS);
+
+ for_each_node_state(nid, N_MEMORY) {
+ pg_data_t *pgdat = NODE_DATA(nid);
+ struct reclaim_state rs = { 0 };
+
+ if (node_is_toptier(nid))
+ continue;
+
+ rs.mm_walk = walk;
+ set_task_reclaim_state(current, &rs);
+ flags = memalloc_noreclaim_save();
+
+ memcg = mem_cgroup_iter(NULL, NULL, NULL);
+ do {
+ struct lruvec *lruvec =
+ mem_cgroup_lruvec(memcg, pgdat);
+ unsigned long max_seq =
+ READ_ONCE((lruvec)->lrugen.max_seq);
+
+ lru_gen_scan_lruvec(lruvec, max_seq,
+ accessed_cb, flush_cb);
+ cond_resched();
+ } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+ memalloc_noreclaim_restore(flags);
+ set_task_reclaim_state(current, NULL);
+ memset(walk, 0, sizeof(*walk));
+ }
+
+ sleep_time = next_wake_time - jiffies;
+ if (sleep_time > 0 && sleep_time != MAX_SCHEDULE_TIMEOUT)
+ schedule_timeout_idle(sleep_time);
+ }
+ kfree(walk);
+ return 0;
+}
+
+static int __init klruscand_init(void)
+{
+ struct task_struct *task;
+
+ task = kthread_run(klruscand_run, NULL, "klruscand");
+
+ if (IS_ERR(task)) {
+ pr_err("Failed to create klruscand kthread\n");
+ return PTR_ERR(task);
+ }
+
+ scan_thread = task;
+ return 0;
+}
+module_init(klruscand_init);
--
2.34.1
^ permalink raw reply [flat|nested] 53+ messages in thread* Re: [RFC PATCH v2 7/8] mm: klruscand: use mglru scanning for page promotion
2025-09-10 14:46 ` [RFC PATCH v2 7/8] mm: klruscand: use mglru scanning for page promotion Bharata B Rao
@ 2025-10-03 12:30 ` Jonathan Cameron
0 siblings, 0 replies; 53+ messages in thread
From: Jonathan Cameron @ 2025-10-03 12:30 UTC (permalink / raw)
To: Bharata B Rao, akpm, david
Cc: linux-kernel, linux-mm, dave.hansen, gourry, hannes, mgorman,
mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc,
willy, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis,
byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore
On Wed, 10 Sep 2025 20:16:52 +0530
Bharata B Rao <bharata@amd.com> wrote:
> From: Kinsey Ho <kinseyho@google.com>
>
> Introduce a new kernel daemon, klruscand, that periodically invokes the
> MGLRU page table walk. It leverages the new callbacks to gather access
> information and forwards it to the pghot hot page tracking sub-system
> for promotion decisions.
>
> This benefits from reusing the existing MGLRU page table walk
> infrastructure, which is optimized with features such as hierarchical
> scanning and bloom filters to reduce CPU overhead.
>
> As an additional optimization to be added in the future, we can tune
> the scan intervals for each memcg.
>
> Signed-off-by: Kinsey Ho <kinseyho@google.com>
> Signed-off-by: Yuanchu Xie <yuanchu@google.com>
> Signed-off-by: Bharata B Rao <bharata@amd.com>
> [Reduced the scan interval to 100ms, pfn_t to unsigned long]
Some very minor comments inline. I know even less about the stuff this
is using than IBS (and I don't know much about that ;)
J
> ---
> mm/Kconfig | 8 ++++
> mm/Makefile | 1 +
> mm/klruscand.c | 118 +++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 127 insertions(+)
> create mode 100644 mm/klruscand.c
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 8b236eb874cf..6d53c1208729 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1393,6 +1393,14 @@ config PGHOT
> by various sources. Asynchronous promotion is done by per-node
> kernel threads.
>
> +config KLRUSCAND
> + bool "Kernel lower tier access scan daemon"
> + default y
Why default to y? That's very rarely done for new features.
> + depends on PGHOT && LRU_GEN_WALKS_MMU
> + help
> + Scan for accesses from lower tiers by invoking MGLRU to perform
> + page table walks.
> diff --git a/mm/klruscand.c b/mm/klruscand.c
> new file mode 100644
> index 000000000000..1a51aab29bd9
> --- /dev/null
> +++ b/mm/klruscand.c
> @@ -0,0 +1,118 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +#include <linux/memcontrol.h>
Probably pick some ordering scheme for includes.
I'm not spotting what is currently used here.
> +#include <linux/kthread.h>
> +#include <linux/module.h>
> +#include <linux/vmalloc.h>
> +#include <linux/random.h>
> +#include <linux/migrate.h>
> +#include <linux/mm_inline.h>
> +#include <linux/slab.h>
> +#include <linux/sched/clock.h>
> +#include <linux/memory-tiers.h>
> +#include <linux/sched/mm.h>
> +#include <linux/sched.h>
> +#include <linux/pghot.h>
> +
> +#include "internal.h"
> +
> +#define KLRUSCAND_INTERVAL_MS 100
> +#define BATCH_SIZE (2 << 16)
> +
> +static struct task_struct *scan_thread;
> +static unsigned long pfn_batch[BATCH_SIZE];
> +static int batch_index;
> +
> +static void flush_cb(void)
> +{
> + int i = 0;
> +
> + for (; i < batch_index; i++) {
> + u64 pfn = pfn_batch[i];
Why dance through types? pfn_batch is unsigned long and it is
cast back to that below.
> +
> + pghot_record_access((unsigned long)pfn, NUMA_NO_NODE,
> + PGHOT_PGTABLE_SCAN, jiffies);
> +
> + if (i % 16 == 0)
No problem with this, but maybe a comment on why 16?
> + cond_resched();
> + }
> + batch_index = 0;
> +}
> +static int klruscand_run(void *unused)
> +{
> + struct lru_gen_mm_walk *walk;
> +
> + walk = kzalloc(sizeof(*walk),
> + __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
Maybe use __free() magic so we can forget about having to clear this up on exit.
Entirely up to you though as it doesn't simplify the code much in this case.
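i.e. something like this (sketch only, using the scope-based cleanup
helpers from linux/cleanup.h):

	struct lru_gen_mm_walk *walk __free(kfree) =
			kzalloc(sizeof(*walk),
				__GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);

	if (!walk)
		return -ENOMEM;

	/* ... loop as before; the kfree(walk) at the end goes away */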
> + if (!walk)
> + return -ENOMEM;
> +
> + while (!kthread_should_stop()) {
> + unsigned long next_wake_time;
> + long sleep_time;
> + struct mem_cgroup *memcg;
> + int flags;
> + int nid;
> +
> + next_wake_time = jiffies + msecs_to_jiffies(KLRUSCAND_INTERVAL_MS);
> +
> + for_each_node_state(nid, N_MEMORY) {
> + pg_data_t *pgdat = NODE_DATA(nid);
> + struct reclaim_state rs = { 0 };
> +
> + if (node_is_toptier(nid))
> + continue;
> +
> + rs.mm_walk = walk;
> + set_task_reclaim_state(current, &rs);
> + flags = memalloc_noreclaim_save();
> +
> + memcg = mem_cgroup_iter(NULL, NULL, NULL);
> + do {
> + struct lruvec *lruvec =
> + mem_cgroup_lruvec(memcg, pgdat);
> + unsigned long max_seq =
> + READ_ONCE((lruvec)->lrugen.max_seq);
> +
> + lru_gen_scan_lruvec(lruvec, max_seq,
> + accessed_cb, flush_cb);
> + cond_resched();
> + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
> +
> + memalloc_noreclaim_restore(flags);
> + set_task_reclaim_state(current, NULL);
> + memset(walk, 0, sizeof(*walk));
> + }
> +
> + sleep_time = next_wake_time - jiffies;
> + if (sleep_time > 0 && sleep_time != MAX_SCHEDULE_TIMEOUT)
> + schedule_timeout_idle(sleep_time);
> + }
> + kfree(walk);
> + return 0;
> +}
^ permalink raw reply [flat|nested] 53+ messages in thread
* [RFC PATCH v2 8/8] mm: sched: Move hot page promotion from NUMAB=2 to kpromoted
2025-09-10 14:46 [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure Bharata B Rao
` (6 preceding siblings ...)
2025-09-10 14:46 ` [RFC PATCH v2 7/8] mm: klruscand: use mglru scanning for page promotion Bharata B Rao
@ 2025-09-10 14:46 ` Bharata B Rao
2025-10-03 12:38 ` Jonathan Cameron
2025-09-10 15:39 ` [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure Matthew Wilcox
8 siblings, 1 reply; 53+ messages in thread
From: Bharata B Rao @ 2025-09-10 14:46 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Jonathan.Cameron, dave.hansen, gourry, hannes, mgorman, mingo,
peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy,
ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm,
david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore, Bharata B Rao
Currently hot page promotion (NUMA_BALANCING_MEMORY_TIERING
mode of NUMA Balancing) does hot page detection (via hint faults),
hot page classification and eventual promotion, all by itself and
sits within the scheduler.
With the new hot page tracking and promotion mechanism being
available, NUMA Balancing can limit itself to detection of
hot pages (via hint faults) and off-load the rest of the
functionality to the common hot page tracking system.
The pghot_record_access(PGHOT_HINT_FAULT) API is used to feed the
hot page info. In addition, the migration rate limiting and
dynamic threshold logic are moved to kpromoted so that they
can be used for hot pages reported by other sources too.
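Conceptually, the hint-fault side now reduces to recording the access
and letting kpromoted handle the migration, along the lines of the
sketch below (illustrative only; it reuses the pghot_record_access()
signature introduced earlier in this series, the actual call site is
in the diff below):

	/* in do_numa_page(), instead of migrating the folio inline */
	pghot_record_access(folio_pfn(folio), target_nid,
			    PGHOT_HINT_FAULT, jiffies);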
Signed-off-by: Bharata B Rao <bharata@amd.com>
---
include/linux/pghot.h | 2 +
kernel/sched/fair.c | 149 ++----------------------------------------
mm/memory.c | 32 ++-------
mm/pghot.c | 132 +++++++++++++++++++++++++++++++++++--
4 files changed, 142 insertions(+), 173 deletions(-)
diff --git a/include/linux/pghot.h b/include/linux/pghot.h
index 1443643aab13..98a72e01bdd6 100644
--- a/include/linux/pghot.h
+++ b/include/linux/pghot.h
@@ -47,6 +47,8 @@ enum pghot_src {
#define PGHOT_HEAP_PCT 25
#define KPROMOTED_MIGRATE_BATCH 1024
+#define KPROMOTED_MIGRATION_ADJUST_STEPS 16
+#define KPROMOTED_PROMOTION_THRESHOLD_WINDOW 60000
/*
* If target NID isn't available, kpromoted promotes to node 0
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b173a059315c..54eeddb6ec23 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -125,11 +125,6 @@ int __weak arch_asym_cpu_priority(int cpu)
static unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
#endif
-#ifdef CONFIG_NUMA_BALANCING
-/* Restrict the NUMA promotion throughput (MB/s) for each target node. */
-static unsigned int sysctl_numa_balancing_promote_rate_limit = 65536;
-#endif
-
#ifdef CONFIG_SYSCTL
static const struct ctl_table sched_fair_sysctls[] = {
#ifdef CONFIG_CFS_BANDWIDTH
@@ -142,16 +137,6 @@ static const struct ctl_table sched_fair_sysctls[] = {
.extra1 = SYSCTL_ONE,
},
#endif
-#ifdef CONFIG_NUMA_BALANCING
- {
- .procname = "numa_balancing_promote_rate_limit_MBps",
- .data = &sysctl_numa_balancing_promote_rate_limit,
- .maxlen = sizeof(unsigned int),
- .mode = 0644,
- .proc_handler = proc_dointvec_minmax,
- .extra1 = SYSCTL_ZERO,
- },
-#endif /* CONFIG_NUMA_BALANCING */
};
static int __init sched_fair_sysctl_init(void)
@@ -1800,108 +1785,6 @@ static inline bool cpupid_valid(int cpupid)
return cpupid_to_cpu(cpupid) < nr_cpu_ids;
}
-/*
- * For memory tiering mode, if there are enough free pages (more than
- * enough watermark defined here) in fast memory node, to take full
- * advantage of fast memory capacity, all recently accessed slow
- * memory pages will be migrated to fast memory node without
- * considering hot threshold.
- */
-static bool pgdat_free_space_enough(struct pglist_data *pgdat)
-{
- int z;
- unsigned long enough_wmark;
-
- enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT,
- pgdat->node_present_pages >> 4);
- for (z = pgdat->nr_zones - 1; z >= 0; z--) {
- struct zone *zone = pgdat->node_zones + z;
-
- if (!populated_zone(zone))
- continue;
-
- if (zone_watermark_ok(zone, 0,
- promo_wmark_pages(zone) + enough_wmark,
- ZONE_MOVABLE, 0))
- return true;
- }
- return false;
-}
-
-/*
- * For memory tiering mode, when page tables are scanned, the scan
- * time will be recorded in struct page in addition to make page
- * PROT_NONE for slow memory page. So when the page is accessed, in
- * hint page fault handler, the hint page fault latency is calculated
- * via,
- *
- * hint page fault latency = hint page fault time - scan time
- *
- * The smaller the hint page fault latency, the higher the possibility
- * for the page to be hot.
- */
-static int numa_hint_fault_latency(struct folio *folio)
-{
- int last_time, time;
-
- time = jiffies_to_msecs(jiffies);
- last_time = folio_xchg_access_time(folio, time);
-
- return (time - last_time) & PAGE_ACCESS_TIME_MASK;
-}
-
-/*
- * For memory tiering mode, too high promotion/demotion throughput may
- * hurt application latency. So we provide a mechanism to rate limit
- * the number of pages that are tried to be promoted.
- */
-static bool numa_promotion_rate_limit(struct pglist_data *pgdat,
- unsigned long rate_limit, int nr)
-{
- unsigned long nr_cand;
- unsigned int now, start;
-
- now = jiffies_to_msecs(jiffies);
- mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
- nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
- start = pgdat->nbp_rl_start;
- if (now - start > MSEC_PER_SEC &&
- cmpxchg(&pgdat->nbp_rl_start, start, now) == start)
- pgdat->nbp_rl_nr_cand = nr_cand;
- if (nr_cand - pgdat->nbp_rl_nr_cand >= rate_limit)
- return true;
- return false;
-}
-
-#define NUMA_MIGRATION_ADJUST_STEPS 16
-
-static void numa_promotion_adjust_threshold(struct pglist_data *pgdat,
- unsigned long rate_limit,
- unsigned int ref_th)
-{
- unsigned int now, start, th_period, unit_th, th;
- unsigned long nr_cand, ref_cand, diff_cand;
-
- now = jiffies_to_msecs(jiffies);
- th_period = sysctl_numa_balancing_scan_period_max;
- start = pgdat->nbp_th_start;
- if (now - start > th_period &&
- cmpxchg(&pgdat->nbp_th_start, start, now) == start) {
- ref_cand = rate_limit *
- sysctl_numa_balancing_scan_period_max / MSEC_PER_SEC;
- nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
- diff_cand = nr_cand - pgdat->nbp_th_nr_cand;
- unit_th = ref_th * 2 / NUMA_MIGRATION_ADJUST_STEPS;
- th = pgdat->nbp_threshold ? : ref_th;
- if (diff_cand > ref_cand * 11 / 10)
- th = max(th - unit_th, unit_th);
- else if (diff_cand < ref_cand * 9 / 10)
- th = min(th + unit_th, ref_th * 2);
- pgdat->nbp_th_nr_cand = nr_cand;
- pgdat->nbp_threshold = th;
- }
-}
-
bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
int src_nid, int dst_cpu)
{
@@ -1917,33 +1800,11 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
/*
* The pages in slow memory node should be migrated according
- * to hot/cold instead of private/shared.
- */
- if (folio_use_access_time(folio)) {
- struct pglist_data *pgdat;
- unsigned long rate_limit;
- unsigned int latency, th, def_th;
-
- pgdat = NODE_DATA(dst_nid);
- if (pgdat_free_space_enough(pgdat)) {
- /* workload changed, reset hot threshold */
- pgdat->nbp_threshold = 0;
- return true;
- }
-
- def_th = sysctl_numa_balancing_hot_threshold;
- rate_limit = sysctl_numa_balancing_promote_rate_limit << \
- (20 - PAGE_SHIFT);
- numa_promotion_adjust_threshold(pgdat, rate_limit, def_th);
-
- th = pgdat->nbp_threshold ? : def_th;
- latency = numa_hint_fault_latency(folio);
- if (latency >= th)
- return false;
-
- return !numa_promotion_rate_limit(pgdat, rate_limit,
- folio_nr_pages(folio));
- }
+ * to hot/cold instead of private/shared. Also, the migration
+ * of such pages is handled by kpromoted.
+ */
+ if (folio_use_access_time(folio))
+ return true;
this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
last_cpupid = folio_xchg_last_cpupid(folio, this_cpupid);
diff --git a/mm/memory.c b/mm/memory.c
index 0ba4f6b71847..eeb34e8d9b8e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -75,6 +75,7 @@
#include <linux/ptrace.h>
#include <linux/vmalloc.h>
#include <linux/sched/sysctl.h>
+#include <linux/pghot.h>
#include <trace/events/kmem.h>
@@ -5864,34 +5865,12 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
target_nid = numa_migrate_check(folio, vmf, vmf->address, &flags,
writable, &last_cpupid);
+ nid = target_nid;
if (target_nid == NUMA_NO_NODE)
goto out_map;
- if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
- flags |= TNF_MIGRATE_FAIL;
- goto out_map;
- }
- /* The folio is isolated and isolation code holds a folio reference. */
- pte_unmap_unlock(vmf->pte, vmf->ptl);
+
writable = false;
ignore_writable = true;
-
- /* Migrate to the requested node */
- if (!migrate_misplaced_folio(folio, target_nid)) {
- nid = target_nid;
- flags |= TNF_MIGRATED;
- task_numa_fault(last_cpupid, nid, nr_pages, flags);
- return 0;
- }
-
- flags |= TNF_MIGRATE_FAIL;
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
- vmf->address, &vmf->ptl);
- if (unlikely(!vmf->pte))
- return 0;
- if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
- pte_unmap_unlock(vmf->pte, vmf->ptl);
- return 0;
- }
out_map:
/*
* Make it present again, depending on how arch implements
@@ -5905,8 +5884,11 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
writable);
pte_unmap_unlock(vmf->pte, vmf->ptl);
- if (nid != NUMA_NO_NODE)
+ if (nid != NUMA_NO_NODE) {
+ pghot_record_access(folio_pfn(folio), nid, PGHOT_HINT_FAULT,
+ jiffies);
task_numa_fault(last_cpupid, nid, nr_pages, flags);
+ }
return 0;
}
diff --git a/mm/pghot.c b/mm/pghot.c
index 9f7581818b8f..9f5746892bce 100644
--- a/mm/pghot.c
+++ b/mm/pghot.c
@@ -9,6 +9,9 @@
*
* kpromoted is a kernel thread that runs on each toptier node and
* promotes pages from max_heap.
+ *
+ * Migration rate-limiting and dynamic threshold logic implementations
+ * were moved from NUMA Balancing mode 2.
*/
#include <linux/pghot.h>
#include <linux/kthread.h>
@@ -34,6 +37,9 @@ static bool kpromoted_started __ro_after_init;
static unsigned int sysctl_pghot_freq_window = KPROMOTED_FREQ_WINDOW;
+/* Restrict the NUMA promotion throughput (MB/s) for each target node. */
+static unsigned int sysctl_pghot_promote_rate_limit = 65536;
+
#ifdef CONFIG_SYSCTL
static const struct ctl_table pghot_sysctls[] = {
{
@@ -44,8 +50,17 @@ static const struct ctl_table pghot_sysctls[] = {
.proc_handler = proc_dointvec_minmax,
.extra1 = SYSCTL_ZERO,
},
+ {
+ .procname = "pghot_promote_rate_limit_MBps",
+ .data = &sysctl_pghot_promote_rate_limit,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ },
};
#endif
+
static bool phi_heap_less(const void *lhs, const void *rhs, void *args)
{
return (*(struct pghot_info **)lhs)->frequency >
@@ -94,11 +109,99 @@ static bool phi_heap_insert(struct max_heap *phi_heap, struct pghot_info *phi)
return true;
}
+/*
+ * For memory tiering mode, if there are enough free pages (more than
+ * enough watermark defined here) in fast memory node, to take full
+ * advantage of fast memory capacity, all recently accessed slow
+ * memory pages will be migrated to fast memory node without
+ * considering hot threshold.
+ */
+static bool pgdat_free_space_enough(struct pglist_data *pgdat)
+{
+ int z;
+ unsigned long enough_wmark;
+
+ enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT,
+ pgdat->node_present_pages >> 4);
+ for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+ struct zone *zone = pgdat->node_zones + z;
+
+ if (!populated_zone(zone))
+ continue;
+
+ if (zone_watermark_ok(zone, 0,
+ promo_wmark_pages(zone) + enough_wmark,
+ ZONE_MOVABLE, 0))
+ return true;
+ }
+ return false;
+}
+
+/*
+ * For memory tiering mode, too high promotion/demotion throughput may
+ * hurt application latency. So we provide a mechanism to rate limit
+ * the number of pages that are tried to be promoted.
+ */
+static bool kpromoted_promotion_rate_limit(struct pglist_data *pgdat,
+ unsigned long rate_limit, int nr,
+ unsigned long time)
+{
+ unsigned long nr_cand;
+ unsigned int now, start;
+
+ now = jiffies_to_msecs(time);
+ mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
+ nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
+ start = pgdat->nbp_rl_start;
+ if (now - start > MSEC_PER_SEC &&
+ cmpxchg(&pgdat->nbp_rl_start, start, now) == start)
+ pgdat->nbp_rl_nr_cand = nr_cand;
+ if (nr_cand - pgdat->nbp_rl_nr_cand >= rate_limit)
+ return true;
+ return false;
+}
+
+static void kpromoted_promotion_adjust_threshold(struct pglist_data *pgdat,
+ unsigned long rate_limit,
+ unsigned int ref_th,
+ unsigned long now)
+{
+ unsigned int start, th_period, unit_th, th;
+ unsigned long nr_cand, ref_cand, diff_cand;
+
+ now = jiffies_to_msecs(now);
+ th_period = KPROMOTED_PROMOTION_THRESHOLD_WINDOW;
+ start = pgdat->nbp_th_start;
+ if (now - start > th_period &&
+ cmpxchg(&pgdat->nbp_th_start, start, now) == start) {
+ ref_cand = rate_limit *
+ KPROMOTED_PROMOTION_THRESHOLD_WINDOW / MSEC_PER_SEC;
+ nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
+ diff_cand = nr_cand - pgdat->nbp_th_nr_cand;
+ unit_th = ref_th * 2 / KPROMOTED_MIGRATION_ADJUST_STEPS;
+ th = pgdat->nbp_threshold ? : ref_th;
+ if (diff_cand > ref_cand * 11 / 10)
+ th = max(th - unit_th, unit_th);
+ else if (diff_cand < ref_cand * 9 / 10)
+ th = min(th + unit_th, ref_th * 2);
+ pgdat->nbp_th_nr_cand = nr_cand;
+ pgdat->nbp_threshold = th;
+ }
+}
+
+static inline unsigned int pghot_access_latency(struct pghot_info *phi, u32 now)
+{
+ return (now - phi->last_update);
+}
+
static bool phi_is_pfn_hot(struct pghot_info *phi)
{
struct page *page = pfn_to_online_page(phi->pfn);
- unsigned long now = jiffies;
struct folio *folio;
+ struct pglist_data *pgdat;
+ unsigned long rate_limit;
+ unsigned int latency, th, def_th;
+ unsigned long now = jiffies;
if (!page || is_zone_device_page(page))
return false;
@@ -113,7 +216,24 @@ static bool phi_is_pfn_hot(struct pghot_info *phi)
return false;
}
- return true;
+ pgdat = NODE_DATA(phi->nid);
+ if (pgdat_free_space_enough(pgdat)) {
+ /* workload changed, reset hot threshold */
+ pgdat->nbp_threshold = 0;
+ return true;
+ }
+
+ def_th = sysctl_pghot_freq_window;
+ rate_limit = sysctl_pghot_promote_rate_limit << (20 - PAGE_SHIFT);
+ kpromoted_promotion_adjust_threshold(pgdat, rate_limit, def_th, now);
+
+ th = pgdat->nbp_threshold ? : def_th;
+ latency = pghot_access_latency(phi, now & PGHOT_TIME_MASK);
+ if (latency >= th)
+ return false;
+
+ return !kpromoted_promotion_rate_limit(pgdat, rate_limit,
+ folio_nr_pages(folio), now);
}
static struct folio *kpromoted_isolate_folio(struct pghot_info *phi)
@@ -351,9 +471,13 @@ int pghot_record_access(u64 pfn, int nid, int src, unsigned long now)
/*
* If the previous access was beyond the threshold window
* start frequency tracking afresh.
+ *
+ * Bypass the new window logic for NUMA hint fault source
+ * as it is too slow in reporting accesses.
+ * TODO: Fix this.
*/
- if (((cur_time - phi->last_update) > msecs_to_jiffies(sysctl_pghot_freq_window)) ||
- (nid != NUMA_NO_NODE && phi->nid != nid))
+ if ((((cur_time - phi->last_update) > msecs_to_jiffies(sysctl_pghot_freq_window))
+ && (src != PGHOT_HINT_FAULT)) || (nid != NUMA_NO_NODE && phi->nid != nid))
new_window = true;
if (new_entry || new_window) {
--
2.34.1
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 8/8] mm: sched: Move hot page promotion from NUMAB=2 to kpromoted
2025-09-10 14:46 ` [RFC PATCH v2 8/8] mm: sched: Move hot page promotion from NUMAB=2 to kpromoted Bharata B Rao
@ 2025-10-03 12:38 ` Jonathan Cameron
2025-10-06 5:57 ` Bharata B Rao
0 siblings, 1 reply; 53+ messages in thread
From: Jonathan Cameron @ 2025-10-03 12:38 UTC (permalink / raw)
To: Bharata B Rao
Cc: linux-kernel, linux-mm, dave.hansen, gourry, hannes, mgorman,
mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc,
willy, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis,
akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu,
balbirs, alok.rathore
On Wed, 10 Sep 2025 20:16:53 +0530
Bharata B Rao <bharata@amd.com> wrote:
> Currently hot page promotion (NUMA_BALANCING_MEMORY_TIERING
> mode of NUMA Balancing) does hot page detection (via hint faults),
> hot page classification and eventual promotion, all by itself and
> sits within the scheduler.
>
> With the new hot page tracking and promotion mechanism being
> available, NUMA Balancing can limit itself to detection of
> hot pages (via hint faults) and off-load rest of the
> functionality to the common hot page tracking system.
>
> pghot_record_access(PGHOT_HINT_FAULT) API is used to feed the
> hot page info. In addition, the migration rate limiting and
> dynamic threshold logic are moved to kpromoted so that the same
> can be used for hot pages reported by other sources too.
>
> Signed-off-by: Bharata B Rao <bharata@amd.com>
Making a direct replacement without any fallback to the previous method
is going to need a lot of data to show there are no important regressions.
So bold move if that's the intent!
J
> ---
> include/linux/pghot.h | 2 +
> kernel/sched/fair.c | 149 ++----------------------------------------
> mm/memory.c | 32 ++-------
> mm/pghot.c | 132 +++++++++++++++++++++++++++++++++++--
> 4 files changed, 142 insertions(+), 173 deletions(-)
>
> diff --git a/mm/pghot.c b/mm/pghot.c
> index 9f7581818b8f..9f5746892bce 100644
> --- a/mm/pghot.c
> +++ b/mm/pghot.c
> @@ -9,6 +9,9 @@
> *
> * kpromoted is a kernel thread that runs on each toptier node and
> * promotes pages from max_heap.
> + *
> + * Migration rate-limiting and dynamic threshold logic implementations
> + * were moved from NUMA Balancing mode 2.
> */
> #include <linux/pghot.h>
> #include <linux/kthread.h>
> @@ -34,6 +37,9 @@ static bool kpromoted_started __ro_after_init;
>
> static unsigned int sysctl_pghot_freq_window = KPROMOTED_FREQ_WINDOW;
>
> +/* Restrict the NUMA promotion throughput (MB/s) for each target node. */
> +static unsigned int sysctl_pghot_promote_rate_limit = 65536;
If the comment correlates with the value, this is 64 GiB/s? That seems
unlikely, though I guess possible.
> +
> #ifdef CONFIG_SYSCTL
> static const struct ctl_table pghot_sysctls[] = {
> {
> @@ -44,8 +50,17 @@ static const struct ctl_table pghot_sysctls[] = {
> .proc_handler = proc_dointvec_minmax,
> .extra1 = SYSCTL_ZERO,
> },
> + {
> + .procname = "pghot_promote_rate_limit_MBps",
> + .data = &sysctl_pghot_promote_rate_limit,
> + .maxlen = sizeof(unsigned int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + .extra1 = SYSCTL_ZERO,
> + },
> };
> #endif
> +
Put that in an earlier patch to reduce noise here.
> static bool phi_heap_less(const void *lhs, const void *rhs, void *args)
> {
> return (*(struct pghot_info **)lhs)->frequency >
> @@ -94,11 +109,99 @@ static bool phi_heap_insert(struct max_heap *phi_heap, struct pghot_info *phi)
> return true;
> }
>
> +/*
> + * For memory tiering mode, if there are enough free pages (more than
> + * enough watermark defined here) in fast memory node, to take full
I'd use enough_wmark, just because "more than enough" is a common
English phrase and I at least tripped over that sentence as a result!
> + * advantage of fast memory capacity, all recently accessed slow
> + * memory pages will be migrated to fast memory node without
> + * considering hot threshold.
> + */
> +static bool pgdat_free_space_enough(struct pglist_data *pgdat)
> +{
> + int z;
> + unsigned long enough_wmark;
> +
> + enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT,
> + pgdat->node_present_pages >> 4);
> + for (z = pgdat->nr_zones - 1; z >= 0; z--) {
> + struct zone *zone = pgdat->node_zones + z;
> +
> + if (!populated_zone(zone))
> + continue;
> +
> + if (zone_watermark_ok(zone, 0,
> + promo_wmark_pages(zone) + enough_wmark,
> + ZONE_MOVABLE, 0))
> + return true;
> + }
> + return false;
> +}
> +
> +static void kpromoted_promotion_adjust_threshold(struct pglist_data *pgdat,
Needs documentation of the algorithm and the reasons for the various choices.
I see it is a code move though, so maybe that's a job for another day.
> + unsigned long rate_limit,
> + unsigned int ref_th,
> + unsigned long now)
> +{
> + unsigned int start, th_period, unit_th, th;
> + unsigned long nr_cand, ref_cand, diff_cand;
> +
> + now = jiffies_to_msecs(now);
> + th_period = KPROMOTED_PROMOTION_THRESHOLD_WINDOW;
> + start = pgdat->nbp_th_start;
> + if (now - start > th_period &&
> + cmpxchg(&pgdat->nbp_th_start, start, now) == start) {
> + ref_cand = rate_limit *
> + KPROMOTED_PROMOTION_THRESHOLD_WINDOW / MSEC_PER_SEC;
> + nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
> + diff_cand = nr_cand - pgdat->nbp_th_nr_cand;
> + unit_th = ref_th * 2 / KPROMOTED_MIGRATION_ADJUST_STEPS;
> + th = pgdat->nbp_threshold ? : ref_th;
> + if (diff_cand > ref_cand * 11 / 10)
> + th = max(th - unit_th, unit_th);
> + else if (diff_cand < ref_cand * 9 / 10)
> + th = min(th + unit_th, ref_th * 2);
> + pgdat->nbp_th_nr_cand = nr_cand;
> + pgdat->nbp_threshold = th;
> + }
> +}
> +
> static bool phi_is_pfn_hot(struct pghot_info *phi)
> {
> struct page *page = pfn_to_online_page(phi->pfn);
> - unsigned long now = jiffies;
> struct folio *folio;
> + struct pglist_data *pgdat;
> + unsigned long rate_limit;
> + unsigned int latency, th, def_th;
> + unsigned long now = jiffies;
>
Avoid the reorder. Just put it here in the first place if you prefer this.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 8/8] mm: sched: Move hot page promotion from NUMAB=2 to kpromoted
2025-10-03 12:38 ` Jonathan Cameron
@ 2025-10-06 5:57 ` Bharata B Rao
2025-10-06 9:53 ` Jonathan Cameron
0 siblings, 1 reply; 53+ messages in thread
From: Bharata B Rao @ 2025-10-06 5:57 UTC (permalink / raw)
To: Jonathan Cameron
Cc: linux-kernel, linux-mm, dave.hansen, gourry, hannes, mgorman,
mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc,
willy, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis,
akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu,
balbirs, alok.rathore
On 03-Oct-25 6:08 PM, Jonathan Cameron wrote:
> On Wed, 10 Sep 2025 20:16:53 +0530
> Bharata B Rao <bharata@amd.com> wrote:
>
>> Currently hot page promotion (NUMA_BALANCING_MEMORY_TIERING
>> mode of NUMA Balancing) does hot page detection (via hint faults),
>> hot page classification and eventual promotion, all by itself and
>> sits within the scheduler.
>>
>> With the new hot page tracking and promotion mechanism being
>> available, NUMA Balancing can limit itself to detection of
>> hot pages (via hint faults) and off-load rest of the
>> functionality to the common hot page tracking system.
>>
>> pghot_record_access(PGHOT_HINT_FAULT) API is used to feed the
>> hot page info. In addition, the migration rate limiting and
>> dynamic threshold logic are moved to kpromoted so that the same
>> can be used for hot pages reported by other sources too.
>>
>> Signed-off-by: Bharata B Rao <bharata@amd.com>
>
> Making a direct replacement without any fallback to previous method
> is going to need a lot of data to show there are no important regressions.
>
> So bold move if that's the intent!
Firstly, I am only moving the existing hot page heuristics that are part of
NUMAB=2 to kpromoted, so that the same heuristics can be applied to hot
pages identified by other sources. So the hint fault mechanism that is inherent
to NUMAB=2 still remains.
In fact, the kscand effort started as a potential replacement for the existing
hot page promotion mechanism by getting rid of hint faults and moving the
page table scanning out of process context.
In any case, I will start including numbers from the next post.
>>
>> static unsigned int sysctl_pghot_freq_window = KPROMOTED_FREQ_WINDOW;
>>
>> +/* Restrict the NUMA promotion throughput (MB/s) for each target node. */
>> +static unsigned int sysctl_pghot_promote_rate_limit = 65536;
>
> If the comment correlates with the value, this is 64 GiB/s? That seems
> unlikely if I guess possible.
IIUC, the existing logic tries to limit the promotion rate to 64 GiB/s by
limiting the number of candidate pages that are promoted within the
1s observation interval.
Are you saying that achieving a rate of 64 GiB/s is not possible
or unlikely?
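For anyone who wants to sanity-check the arithmetic, here is a tiny
userspace sketch (assuming 4 KiB pages; it only mirrors the
"<< (20 - PAGE_SHIFT)" conversion used in the code, nothing else):

#include <stdio.h>

int main(void)
{
        unsigned long long mbps = 65536;        /* sysctl default, MB/s */
        unsigned int page_shift = 12;           /* assumed 4 KiB pages */
        unsigned long long pages_per_sec = mbps << (20 - page_shift);
        unsigned long long gib_per_sec = (pages_per_sec << page_shift) >> 30;

        printf("%llu MB/s -> %llu candidate pages/s -> %llu GiB/s cap\n",
               mbps, pages_per_sec, gib_per_sec);
        return 0;
}

That prints 16777216 candidate pages/s, which is the 64 GiB/s figure
being discussed.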
>
>> +
>> #ifdef CONFIG_SYSCTL
>> static const struct ctl_table pghot_sysctls[] = {
>> {
>> @@ -44,8 +50,17 @@ static const struct ctl_table pghot_sysctls[] = {
>> .proc_handler = proc_dointvec_minmax,
>> .extra1 = SYSCTL_ZERO,
>> },
>> + {
>> + .procname = "pghot_promote_rate_limit_MBps",
>> + .data = &sysctl_pghot_promote_rate_limit,
>> + .maxlen = sizeof(unsigned int),
>> + .mode = 0644,
>> + .proc_handler = proc_dointvec_minmax,
>> + .extra1 = SYSCTL_ZERO,
>> + },
>> };
>> #endif
>> +
> Put that in earlier patch to reduce noise here.
This patch moves the hot page heuristics to kpromoted, and hence the
related sysctl is also moved in this patch.
>
>> static bool phi_heap_less(const void *lhs, const void *rhs, void *args)
>> {
>> return (*(struct pghot_info **)lhs)->frequency >
>> @@ -94,11 +109,99 @@ static bool phi_heap_insert(struct max_heap *phi_heap, struct pghot_info *phi)
>> return true;
>> }
>>
>> +/*
>> + * For memory tiering mode, if there are enough free pages (more than
>> + * enough watermark defined here) in fast memory node, to take full
>
> I'd use enough_wmark Just because "more than enough" is a common
> English phrase and I at least tripped over that sentence as a result!
Ah I see that, but as you note later, I am currently only doing the
movement.
>
>> + * advantage of fast memory capacity, all recently accessed slow
>> + * memory pages will be migrated to fast memory node without
>> + * considering hot threshold.
>> + */
>> +static bool pgdat_free_space_enough(struct pglist_data *pgdat)
>> +{
>> + int z;
>> + unsigned long enough_wmark;
>> +
>> + enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT,
>> + pgdat->node_present_pages >> 4);
>> + for (z = pgdat->nr_zones - 1; z >= 0; z--) {
>> + struct zone *zone = pgdat->node_zones + z;
>> +
>> + if (!populated_zone(zone))
>> + continue;
>> +
>> + if (zone_watermark_ok(zone, 0,
>> + promo_wmark_pages(zone) + enough_wmark,
>> + ZONE_MOVABLE, 0))
>> + return true;
>> + }
>> + return false;
>> +}
>
>> +
>> +static void kpromoted_promotion_adjust_threshold(struct pglist_data *pgdat,
>
> Needs documentation of the algorithm and the reasons for various choices.
>
> I see it is a code move though so maybe that's a job for another day.
Sure.
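
As a starting point for that documentation, here is roughly how I would
summarise the moved logic (a toy userspace model, not the kernel code;
the constants mirror the patch, while the 1000 ms default threshold below
is made up purely for illustration): once per 60s window, compare how
many promotion candidates were seen against the rate-limit budget for
that window, and nudge the hot threshold one step in the direction that
brings the candidate count back toward the budget.

#include <stdio.h>

#define ADJUST_STEPS    16              /* KPROMOTED_MIGRATION_ADJUST_STEPS */
#define WINDOW_MS       60000ULL        /* KPROMOTED_PROMOTION_THRESHOLD_WINDOW */

static unsigned int adjust(unsigned int th, unsigned int ref_th,
                           unsigned long long diff_cand,
                           unsigned long long ref_cand)
{
        unsigned int unit = ref_th * 2 / ADJUST_STEPS;

        if (diff_cand > ref_cand * 11 / 10)             /* over budget: tighten */
                th = (th - unit > unit) ? th - unit : unit;
        else if (diff_cand < ref_cand * 9 / 10)         /* under budget: relax */
                th = (th + unit < ref_th * 2) ? th + unit : ref_th * 2;
        return th;
}

int main(void)
{
        unsigned long long rate_limit = 65536ULL << (20 - 12);  /* pages/s */
        unsigned long long ref_cand = rate_limit * WINDOW_MS / 1000;
        unsigned int ref_th = 1000, th = ref_th;        /* ms, made-up default */

        th = adjust(th, ref_th, ref_cand * 2, ref_cand);
        printf("over budget  -> threshold drops to %u ms\n", th);
        th = adjust(th, ref_th, ref_cand / 2, ref_cand);
        printf("under budget -> threshold rises to %u ms\n", th);
        return 0;
}

With 16 steps, each adjustment is ref_th / 8, and the threshold stays
between one step and 2 * ref_th.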
Regards,
Bharata.
^ permalink raw reply [flat|nested] 53+ messages in thread* Re: [RFC PATCH v2 8/8] mm: sched: Move hot page promotion from NUMAB=2 to kpromoted
2025-10-06 5:57 ` Bharata B Rao
@ 2025-10-06 9:53 ` Jonathan Cameron
0 siblings, 0 replies; 53+ messages in thread
From: Jonathan Cameron @ 2025-10-06 9:53 UTC (permalink / raw)
To: Bharata B Rao
Cc: linux-kernel, linux-mm, dave.hansen, gourry, hannes, mgorman,
mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc,
willy, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis,
akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu,
balbirs, alok.rathore
On Mon, 6 Oct 2025 11:27:21 +0530
Bharata B Rao <bharata@amd.com> wrote:
> On 03-Oct-25 6:08 PM, Jonathan Cameron wrote:
> > On Wed, 10 Sep 2025 20:16:53 +0530
> > Bharata B Rao <bharata@amd.com> wrote:
> >
> >> Currently hot page promotion (NUMA_BALANCING_MEMORY_TIERING
> >> mode of NUMA Balancing) does hot page detection (via hint faults),
> >> hot page classification and eventual promotion, all by itself and
> >> sits within the scheduler.
> >>
> >> With the new hot page tracking and promotion mechanism being
> >> available, NUMA Balancing can limit itself to detection of
> >> hot pages (via hint faults) and off-load rest of the
> >> functionality to the common hot page tracking system.
> >>
> >> pghot_record_access(PGHOT_HINT_FAULT) API is used to feed the
> >> hot page info. In addition, the migration rate limiting and
> >> dynamic threshold logic are moved to kpromoted so that the same
> >> can be used for hot pages reported by other sources too.
> >>
> >> Signed-off-by: Bharata B Rao <bharata@amd.com>
> >
> > Making a direct replacement without any fallback to previous method
> > is going to need a lot of data to show there are no important regressions.
> >
> > So bold move if that's the intent!
>
> Firstly I am only moving the existing hot page heuristics that is part of
> NUMAB=2 to kpromoted so that the same can be applied to hot pages being
> identified by other sources. So the hint fault mechanism that is inherent
> to NUMAB=2 still remains.
That makes sense.
>
> In fact, kscand effort started as a potential replacement for the existing
> hot page promotion mechanism by getting rid of hint faults and moving the
> page table scanning out of process context.
Understood, and I'm in favor of that approach, but I am not sure it will be
a fit for all workloads.
>
> In any case, I will start including numbers from the next post.
Great.
> >>
> >> static unsigned int sysctl_pghot_freq_window = KPROMOTED_FREQ_WINDOW;
> >>
> >> +/* Restrict the NUMA promotion throughput (MB/s) for each target node. */
> >> +static unsigned int sysctl_pghot_promote_rate_limit = 65536;
> >
> > If the comment correlates with the value, this is 64 GiB/s? That seems
> > unlikely if I guess possible.
>
> IIUC, the existing logic tries to limit promotion rate to 64 GiB/s by
> limiting the number of candidate pages that are promoted within the
> 1s observation interval.
>
> Are you saying that achieving the rate of 64 GiB/s is not possible
> or unlikely?
Seems rather too high to me, but maybe I just have the wrong mental model
of what we should be moving.
>
> >
> >> +
> >> #ifdef CONFIG_SYSCTL
> >> static const struct ctl_table pghot_sysctls[] = {
> >> {
> >> @@ -44,8 +50,17 @@ static const struct ctl_table pghot_sysctls[] = {
> >> .proc_handler = proc_dointvec_minmax,
> >> .extra1 = SYSCTL_ZERO,
> >> },
> >> + {
> >> + .procname = "pghot_promote_rate_limit_MBps",
> >> + .data = &sysctl_pghot_promote_rate_limit,
> >> + .maxlen = sizeof(unsigned int),
> >> + .mode = 0644,
> >> + .proc_handler = proc_dointvec_minmax,
> >> + .extra1 = SYSCTL_ZERO,
> >> + },
> >> };
> >> #endif
> >> +
> > Put that in earlier patch to reduce noise here.
>
> This patch moves the hot page heuristics to kpromoted and hence this
> related sysctl is also being moved in this patch.
I just mean the blank line, not the block above.
This is just a patch-set tidying-up comment.
Jonathan
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-09-10 14:46 [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure Bharata B Rao
` (7 preceding siblings ...)
2025-09-10 14:46 ` [RFC PATCH v2 8/8] mm: sched: Move hot page promotion from NUMAB=2 to kpromoted Bharata B Rao
@ 2025-09-10 15:39 ` Matthew Wilcox
2025-09-10 16:01 ` Gregory Price
8 siblings, 1 reply; 53+ messages in thread
From: Matthew Wilcox @ 2025-09-10 15:39 UTC (permalink / raw)
To: Bharata B Rao
Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, gourry,
hannes, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes,
sj, weixugc, ying.huang, ziy, dave, nifan.cxl, xuezhengchu,
yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy,
yuanchu, balbirs, alok.rathore
On Wed, Sep 10, 2025 at 08:16:45PM +0530, Bharata B Rao wrote:
> This patchset introduces a new subsystem for hot page tracking
> and promotion (pghot) that consolidates memory access information
> from various sources and enables centralized promotion of hot
> pages across memory tiers.
Just to be clear, I continue to believe this is a terrible idea and we
should not do this. If systems will be built with CXL (and given the
horrendous performance, I cannot see why they would be), the kernel
should not be migrating memory around like this.
^ permalink raw reply [flat|nested] 53+ messages in thread* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-09-10 15:39 ` [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure Matthew Wilcox
@ 2025-09-10 16:01 ` Gregory Price
2025-09-16 19:45 ` David Rientjes
0 siblings, 1 reply; 53+ messages in thread
From: Gregory Price @ 2025-09-10 16:01 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Bharata B Rao, linux-kernel, linux-mm, Jonathan.Cameron,
dave.hansen, hannes, mgorman, mingo, peterz, raghavendra.kt,
riel, rientjes, sj, weixugc, ying.huang, ziy, dave, nifan.cxl,
xuezhengchu, yiannis, akpm, david, byungchul, kinseyho,
joshua.hahnjy, yuanchu, balbirs, alok.rathore
On Wed, Sep 10, 2025 at 04:39:16PM +0100, Matthew Wilcox wrote:
> On Wed, Sep 10, 2025 at 08:16:45PM +0530, Bharata B Rao wrote:
> > This patchset introduces a new subsystem for hot page tracking
> > and promotion (pghot) that consolidates memory access information
> > from various sources and enables centralized promotion of hot
> > pages across memory tiers.
>
> Just to be clear, I continue to believe this is a terrible idea and we
> should not do this. If systems will be built with CXL (and given the
> horrendous performance, I cannot see why they would be), the kernel
> should not be migrating memory around like this.
I've been considering this problem from the opposite approach since LSFMM.
Rather than decide how to move stuff around, what if instead we just
decide not to ever put certain classes of memory on CXL. Right now, so
long as CXL is in the page allocator, it's the wild west - any page can
end up anywhere.
I have enough data now from ZONE_MOVABLE-only CXL deployments on real
workloads to show local CXL expansion is valuable and performant enough
to be worth deploying - but the key piece for me is that ZONE_MOVABLE
disallows GFP_KERNEL. For example: this keeps SLAB meta-data out of
CXL, but allows any given user-driven page allocation (including page
cache, file, and anon mappings) to land there.
I'm hoping to share some of this data in the coming months.
I've yet to see any strong indication that a complex hotness/movement
system is warranted (yet) - but that may simply be because we have
local cards with no switching involved. So far LRU-based promotion and
demotion has been sufficient.
It seems the closer to random-access the access pattern, the less
valuable ANY movement is. Which should be intuitive. But, having
CXL beats touching disk every day of the week.
So I've become conflicted on this work - but only because I haven't seen
the data to suggest such complexity is warranted.
~Gregory
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-09-10 16:01 ` Gregory Price
@ 2025-09-16 19:45 ` David Rientjes
2025-09-16 22:02 ` Gregory Price
` (2 more replies)
0 siblings, 3 replies; 53+ messages in thread
From: David Rientjes @ 2025-09-16 19:45 UTC (permalink / raw)
To: Gregory Price
Cc: Matthew Wilcox, Bharata B Rao, linux-kernel, linux-mm,
Jonathan.Cameron, dave.hansen, hannes, mgorman, mingo, peterz,
raghavendra.kt, riel, sj, weixugc, ying.huang, ziy, dave,
nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul,
kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore
On Wed, 10 Sep 2025, Gregory Price wrote:
> On Wed, Sep 10, 2025 at 04:39:16PM +0100, Matthew Wilcox wrote:
> > On Wed, Sep 10, 2025 at 08:16:45PM +0530, Bharata B Rao wrote:
> > > This patchset introduces a new subsystem for hot page tracking
> > > and promotion (pghot) that consolidates memory access information
> > > from various sources and enables centralized promotion of hot
> > > pages across memory tiers.
> >
> > Just to be clear, I continue to believe this is a terrible idea and we
> > should not do this. If systems will be built with CXL (and given the
> > horrendous performance, I cannot see why they would be), the kernel
> > should not be migrating memory around like this.
>
> I've been considered this problem from the opposite approach since LSFMM.
>
> Rather than decide how to move stuff around, what if instead we just
> decide not to ever put certain classes of memory on CXL. Right now, so
> long as CXL is in the page allocator, it's the wild west - any page can
> end up anywhere.
>
> I have enough data now from ZONE_MOVABLE-only CXL deployments on real
> workloads to show local CXL expansion is valuable and performant enough
> to be worth deploying - but the key piece for me is that ZONE_MOVABLE
> disallows GFP_KERNEL. For example: this keeps SLAB meta-data out of
> CXL, but allows any given user-driven page allocation (including page
> cache, file, and anon mappings) to land there.
>
This is similar to our use case, although the direct allocation can be
controlled by cpusets or mempolicies as needed depending on the memory
access latency required for the workload; nothing new there, though: it's
the same argument as NUMA in general, and the abstraction of these far
memory nodes as separate NUMA nodes makes this very straightforward.
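
(For concreteness, this is the sort of direct placement control I mean;
a minimal userspace sketch using set_mempolicy() from libnuma's numaif.h,
with node 0 standing in for the near-memory node and built with -lnuma.
Error handling is trimmed and the node number is obviously
deployment-specific.)

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
        unsigned long nodemask = 1UL << 0;      /* node 0: assumed near/DRAM node */
        char *buf;

        /* Bind all future allocations of this task to node 0 only. */
        if (set_mempolicy(MPOL_BIND, &nodemask, 8 * sizeof(nodemask)))
                perror("set_mempolicy");

        buf = malloc(64UL << 20);
        if (!buf)
                return 1;
        memset(buf, 0, 64UL << 20);             /* faults land on node 0 */
        free(buf);
        return 0;
}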
> I'm hoping to share some of this data in the coming months.
>
> I've yet to see any strong indication that a complex hotness/movement
> system is warranted (yet) - but that may simply be because we have
> local cards with no switching involved. So far LRU-based promotion and
> demotion has been sufficient.
>
To me, this is a key point. As we've discussed in meetings, we're in the
early days here. The CHMU does provide a lot of flexibility, both to
create very good and very bad hotness trackers. But I think the key point
is that we have multiple sources of hotness information depending on the
platform and some of these sources only make sense for the kernel (or a
BPF offload) to maintain as the source of truth. Some of these sources
will be clear-on-read, so only one entity can act as the source of truth
for page hotness.
I've been pretty focused on the promotion story here rather than demotion
because of how responsive it needs to be. Harvesting the page table
accessed bits or waiting on a sliding window through NUMA Balancing (even
NUMAB=2) is not as responsive as needed for very fast promotion to top
tier memory, hence things like the CHMU (or PEBS or IBS etc).
A few things that I think we need to discuss and align on:
- the kernel as the source of truth for all memory hotness information,
which can then be abstracted and used for multiple downstream purposes,
memory tiering only being one of them
- the long-term plan for NUMAB=2 and memory tiering support in the kernel
in general, are we planning on supporting this through NUMA hint faults
forever despite their drawbacks (too slow, too much overhead for KVM)
- the role of the kernel vs userspace in driving the memory migration;
there has been lots of discussion on hardware assists that can be
leveraged for memory migration, but today the balancing is driven in
process context. The kthread as the driver of migration is not yet a
fully sold argument, but it is where a number of companies are currently
looking
There's also some feature support possible with the CXL memory
expansion devices that have started to pop up in labs, which can also
drastically reduce overall TCO. Perhaps Wei Xu, cc'd, will be able to
chime in as well.
This topic seems due for an alignment session as well, so I will look to get
that scheduled in the coming weeks if people are up for it.
> It seems the closer to random-access the access pattern, the less
> valuable ANY movement is. Which should be intuitive. But, having
> CXL beats touching disk every day of the week.
>
> So I've become conflicted on this work - but only because I haven't seen
> the data to suggest such complexity is warranted.
>
> ~Gregory
>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-09-16 19:45 ` David Rientjes
@ 2025-09-16 22:02 ` Gregory Price
2025-09-17 0:30 ` Wei Xu
2025-10-08 17:59 ` Vinicius Petrucci
2 siblings, 0 replies; 53+ messages in thread
From: Gregory Price @ 2025-09-16 22:02 UTC (permalink / raw)
To: David Rientjes
Cc: Matthew Wilcox, Bharata B Rao, linux-kernel, linux-mm,
Jonathan.Cameron, dave.hansen, hannes, mgorman, mingo, peterz,
raghavendra.kt, riel, sj, weixugc, ying.huang, ziy, dave,
nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul,
kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore
On Tue, Sep 16, 2025 at 12:45:52PM -0700, David Rientjes wrote:
> > I'm hoping to share some of this data in the coming months.
> >
> > I've yet to see any strong indication that a complex hotness/movement
> > system is warranted (yet) - but that may simply be because we have
> > local cards with no switching involved. So far LRU-based promotion and
> > demotion has been sufficient.
> >
...
>
> I've been pretty focused on the promotion story here rather than demotion
> because of how responsive it needs to be. Harvesting the page table
> accessed bits or waiting on a sliding window through NUMA Balancing (even
> NUMAB=2) is not as responsive as needed for very fast promotion to top
> tier memory, hence things like the CHMU (or PEBS or IBS etc).
>
I feel the need to throw out there that we need to set some kind of
baseline for comparison that isn't simply comparing new hotness tracking
stuff against "Doing Nothing".
For example, if we assume MGLRU is the default, we probably want to
compare against some kind of simplistic system that is essentially:
if tier0 has bandwidth room, and
if tier1 is bandwidth pressured, then
promote a chunk from tier1 youngest generation LRU
("hottest") and demote a chunk from tier0 older LRU
("coldest") [if there's no space available].
Active bandwidth utilization numbers are still a little hard to come
by, but a system like the one above (roughly sketched below) could be
implemented largely in userland with a few tweaks to reclaim.
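
Very roughly, and purely as a strawman (every helper below is a made-up
stand-in, not an existing kernel or userspace interface), one pass of the
baseline I have in mind looks like:

#include <stdbool.h>
#include <stdio.h>

#define CHUNK_PAGES 512         /* arbitrary batch size for the sketch */

/* Stubs standing in for whatever bandwidth/occupancy signals are available. */
static bool tier0_has_bw_headroom(void) { return true; }
static bool tier1_is_bw_pressured(void) { return true; }
static bool tier0_has_free_space(void)  { return false; }

static void promote_from_tier1_youngest(int nr)
{
        printf("promote %d pages from tier1 youngest generation\n", nr);
}

static void demote_from_tier0_oldest(int nr)
{
        printf("demote %d pages from tier0 oldest generation\n", nr);
}

int main(void)
{
        if (tier0_has_bw_headroom() && tier1_is_bw_pressured()) {
                if (!tier0_has_free_space())
                        demote_from_tier0_oldest(CHUNK_PAGES);
                promote_from_tier1_youngest(CHUNK_PAGES);
        }
        return 0;
}

Anything fancier should have to beat something like that.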
~Gregory
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-09-16 19:45 ` David Rientjes
2025-09-16 22:02 ` Gregory Price
@ 2025-09-17 0:30 ` Wei Xu
2025-09-17 3:20 ` Balbir Singh
2025-09-17 16:49 ` Jonathan Cameron
2025-10-08 17:59 ` Vinicius Petrucci
2 siblings, 2 replies; 53+ messages in thread
From: Wei Xu @ 2025-09-17 0:30 UTC (permalink / raw)
To: David Rientjes
Cc: Gregory Price, Matthew Wilcox, Bharata B Rao, linux-kernel,
linux-mm, Jonathan.Cameron, dave.hansen, hannes, mgorman, mingo,
peterz, raghavendra.kt, riel, sj, ying.huang, ziy, dave,
nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul,
kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore
On Tue, Sep 16, 2025 at 12:45 PM David Rientjes <rientjes@google.com> wrote:
>
> On Wed, 10 Sep 2025, Gregory Price wrote:
>
> > On Wed, Sep 10, 2025 at 04:39:16PM +0100, Matthew Wilcox wrote:
> > > On Wed, Sep 10, 2025 at 08:16:45PM +0530, Bharata B Rao wrote:
> > > > This patchset introduces a new subsystem for hot page tracking
> > > > and promotion (pghot) that consolidates memory access information
> > > > from various sources and enables centralized promotion of hot
> > > > pages across memory tiers.
> > >
> > > Just to be clear, I continue to believe this is a terrible idea and we
> > > should not do this. If systems will be built with CXL (and given the
> > > horrendous performance, I cannot see why they would be), the kernel
> > > should not be migrating memory around like this.
> >
> > I've been considered this problem from the opposite approach since LSFMM.
> >
> > Rather than decide how to move stuff around, what if instead we just
> > decide not to ever put certain classes of memory on CXL. Right now, so
> > long as CXL is in the page allocator, it's the wild west - any page can
> > end up anywhere.
> >
> > I have enough data now from ZONE_MOVABLE-only CXL deployments on real
> > workloads to show local CXL expansion is valuable and performant enough
> > to be worth deploying - but the key piece for me is that ZONE_MOVABLE
> > disallows GFP_KERNEL. For example: this keeps SLAB meta-data out of
> > CXL, but allows any given user-driven page allocation (including page
> > cache, file, and anon mappings) to land there.
> >
>
> This is similar to our use case, although the direct allocation can be
> controlled by cpusets or mempolicies as needed depending on the memory
> access latency required for the workload; nothing new there, though, it's
> the same argument as NUMA in general and the abstraction of these far
> memory nodes as separate NUMA nodes makes this very straightforward.
>
> > I'm hoping to share some of this data in the coming months.
> >
> > I've yet to see any strong indication that a complex hotness/movement
> > system is warranted (yet) - but that may simply be because we have
> > local cards with no switching involved. So far LRU-based promotion and
> > demotion has been sufficient.
> >
>
> To me, this is a key point. As we've discussed in meetings, we're in the
> early days here. The CHMU does provide a lot of flexibility, both to
> create very good and very bad hotness trackers. But I think the key point
> is that we have multiple sources of hotness information depending on the
> platform and some of these sources only make sense for the kernel (or a
> BPF offload) to maintain as the source of truth. Some of these sources
> will be clear-on-read so only one entity would be possible to have as the
> source of truth of page hotness.
>
> I've been pretty focused on the promotion story here rather than demotion
> because of how responsive it needs to be. Harvesting the page table
> accessed bits or waiting on a sliding window through NUMA Balancing (even
> NUMAB=2) is not as responsive as needed for very fast promotion to top
> tier memory, hence things like the CHMU (or PEBS or IBS etc).
>
> A few things that I think we need to discuss and align on:
>
> - the kernel as the source of truth for all memory hotness information,
> which can then be abstracted and used for multiple downstream purposes,
> memory tiering only being one of them
>
> - the long-term plan for NUMAB=2 and memory tiering support in the kernel
> in general, are we planning on supporting this through NUMA hint faults
> forever despite their drawbacks (too slow, too much overhead for KVM)
>
> - the role of the kernel vs userspace in driving the memory migration;
> lots of discussion on hardware assists that can be leveraged for memory
> migration but today the balancing is driven in process context. The
> kthread as the driver of migration is yet to be a sold argument, but
> are where a number of companies are currently looking
>
> There's also some feature support that is possible with these CXL memory
> expansion devices that have started to pop up in labs that can also
> drastically reduce overall TCO. Perhaps Wei Xu, cc'd, will be able to
> chime in as well.
>
> This topic seems due for an alignment session as well, so will look to get
> that scheduled in the coming weeks if people are up for it.
Our experience is that workloads in hyperscale data centers such as
Google often have significant cold memory. Offloading this to CXL memory
devices, backed by cheaper, lower-performance media (e.g. DRAM with
hardware compression), can be a practical approach to reduce overall
TCO. Page promotion and demotion are then critical for such a tiered
memory system.
A kernel thread to drive hot page collection and promotion seems
logical, especially since hot page data from new sources (e.g. CHMU)
are collected outside the process execution context and in the form of
physical addresses.
I do agree that we need to balance the complexity and benefits of any
new data structures for hotness tracking.
> > It seems the closer to random-access the access pattern, the less
> > valuable ANY movement is. Which should be intuitive. But, having
> > CXL beats touching disk every day of the week.
> >
> > So I've become conflicted on this work - but only because I haven't seen
> > the data to suggest such complexity is warranted.
> >
> > ~Gregory
> >
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-09-17 0:30 ` Wei Xu
@ 2025-09-17 3:20 ` Balbir Singh
2025-09-17 4:15 ` Bharata B Rao
2025-09-17 16:49 ` Jonathan Cameron
1 sibling, 1 reply; 53+ messages in thread
From: Balbir Singh @ 2025-09-17 3:20 UTC (permalink / raw)
To: Wei Xu, David Rientjes, Bharata B Rao
Cc: Gregory Price, Matthew Wilcox, linux-kernel, linux-mm,
Jonathan.Cameron, dave.hansen, hannes, mgorman, mingo, peterz,
raghavendra.kt, riel, sj, ying.huang, ziy, dave, nifan.cxl,
xuezhengchu, yiannis, akpm, david, byungchul, kinseyho,
joshua.hahnjy, yuanchu, alok.rathore
On 9/17/25 10:30, Wei Xu wrote:
> On Tue, Sep 16, 2025 at 12:45 PM David Rientjes <rientjes@google.com> wrote:
>>
>> On Wed, 10 Sep 2025, Gregory Price wrote:
>>
>>> On Wed, Sep 10, 2025 at 04:39:16PM +0100, Matthew Wilcox wrote:
>>>> On Wed, Sep 10, 2025 at 08:16:45PM +0530, Bharata B Rao wrote:
>>>>> This patchset introduces a new subsystem for hot page tracking
>>>>> and promotion (pghot) that consolidates memory access information
>>>>> from various sources and enables centralized promotion of hot
>>>>> pages across memory tiers.
>>>>
>>>> Just to be clear, I continue to believe this is a terrible idea and we
>>>> should not do this. If systems will be built with CXL (and given the
>>>> horrendous performance, I cannot see why they would be), the kernel
>>>> should not be migrating memory around like this.
>>>
>>> I've been considered this problem from the opposite approach since LSFMM.
>>>
>>> Rather than decide how to move stuff around, what if instead we just
>>> decide not to ever put certain classes of memory on CXL. Right now, so
>>> long as CXL is in the page allocator, it's the wild west - any page can
>>> end up anywhere.
>>>
>>> I have enough data now from ZONE_MOVABLE-only CXL deployments on real
>>> workloads to show local CXL expansion is valuable and performant enough
>>> to be worth deploying - but the key piece for me is that ZONE_MOVABLE
>>> disallows GFP_KERNEL. For example: this keeps SLAB meta-data out of
>>> CXL, but allows any given user-driven page allocation (including page
>>> cache, file, and anon mappings) to land there.
>>>
>>
>> This is similar to our use case, although the direct allocation can be
>> controlled by cpusets or mempolicies as needed depending on the memory
>> access latency required for the workload; nothing new there, though, it's
>> the same argument as NUMA in general and the abstraction of these far
>> memory nodes as separate NUMA nodes makes this very straightforward.
>>
>>> I'm hoping to share some of this data in the coming months.
>>>
>>> I've yet to see any strong indication that a complex hotness/movement
>>> system is warranted (yet) - but that may simply be because we have
>>> local cards with no switching involved. So far LRU-based promotion and
>>> demotion has been sufficient.
>>>
>>
>> To me, this is a key point. As we've discussed in meetings, we're in the
>> early days here. The CHMU does provide a lot of flexibility, both to
>> create very good and very bad hotness trackers. But I think the key point
>> is that we have multiple sources of hotness information depending on the
>> platform and some of these sources only make sense for the kernel (or a
>> BPF offload) to maintain as the source of truth. Some of these sources
>> will be clear-on-read so only one entity would be possible to have as the
>> source of truth of page hotness.
>>
>> I've been pretty focused on the promotion story here rather than demotion
>> because of how responsive it needs to be. Harvesting the page table
>> accessed bits or waiting on a sliding window through NUMA Balancing (even
>> NUMAB=2) is not as responsive as needed for very fast promotion to top
>> tier memory, hence things like the CHMU (or PEBS or IBS etc).
>>
>> A few things that I think we need to discuss and align on:
>>
>> - the kernel as the source of truth for all memory hotness information,
>> which can then be abstracted and used for multiple downstream purposes,
>> memory tiering only being one of them
>>
>> - the long-term plan for NUMAB=2 and memory tiering support in the kernel
>> in general, are we planning on supporting this through NUMA hint faults
>> forever despite their drawbacks (too slow, too much overhead for KVM)
>>
>> - the role of the kernel vs userspace in driving the memory migration;
>> lots of discussion on hardware assists that can be leveraged for memory
>> migration but today the balancing is driven in process context. The
>> kthread as the driver of migration is yet to be a sold argument, but
>> are where a number of companies are currently looking
>>
>> There's also some feature support that is possible with these CXL memory
>> expansion devices that have started to pop up in labs that can also
>> drastically reduce overall TCO. Perhaps Wei Xu, cc'd, will be able to
>> chime in as well.
>>
>> This topic seems due for an alignment session as well, so will look to get
>> that scheduled in the coming weeks if people are up for it.
>
> Our experience is that workloads in hyper-scalar data centers such as
> Google often have significant cold memory. Offloading this to CXL memory
> devices, backed by cheaper, lower-performance media (e.g. DRAM with
> hardware compression), can be a practical approach to reduce overall
> TCO. Page promotion and demotion are then critical for such a tiered
> memory system.
>
> A kernel thread to drive hot page collection and promotion seems
> logical, especially since hot page data from new sources (e.g. CHMU)
> are collected outside the process execution context and in the form of
> physical addresses.
>
> I do agree that we need to balance the complexity and benefits of any
> new data structures for hotness tracking.
I think there is a mismatch between the tiering structure and
the patches. If you look at the example in the memory tiering code:
/*
* ...
* Example 3:
*
* Node 0 is CPU + DRAM nodes, Node 1 is HBM node, node 2 is PMEM node.
*
* node distances:
* node 0 1 2
* 0 10 20 30
* 1 20 10 40
* 2 30 40 10
*
* memory_tiers0 = 1
* memory_tiers1 = 0
* memory_tiers2 = 2
*..
*/
The topmost tier need not be DRAM. Patch 3 states:
"
[..]
* kpromoted is a kernel thread that runs on each toptier node and
* promotes pages from max_heap.
"
Also, there is no data in the cover letter to indicate what workloads benefit from
migration to top-tier and by how much?
Balbir
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-09-17 3:20 ` Balbir Singh
@ 2025-09-17 4:15 ` Bharata B Rao
0 siblings, 0 replies; 53+ messages in thread
From: Bharata B Rao @ 2025-09-17 4:15 UTC (permalink / raw)
To: Balbir Singh, Wei Xu, David Rientjes
Cc: Gregory Price, Matthew Wilcox, linux-kernel, linux-mm,
Jonathan.Cameron, dave.hansen, hannes, mgorman, mingo, peterz,
raghavendra.kt, riel, sj, ying.huang, ziy, dave, nifan.cxl,
xuezhengchu, yiannis, akpm, david, byungchul, kinseyho,
joshua.hahnjy, yuanchu, alok.rathore
On 17-Sep-25 8:50 AM, Balbir Singh wrote:
> On 9/17/25 10:30, Wei Xu wrote:
>> On Tue, Sep 16, 2025 at 12:45 PM David Rientjes <rientjes@google.com> wrote:
>>>
>>> On Wed, 10 Sep 2025, Gregory Price wrote:
>>>
>>>> On Wed, Sep 10, 2025 at 04:39:16PM +0100, Matthew Wilcox wrote:
>>>>> On Wed, Sep 10, 2025 at 08:16:45PM +0530, Bharata B Rao wrote:
>>>>>> This patchset introduces a new subsystem for hot page tracking
>>>>>> and promotion (pghot) that consolidates memory access information
>>>>>> from various sources and enables centralized promotion of hot
>>>>>> pages across memory tiers.
>>>>>
>>>>> Just to be clear, I continue to believe this is a terrible idea and we
>>>>> should not do this. If systems will be built with CXL (and given the
>>>>> horrendous performance, I cannot see why they would be), the kernel
>>>>> should not be migrating memory around like this.
>>>>
>>>> I've been considered this problem from the opposite approach since LSFMM.
>>>>
>>>> Rather than decide how to move stuff around, what if instead we just
>>>> decide not to ever put certain classes of memory on CXL. Right now, so
>>>> long as CXL is in the page allocator, it's the wild west - any page can
>>>> end up anywhere.
>>>>
>>>> I have enough data now from ZONE_MOVABLE-only CXL deployments on real
>>>> workloads to show local CXL expansion is valuable and performant enough
>>>> to be worth deploying - but the key piece for me is that ZONE_MOVABLE
>>>> disallows GFP_KERNEL. For example: this keeps SLAB meta-data out of
>>>> CXL, but allows any given user-driven page allocation (including page
>>>> cache, file, and anon mappings) to land there.
>>>>
>>>
>>> This is similar to our use case, although the direct allocation can be
>>> controlled by cpusets or mempolicies as needed depending on the memory
>>> access latency required for the workload; nothing new there, though, it's
>>> the same argument as NUMA in general and the abstraction of these far
>>> memory nodes as separate NUMA nodes makes this very straightforward.
>>>
>>>> I'm hoping to share some of this data in the coming months.
>>>>
>>>> I've yet to see any strong indication that a complex hotness/movement
>>>> system is warranted (yet) - but that may simply be because we have
>>>> local cards with no switching involved. So far LRU-based promotion and
>>>> demotion has been sufficient.
>>>>
>>>
>>> To me, this is a key point. As we've discussed in meetings, we're in the
>>> early days here. The CHMU does provide a lot of flexibility, both to
>>> create very good and very bad hotness trackers. But I think the key point
>>> is that we have multiple sources of hotness information depending on the
>>> platform and some of these sources only make sense for the kernel (or a
>>> BPF offload) to maintain as the source of truth. Some of these sources
>>> will be clear-on-read so only one entity would be possible to have as the
>>> source of truth of page hotness.
>>>
>>> I've been pretty focused on the promotion story here rather than demotion
>>> because of how responsive it needs to be. Harvesting the page table
>>> accessed bits or waiting on a sliding window through NUMA Balancing (even
>>> NUMAB=2) is not as responsive as needed for very fast promotion to top
>>> tier memory, hence things like the CHMU (or PEBS or IBS etc).
>>>
>>> A few things that I think we need to discuss and align on:
>>>
>>> - the kernel as the source of truth for all memory hotness information,
>>> which can then be abstracted and used for multiple downstream purposes,
>>> memory tiering only being one of them
>>>
>>> - the long-term plan for NUMAB=2 and memory tiering support in the kernel
>>> in general, are we planning on supporting this through NUMA hint faults
>>> forever despite their drawbacks (too slow, too much overhead for KVM)
>>>
>>> - the role of the kernel vs userspace in driving the memory migration;
>>> lots of discussion on hardware assists that can be leveraged for memory
>>> migration but today the balancing is driven in process context. The
>>> kthread as the driver of migration is yet to be a sold argument, but
>>> are where a number of companies are currently looking
>>>
>>> There's also some feature support that is possible with these CXL memory
>>> expansion devices that have started to pop up in labs that can also
>>> drastically reduce overall TCO. Perhaps Wei Xu, cc'd, will be able to
>>> chime in as well.
>>>
>>> This topic seems due for an alignment session as well, so will look to get
>>> that scheduled in the coming weeks if people are up for it.
>>
>> Our experience is that workloads in hyper-scalar data centers such as
>> Google often have significant cold memory. Offloading this to CXL memory
>> devices, backed by cheaper, lower-performance media (e.g. DRAM with
>> hardware compression), can be a practical approach to reduce overall
>> TCO. Page promotion and demotion are then critical for such a tiered
>> memory system.
>>
>> A kernel thread to drive hot page collection and promotion seems
>> logical, especially since hot page data from new sources (e.g. CHMU)
>> are collected outside the process execution context and in the form of
>> physical addresses.
>>
>> I do agree that we need to balance the complexity and benefits of any
>> new data structures for hotness tracking.
>
>
> I think there is a mismatch in the tiering structure and
> the patches. If you see the example in memory tiering
>
> /*
> * ...
> * Example 3:
> *
> * Node 0 is CPU + DRAM nodes, Node 1 is HBM node, node 2 is PMEM node.
> *
> * node distances:
> * node 0 1 2
> * 0 10 20 30
> * 1 20 10 40
> * 2 30 40 10
> *
> * memory_tiers0 = 1
> * memory_tiers1 = 0
> * memory_tiers2 = 2
> *..
> */
>
> The topmost tier need not be DRAM, patch 3 states
>
> "
> [..]
> * kpromoted is a kernel thread that runs on each toptier node and
> * promotes pages from max_heap.
That comment is not accurate; I will reword it next time.
Currently I am using kthread_create_on_node() to create one kernel thread
for each toptier node. I haven't tried this patchset with HBM, but it should
end up creating a kthread for the HBM node too.
However, unlike for regular DRAM nodes, the kthread for an HBM node can't be
bound to any CPU.
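
Roughly what I mean is the following (a simplified sketch, not the actual
series code; the real thread function does the promotion work and the
series additionally restricts the loop to top-tier nodes):

#include <linux/init.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/topology.h>
#include <linux/nodemask.h>
#include <linux/cpumask.h>
#include <linux/jiffies.h>
#include <linux/err.h>

static int kpromoted_fn(void *data)
{
        while (!kthread_should_stop())
                schedule_timeout_interruptible(HZ);     /* placeholder for real work */
        return 0;
}

static void __init kpromoted_start_threads(void)
{
        int nid;

        for_each_node_state(nid, N_MEMORY) {
                struct task_struct *t;

                t = kthread_create_on_node(kpromoted_fn, NULL, nid,
                                           "kpromoted%d", nid);
                if (IS_ERR(t))
                        continue;
                /* CPU-less nodes (e.g. HBM- or CXL-only) have an empty mask,
                 * so binding is best effort only. */
                if (!cpumask_empty(cpumask_of_node(nid)))
                        set_cpus_allowed_ptr(t, cpumask_of_node(nid));
                wake_up_process(t);
        }
}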
>
> Also, there is no data in the cover letter to indicate what workloads benefit from
> migration to top-tier and by how much?
I have been trying to get the tracking infrastructure up and hoping to
get some review on that. I will start including numbers from the next iteration.
Regards,
Bharata.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-09-17 0:30 ` Wei Xu
2025-09-17 3:20 ` Balbir Singh
@ 2025-09-17 16:49 ` Jonathan Cameron
2025-09-25 14:03 ` Yiannis Nikolakopoulos
1 sibling, 1 reply; 53+ messages in thread
From: Jonathan Cameron @ 2025-09-17 16:49 UTC (permalink / raw)
To: Wei Xu
Cc: David Rientjes, Gregory Price, Matthew Wilcox, Bharata B Rao,
linux-kernel, linux-mm, dave.hansen, hannes, mgorman, mingo,
peterz, raghavendra.kt, riel, sj, ying.huang, ziy, dave,
nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul,
kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore
On Tue, 16 Sep 2025 17:30:46 -0700
Wei Xu <weixugc@google.com> wrote:
> On Tue, Sep 16, 2025 at 12:45 PM David Rientjes <rientjes@google.com> wrote:
> >
> > On Wed, 10 Sep 2025, Gregory Price wrote:
> >
> > > On Wed, Sep 10, 2025 at 04:39:16PM +0100, Matthew Wilcox wrote:
> > > > On Wed, Sep 10, 2025 at 08:16:45PM +0530, Bharata B Rao wrote:
> > > > > This patchset introduces a new subsystem for hot page tracking
> > > > > and promotion (pghot) that consolidates memory access information
> > > > > from various sources and enables centralized promotion of hot
> > > > > pages across memory tiers.
> > > >
> > > > Just to be clear, I continue to believe this is a terrible idea and we
> > > > should not do this. If systems will be built with CXL (and given the
> > > > horrendous performance, I cannot see why they would be), the kernel
> > > > should not be migrating memory around like this.
> > >
> > > I've been considering this problem from the opposite approach since LSFMM.
> > >
> > > Rather than decide how to move stuff around, what if instead we just
> > > decide not to ever put certain classes of memory on CXL. Right now, so
> > > long as CXL is in the page allocator, it's the wild west - any page can
> > > end up anywhere.
> > >
> > > I have enough data now from ZONE_MOVABLE-only CXL deployments on real
> > > workloads to show local CXL expansion is valuable and performant enough
> > > to be worth deploying - but the key piece for me is that ZONE_MOVABLE
> > > disallows GFP_KERNEL. For example: this keeps SLAB meta-data out of
> > > CXL, but allows any given user-driven page allocation (including page
> > > cache, file, and anon mappings) to land there.
> > >
> >
> > This is similar to our use case, although the direct allocation can be
> > controlled by cpusets or mempolicies as needed depending on the memory
> > access latency required for the workload; nothing new there, though, it's
> > the same argument as NUMA in general and the abstraction of these far
> > memory nodes as separate NUMA nodes makes this very straightforward.
> >
> > > I'm hoping to share some of this data in the coming months.
> > >
> > > I've yet to see any strong indication that a complex hotness/movement
> > > system is warranted (yet) - but that may simply be because we have
> > > local cards with no switching involved. So far LRU-based promotion and
> > > demotion has been sufficient.
> > >
> >
> > To me, this is a key point. As we've discussed in meetings, we're in the
> > early days here. The CHMU does provide a lot of flexibility, both to
> > create very good and very bad hotness trackers. But I think the key point
> > is that we have multiple sources of hotness information depending on the
> > platform and some of these sources only make sense for the kernel (or a
> > BPF offload) to maintain as the source of truth. Some of these sources
> > will be clear-on-read so only one entity would be possible to have as the
> > source of truth of page hotness.
> >
> > I've been pretty focused on the promotion story here rather than demotion
> > because of how responsive it needs to be. Harvesting the page table
> > accessed bits or waiting on a sliding window through NUMA Balancing (even
> > NUMAB=2) is not as responsive as needed for very fast promotion to top
> > tier memory, hence things like the CHMU (or PEBS or IBS etc).
> >
> > A few things that I think we need to discuss and align on:
> >
> > - the kernel as the source of truth for all memory hotness information,
> > which can then be abstracted and used for multiple downstream purposes,
> > memory tiering only being one of them
> >
> > - the long-term plan for NUMAB=2 and memory tiering support in the kernel
> > in general, are we planning on supporting this through NUMA hint faults
> > forever despite their drawbacks (too slow, too much overhead for KVM)
> >
> > - the role of the kernel vs userspace in driving the memory migration;
> > lots of discussion on hardware assists that can be leveraged for memory
> > migration but today the balancing is driven in process context. The
> > kthread as the driver of migration is not yet a fully sold argument, but
> > it is where a number of companies are currently looking
> >
> > There's also some feature support that is possible with these CXL memory
> > expansion devices that have started to pop up in labs that can also
> > drastically reduce overall TCO. Perhaps Wei Xu, cc'd, will be able to
> > chime in as well.
> >
> > This topic seems due for an alignment session as well, so will look to get
> > that scheduled in the coming weeks if people are up for it.
>
> Our experience is that workloads in hyperscale data centers such as
> Google often have significant cold memory. Offloading this to CXL memory
> devices, backed by cheaper, lower-performance media (e.g. DRAM with
> hardware compression), can be a practical approach to reduce overall
> TCO. Page promotion and demotion are then critical for such a tiered
> memory system.
For the hardware compression devices how are you dealing with capacity variation
/ overcommit? Whilst there have been some discussions on that, without a
backing store of flash or similar it seems to be challenging to use
compressed memory in a tiering system (so as 'normalish' memory) unless you
don't mind occasionally and unexpectedly running out of memory (in nasty
async ways as dirty cache lines get written back).
Or do you mean zswap type use with a hardware offload of the actual
compression?
>
> A kernel thread to drive hot page collection and promotion seems
> logical, especially since hot page data from new sources (e.g. CHMU)
> are collected outside the process execution context and in the form of
> physical addresses.
>
> I do agree that we need to balance the complexity and benefits of any
> new data structures for hotness tracking.
>
> > > It seems the closer to random-access the access pattern, the less
> > > valuable ANY movement is. Which should be intuitive. But, having
> > > CXL beats touching disk every day of the week.
> > >
> > > So I've become conflicted on this work - but only because I haven't seen
> > > the data to suggest such complexity is warranted.
> > >
> > > ~Gregory
> > >
>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-09-17 16:49 ` Jonathan Cameron
@ 2025-09-25 14:03 ` Yiannis Nikolakopoulos
2025-09-25 14:41 ` Gregory Price
2025-09-25 15:00 ` Jonathan Cameron
0 siblings, 2 replies; 53+ messages in thread
From: Yiannis Nikolakopoulos @ 2025-09-25 14:03 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Wei Xu, David Rientjes, Gregory Price, Matthew Wilcox,
Bharata B Rao, linux-kernel, linux-mm, dave.hansen, hannes,
mgorman, mingo, peterz, raghavendra.kt, riel, sj, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, akpm, david, byungchul,
kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, yiannis,
Adam Manzanares
> On 17 Sep 2025, at 18:49, Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
>
> On Tue, 16 Sep 2025 17:30:46 -0700
> Wei Xu <weixugc@google.com> wrote:
>
>> On Tue, Sep 16, 2025 at 12:45 PM David Rientjes <rientjes@google.com> wrote:
>>>
>>> On Wed, 10 Sep 2025, Gregory Price wrote:
>>>
>>>> On Wed, Sep 10, 2025 at 04:39:16PM +0100, Matthew Wilcox wrote:
>>>>> On Wed, Sep 10, 2025 at 08:16:45PM +0530, Bharata B Rao wrote:
>>>>>> This patchset introduces a new subsystem for hot page tracking
>>>>>> and promotion (pghot) that consolidates memory access information
>>>>>> from various sources and enables centralized promotion of hot
>>>>>> pages across memory tiers.
>>>>>
>>>>> Just to be clear, I continue to believe this is a terrible idea and we
>>>>> should not do this. If systems will be built with CXL (and given the
>>>>> horrendous performance, I cannot see why they would be), the kernel
>>>>> should not be migrating memory around like this.
>>>>
>>>> I've been considering this problem from the opposite approach since LSFMM.
>>>>
>>>> Rather than decide how to move stuff around, what if instead we just
>>>> decide not to ever put certain classes of memory on CXL. Right now, so
>>>> long as CXL is in the page allocator, it's the wild west - any page can
>>>> end up anywhere.
>>>>
>>>> I have enough data now from ZONE_MOVABLE-only CXL deployments on real
>>>> workloads to show local CXL expansion is valuable and performant enough
>>>> to be worth deploying - but the key piece for me is that ZONE_MOVABLE
>>>> disallows GFP_KERNEL. For example: this keeps SLAB meta-data out of
>>>> CXL, but allows any given user-driven page allocation (including page
>>>> cache, file, and anon mappings) to land there.
>>>>
>>>
[snip]
>>> There's also some feature support that is possible with these CXL memory
>>> expansion devices that have started to pop up in labs that can also
>>> drastically reduce overall TCO. Perhaps Wei Xu, cc'd, will be able to
>>> chime in as well.
>>>
>>> This topic seems due for an alignment session as well, so will look to get
>>> that scheduled in the coming weeks if people are up for it.
>>
>> Our experience is that workloads in hyperscale data centers such as
>> Google often have significant cold memory. Offloading this to CXL memory
>> devices, backed by cheaper, lower-performance media (e.g. DRAM with
>> hardware compression), can be a practical approach to reduce overall
>> TCO. Page promotion and demotion are then critical for such a tiered
>> memory system.
>
> For the hardware compression devices how are you dealing with capacity variation
> / overcommit?
I understand that this is indeed one of the key questions from the upstream
kernel’s perspective.
So, I am jumping in to answer w.r.t. what we do in ZeroPoint; obviously I
cannot speak for other solutions/deployments. However, our HW interface follows
existing open specifications from OCP [1], so what I am describing below is
more widely applicable.
At a very high level, the way our HW works is that the DPA is indeed
overcommitted. Then, there is a control plane over CXL.io (PCIe) which
exposes the real remaining capacity, as well as some configurable
MSI-X interrupts that raise warnings when the capacity crosses over
certain configurable thresholds.
Last year I presented this interface in LSF/MM [2]. Based on the feedback I
got there, we have an early prototype that acts as the *last* memory tier
before reclaim (kind of "compressed tier in lieu of discard" as was
suggested to me by Dan).
What is different from standard tiering is that the control plane is
checked on demotion to make sure there is still capacity left. If not, the
demotion fails. While this seems stable so far, a missing piece is to
ensure that this tier is mainly written by demotions and not arbitrary kernel
allocations (at least as a starting point). I want to explore how mempolicies
can help there, or something of the sort that Gregory described.
This early prototype still needs quite some work in order to find the right
abstractions. Hopefully, I will be able to push an RFC in the near future
(a couple of months).
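To make the demotion-time check concrete, its shape is roughly the following
(a sketch only; compressed_dev_of_node() and zxp_capacity_available() are
made-up stand-ins for the driver-side control-plane query, not real
interfaces):

/* Consulted by the demotion path before migrating a folio to the
 * compressed node; if the device reports insufficient real capacity,
 * the demotion simply fails and the folio stays where it is. */
static bool compressed_node_can_accept(int nid, struct folio *folio)
{
	struct zxp_device *dev = compressed_dev_of_node(nid);	/* hypothetical */

	if (!dev)
		return true;	/* not a compressed node, nothing to check */

	return zxp_capacity_available(dev) >= folio_size(folio);
}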
> Whilst there have been some discussions on that but without a
> backing store of flash or similar it seems to be challenging to use
> compressed memory in a tiering system (so as 'normalish' memory) unless you
> don't mind occasionally and unexpectedly running out of memory (in nasty
> async ways as dirty cache lines get written back).
There are several things that may be done on the device side. For now, I
think the kernel should be unaware of these. But with what I described
above, the goal is to have the capacity thresholds configured in a way
that we can absorb the occasional dirty cache lines that are written back.
>
> Or do you mean zswap type use with a hardware offload of the actual
> compression?
I would categorize this as a completely different discussion (and product
line for us).
[1] https://www.opencompute.org/documents/hyperscale-tiered-memory-expander-specification-for-compute-express-link-cxl-1-pdf
[2] https://www.youtube.com/watch?v=tXWEbaJmZ_s
Thanks,
Yiannis
PS: Sending from a personal email address to avoid issues with
confidentiality footers of the corporate domain.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-09-25 14:03 ` Yiannis Nikolakopoulos
@ 2025-09-25 14:41 ` Gregory Price
2025-10-16 11:48 ` Yiannis Nikolakopoulos
2025-09-25 15:00 ` Jonathan Cameron
1 sibling, 1 reply; 53+ messages in thread
From: Gregory Price @ 2025-09-25 14:41 UTC (permalink / raw)
To: Yiannis Nikolakopoulos
Cc: Jonathan Cameron, Wei Xu, David Rientjes, Matthew Wilcox,
Bharata B Rao, linux-kernel, linux-mm, dave.hansen, hannes,
mgorman, mingo, peterz, raghavendra.kt, riel, sj, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, akpm, david, byungchul,
kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, yiannis,
Adam Manzanares
On Thu, Sep 25, 2025 at 04:03:46PM +0200, Yiannis Nikolakopoulos wrote:
> >
> > For the hardware compression devices how are you dealing with capacity variation
> > / overcommit?
...
> What is different from standard tiering is that the control plane is
> checked on demotion to make sure there is still capacity left. If not, the
> demotion fails. While this seems stable so far, a missing piece is to
> ensure that this tier is mainly written by demotions and not arbitrary kernel
> allocations (at least as a starting point). I want to explore how mempolicies
> can help there, or something of the sort that Gregory described.
>
Writing back the description as i understand it:
1) The intent is to only have this memory allocable via demotion
(i.e. no fault or direct allocation from userland possible)
2) The intent is to still have this memory accessible directly (DMA),
while compressed, not trigger a fault/promotion on access
(i.e. no zswap faults)
3) The intent is to have an external monitoring software handle
outrunning run-away decompression/hotness by promoting that data.
So basically we want a zswap-like interface for allocation, but to
retain the `struct page` in page tables such that no faults are incurred
on access. Then if the page becomes hot, depend on some kind of HMU
tiering system to get it off the device.
I think we all understand there's some bear we have to outrun to deal
with problem #3 - and many of us are skeptical that the bear won't catch
up with our pants down. Let's ignore this for the moment.
If such a device's memory is added to the default page allocator, then
the question becomes one of *isolation* - such that the kernel will
provide some "Capital-G Guarantee" that certain NUMA nodes will NEVER
be used except under very explicit scenarios.
There are only 3 mechanisms with which to restrict this (presently):
1) ZONE membership (to disallow GFP_KERNEL)
2) cgroups->cpusets->mems_allowed
3) task/vma mempolicy
(obvious #4: Don't put it in the default page allocator)
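As a side note on mechanism 1, the property being relied on is just the
gfp-to-zone mapping; a minimal illustration (not from any patch):

/* GFP_KERNEL never carries __GFP_MOVABLE, so gfp_zone() never returns
 * ZONE_MOVABLE for it - which is what keeps slab metadata and other
 * kernel allocations off a movable-only (e.g. CXL) node.  Only
 * allocations like GFP_HIGHUSER_MOVABLE pass this test. */
static bool eligible_for_movable_only_node(gfp_t gfp)
{
	return gfp_zone(gfp) == ZONE_MOVABLE;
}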
cpusets and mempolicy are not sufficient to provide full isolation
- cgroups have the opposite hierarchical relationship than desired.
The parent cgroup will lock out all children cgroups from using nodes
not present in the parent mems_allowed. e.g. if you lock out access
from the root cgroup, no cgroup on the entire system is eligible to
allocate the memory. If you don't lock out the root cgroup - any root
cgroup task is eligible. This isn't tractable.
- task/vma mempolicy gets ignored in many cases and is closer to a
suggestion than enforceable. It's also subject to rebinding as a
task's cgroups.cpuset.mems_allowed changes.
I haven't read up enough on ZONE_DEVICE to understand the implications
of membership there, but have you explored this as an option? I don't
see the work i'm doing intersecting well with your efforts - except
maybe on the vmscan.c work around allocation on demotion.
The work i'm doing is more aligned with - hey, filesystems are a global
resource, why are we using cgroup/task/vma policies to dictate whether a
filesystem's cache is eligible to land in remote nodes? i.e. drawing
better boundaries and controls around what can land in some set of
remote nodes "by default". You're looking for *strong isolation*
controls, which implies a different kind of allocator interface.
~Gregory
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-09-25 14:41 ` Gregory Price
@ 2025-10-16 11:48 ` Yiannis Nikolakopoulos
0 siblings, 0 replies; 53+ messages in thread
From: Yiannis Nikolakopoulos @ 2025-10-16 11:48 UTC (permalink / raw)
To: Gregory Price
Cc: Jonathan Cameron, Wei Xu, David Rientjes, Matthew Wilcox,
Bharata B Rao, linux-kernel, linux-mm, dave.hansen, hannes,
mgorman, mingo, peterz, raghavendra.kt, riel, sj, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, akpm, david, byungchul,
kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, yiannis,
Adam Manzanares
Hi Gregory,
Thanks for all the feedback. I am finally getting some time to come
back to this.
On Thu, Sep 25, 2025 at 4:41 PM Gregory Price <gourry@gourry.net> wrote:
>
> On Thu, Sep 25, 2025 at 04:03:46PM +0200, Yiannis Nikolakopoulos wrote:
> > >
> > > For the hardware compression devices how are you dealing with capacity variation
> > > / overcommit?
> ...
> > What is different from standard tiering is that the control plane is
> > checked on demotion to make sure there is still capacity left. If not, the
> > demotion fails. While this seems stable so far, a missing piece is to
> > ensure that this tier is mainly written by demotions and not arbitrary kernel
> > allocations (at least as a starting point). I want to explore how mempolicies
> > can help there, or something of the sort that Gregory described.
> >
>
> Writing back the description as i understand it:
>
> 1) The intent is to only have this memory allocable via demotion
> (i.e. no fault or direct allocation from userland possible)
Yes, that is what looks to me like the "safe" way to begin with. In
theory you could have userland apps/middleware that are aware of this
memory and its quirks and are OK to use it, but I guess we can leave
that for later, and it feels like it could be provided by a separate
driver.
>
> 2) The intent is to still have this memory accessible directly (DMA),
> while compressed, not trigger a fault/promotion on access
> (i.e. no zswap faults)
Correct. One of the big advantages of CXL.mem is the cache-line access
granularity and our customers don't want to lose that.
>
> 3) The intent is to have an external monitoring software handle
> outrunning run-away decompression/hotness by promoting that data.
External is not strictly necessary. E.g. it could be an additional
source of input to the kpromote/kmigrate solution.
>
> So basically we want a zswap-like interface for allocation, but to
If by "zswap-like interface" you mean something that can reject the
demote (or store according to the zswap semantics) then yes.
I just want to be careful when comparing with zswap.
> retain the `struct page` in page tables such that no faults are incurred
> on access. Then if the page becomes hot, depend on some kind of HMU
> tiering system to get it off the device.
Correct.
>
> I think we all understand there's some bear we have to outrun to deal
> with problem #3 - and many of us are skeptical that the bear won't catch
> up with our pants down. Let's ignore this for the moment.
Agreed.
>
> If such a device's memory is added to the default page allocator, then
> the question becomes one of *isolation* - such that the kernel will
> provide some "Capital-G Guarantee" that certain NUMA nodes will NEVER
> be used except under very explicit scenarios.
>
> There are only 3 mechanisms with which to restrict this (presently):
>
> 1) ZONE membership (to disallow GFP_KERNEL)
> 2) cgroups->cpusets->mems_allowed
> 3) task/vma mempolicy
> (obvious #4: Don't put it in the default page allocator)
>
> cpusets and mempolicy are not sufficient to provide full isolation
> - cgroups have the opposite hierarchical relationship than desired.
> The parent cgroup will lock out all children cgroups from using nodes
> not present in the parent mems_allowed. e.g. if you lock out access
> from the root cgroup, no cgroup on the entire system is eligible to
> allocate the memory. If you don't lock out the root cgroup - any root
> cgroup task is eligible. This isn't tractible.
>
> - task/vma mempolicy gets ignored in many cases and is closer to a
> suggestion than enforcible. It's also subject to rebinding as a
> task's cgroups.cpuset.mems_allowed changes.
>
> I haven't read up enough on ZONE_DEVICE to understand the implications
> of membership there, but have you explored this as an option? I don't
> see the work i'm doing intersecting well with your efforts - except
> maybe on the vmscan.c work around allocation on demotion.
Thanks for the very helpful breakdown. Your take on #2 & #3 seems
reasonable. About #1, I've skimmed through the rest of the thread and
I'll continue addressing your responses there.
Yiannis
>
> The work i'm doing is more aligned with - hey, filesystems are a global
> resource, why are we using cgroup/task/vma policies to dictate whether a
> filesystem's cache is eligible to land in remote nodes? i.e. drawing
> better boundaries and controls around what can land in some set of
> remote nodes "by default". You're looking for *strong isolation*
> controls, which implies a different kind of allocator interface.
>
> ~Gregory
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-09-25 14:03 ` Yiannis Nikolakopoulos
2025-09-25 14:41 ` Gregory Price
@ 2025-09-25 15:00 ` Jonathan Cameron
2025-09-25 15:08 ` Gregory Price
2025-10-16 16:16 ` Yiannis Nikolakopoulos
1 sibling, 2 replies; 53+ messages in thread
From: Jonathan Cameron @ 2025-09-25 15:00 UTC (permalink / raw)
To: Yiannis Nikolakopoulos
Cc: Wei Xu, David Rientjes, Gregory Price, Matthew Wilcox,
Bharata B Rao, linux-kernel, linux-mm, dave.hansen, hannes,
mgorman, mingo, peterz, raghavendra.kt, riel, sj, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, akpm, david, byungchul,
kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, yiannis,
Adam Manzanares
On Thu, 25 Sep 2025 16:03:46 +0200
Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com> wrote:
Hi Yiannis,
> > On 17 Sep 2025, at 18:49, Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> >
> > On Tue, 16 Sep 2025 17:30:46 -0700
> > Wei Xu <weixugc@google.com> wrote:
> >
> >> On Tue, Sep 16, 2025 at 12:45 PM David Rientjes <rientjes@google.com> wrote:
> >>>
> >>> On Wed, 10 Sep 2025, Gregory Price wrote:
> >>>
> >>>> On Wed, Sep 10, 2025 at 04:39:16PM +0100, Matthew Wilcox wrote:
> >>>>> On Wed, Sep 10, 2025 at 08:16:45PM +0530, Bharata B Rao wrote:
> >>>>>> This patchset introduces a new subsystem for hot page tracking
> >>>>>> and promotion (pghot) that consolidates memory access information
> >>>>>> from various sources and enables centralized promotion of hot
> >>>>>> pages across memory tiers.
> >>>>>
> >>>>> Just to be clear, I continue to believe this is a terrible idea and we
> >>>>> should not do this. If systems will be built with CXL (and given the
> >>>>> horrendous performance, I cannot see why they would be), the kernel
> >>>>> should not be migrating memory around like this.
> >>>>
> >>>> I've been considering this problem from the opposite approach since LSFMM.
> >>>>
> >>>> Rather than decide how to move stuff around, what if instead we just
> >>>> decide not to ever put certain classes of memory on CXL. Right now, so
> >>>> long as CXL is in the page allocator, it's the wild west - any page can
> >>>> end up anywhere.
> >>>>
> >>>> I have enough data now from ZONE_MOVABLE-only CXL deployments on real
> >>>> workloads to show local CXL expansion is valuable and performant enough
> >>>> to be worth deploying - but the key piece for me is that ZONE_MOVABLE
> >>>> disallows GFP_KERNEL. For example: this keeps SLAB meta-data out of
> >>>> CXL, but allows any given user-driven page allocation (including page
> >>>> cache, file, and anon mappings) to land there.
> >>>>
> >>>
> [snip]
> >>> There's also some feature support that is possible with these CXL memory
> >>> expansion devices that have started to pop up in labs that can also
> >>> drastically reduce overall TCO. Perhaps Wei Xu, cc'd, will be able to
> >>> chime in as well.
> >>>
> >>> This topic seems due for an alignment session as well, so will look to get
> >>> that scheduled in the coming weeks if people are up for it.
> >>
> >> Our experience is that workloads in hyperscale data centers such as
> >> Google often have significant cold memory. Offloading this to CXL memory
> >> devices, backed by cheaper, lower-performance media (e.g. DRAM with
> >> hardware compression), can be a practical approach to reduce overall
> >> TCO. Page promotion and demotion are then critical for such a tiered
> >> memory system.
> >
> > For the hardware compression devices how are you dealing with capacity variation
> > / overcommit?
> I understand that this is indeed one of the key questions from the upstream
> kernel’s perspective.
> So, I am jumping in to answer w.r.t. what we do in ZeroPoint; obviously I can
> not speak of other solutions/deployments. However, our HW interface follows
> existing open specifications from OCP [1], so what I am describing below is
> more widely applicable.
>
> At a very high level, the way our HW works is that the DPA is indeed
> overcommitted. Then, there is a control plane over CXL.io (PCIe) which
> exposes the real remaining capacity, as well as some configurable
> MSI-X interrupts that raise warnings when the capacity crosses over
> certain configurable thresholds.
>
> Last year I presented this interface in LSF/MM [2]. Based on the feedback I
> got there, we have an early prototype that acts as the *last* memory tier
> before reclaim (kind of "compressed tier in lieu of discard" as was
> suggested to me by Dan).
>
> What is different from standard tiering is that the control plane is
> checked on demotion to make sure there is still capacity left. If not, the
> demotion fails. While this seems stable so far, a missing piece is to
> ensure that this tier is mainly written by demotions and not arbitrary kernel
> allocations (at least as a starting point). I want to explore how mempolicies
> can help there, or something of the sort that Gregory described.
>
> This early prototype still needs quite some work in order to find the right
> abstractions. Hopefully, I will be able to push an RFC in the near future
> (a couple of months).
>
> > Whilst there have been some discussions on that but without a
> > backing store of flash or similar it seems to be challenging to use
> > compressed memory in a tiering system (so as 'normalish' memory) unless you
> > don't mind occasionally and unexpectedly running out of memory (in nasty
> > async ways as dirty cache lines get written back).
> There are several things that may be done on the device side. For now, I
> think the kernel should be unaware of these. But with what I described
> above, the goal is to have the capacity thresholds configured in a way
> that we can absorb the occasional dirty cache lines that are written back.
In the worst case they are far from occasional. It's not hard to imagine a malicious
program that ensures that all L3 in a system (say 256MiB+) is full of cache lines
from the far compressed memory, all of which are changed in a fashion that makes
the allocation much less compressible. If you are doing compression at cache line
granularity that's not so bad, because only a 256MiB margin would be needed.
If the system in question is doing large-block compression, say 4KiB, then we
have a 64x write amplification multiplier. If the virus is streaming over
memory, the evictions we see are the result of new lines being fetched
to be made much less compressible.
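Putting rough numbers on that worst case (my arithmetic on the example figures
above, purely illustrative): with line-granular compression the exposure is
just the 256MiB of dirty lines, but with 4KiB blocks those same lines can drag
~16GiB of block data into a worse compression ratio.

#include <stdio.h>

int main(void)
{
	unsigned long long l3    = 256ULL << 20;  /* 256 MiB of dirty lines in L3  */
	unsigned long long line  = 64;            /* cache line size               */
	unsigned long long block = 4096;          /* device compression block size */
	unsigned long long lines = l3 / line;     /* ~4M dirty lines               */

	/* each dirty line can force its whole block to be re-compressed */
	printf("write amplification: %llux\n", block / line);        /* 64x    */
	printf("worst-case rewrite:  %llu GiB\n",
	       (lines * block) >> 30);                               /* 16 GiB */
	return 0;
}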
Add an accelerator (say DPDK or other zero-copy into userspace buffers) into the
mix and you have a mess. You'll need to be extremely careful with what goes
in this compressed memory, or hold enormous buffer capacity against fast
changes in compressibility.
Key is that all software is potentially malicious (sometimes accidentally so ;)
Now, if we can put this into a special pool where it is acceptable to drop the writes
and return poison (so the application crashes) then that may be fine.
Or block writes. Running compressed memory as read only CoW is one way to
avoid this problem.
> >
> > Or do you mean zswap type use with a hardware offload of the actual
> > compression?
> I would categorize this as a completely different discussion (and product
> line for us).
>
> [1] https://www.opencompute.org/documents/hyperscale-tiered-memory-expander-specification-for-compute-express-link-cxl-1-pdf
> [2] https://www.youtube.com/watch?v=tXWEbaJmZ_s
>
> Thanks,
> Yiannis
>
> PS: Sending from a personal email address to avoid issues with
> confidentiality footers of the corporate domain.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-09-25 15:00 ` Jonathan Cameron
@ 2025-09-25 15:08 ` Gregory Price
2025-09-25 15:18 ` Gregory Price
2025-09-25 15:24 ` Jonathan Cameron
2025-10-16 16:16 ` Yiannis Nikolakopoulos
1 sibling, 2 replies; 53+ messages in thread
From: Gregory Price @ 2025-09-25 15:08 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Yiannis Nikolakopoulos, Wei Xu, David Rientjes, Matthew Wilcox,
Bharata B Rao, linux-kernel, linux-mm, dave.hansen, hannes,
mgorman, mingo, peterz, raghavendra.kt, riel, sj, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, akpm, david, byungchul,
kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, yiannis,
Adam Manzanares
On Thu, Sep 25, 2025 at 04:00:58PM +0100, Jonathan Cameron wrote:
> Now, if we can put this into a special pool where it is acceptable to drop the writes
> and return poison (so the application crashes) then that may be fine.
>
> Or block writes. Running compressed memory as read only CoW is one way to
> avoid this problem.
>
This is an interesting thought. If you drop a write and return poison,
can you instead handle the poison message as a fault and promote on
fault? Then you might just be able to turn this whole thing into a
zswap backend that promotes on write.
Then you don't particularly care about stronger isolation controls
(except maybe keeping kernel memory out of those regions).
~Gregory
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-09-25 15:08 ` Gregory Price
@ 2025-09-25 15:18 ` Gregory Price
2025-09-25 15:24 ` Jonathan Cameron
1 sibling, 0 replies; 53+ messages in thread
From: Gregory Price @ 2025-09-25 15:18 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Yiannis Nikolakopoulos, Wei Xu, David Rientjes, Matthew Wilcox,
Bharata B Rao, linux-kernel, linux-mm, dave.hansen, hannes,
mgorman, mingo, peterz, raghavendra.kt, riel, sj, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, akpm, david, byungchul,
kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, yiannis,
Adam Manzanares
On Thu, Sep 25, 2025 at 11:08:59AM -0400, Gregory Price wrote:
> On Thu, Sep 25, 2025 at 04:00:58PM +0100, Jonathan Cameron wrote:
> > Now, if we can put this into a special pool where it is acceptable to drop the writes
> > and return poison (so the application crashes) then that may be fine.
> >
> > Or block writes. Running compressed memory as read only CoW is one way to
> > avoid this problem.
> >
>
> This is an interesting thought. If you drop a write and return poison,
> can you instead handle the poison message as a fault and promote on
> fault? Then you might just be able to turn this whole thing into a
> zswap backend that promotes on write.
>
I just realized this would require some mechanism to re-issue the write.
So yeah, you'd have to do this via some heroic page table
enforcement. The key observation here is that zswap hacks off all the
page table entries - rather than leaving them present and readable. In
this design, you want to leave them present and readable, and therefore
need some way to prevent entries from changing out from under you.
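To spell out the contrast (illustrative only, not from any patch):

#include <linux/swapops.h>
#include <linux/pgtable.h>

/* swap/zswap style: the PTE stops being present, so *any* access faults
 * and the swap entry encoded in the PTE is needed to find the data. */
static pte_t swap_style_entry(swp_entry_t entry)
{
	return swp_entry_to_pte(entry);
}

/* scheme discussed here: stay present and readable, only writes trap,
 * so reads keep hitting the compressed memory directly. */
static pte_t keep_mapped_read_only(pte_t pte)
{
	return pte_wrprotect(pte);
}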
> Then you don't particular care about stronger isolation controls
> (except maybe keeping kernel memory out of those regions).
>
> ~Gregory
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-09-25 15:08 ` Gregory Price
2025-09-25 15:18 ` Gregory Price
@ 2025-09-25 15:24 ` Jonathan Cameron
2025-09-25 16:06 ` Gregory Price
1 sibling, 1 reply; 53+ messages in thread
From: Jonathan Cameron @ 2025-09-25 15:24 UTC (permalink / raw)
To: Gregory Price
Cc: Yiannis Nikolakopoulos, Wei Xu, David Rientjes, Matthew Wilcox,
Bharata B Rao, linux-kernel, linux-mm, dave.hansen, hannes,
mgorman, mingo, peterz, raghavendra.kt, riel, sj, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, akpm, david, byungchul,
kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, yiannis,
Adam Manzanares
On Thu, 25 Sep 2025 11:08:59 -0400
Gregory Price <gourry@gourry.net> wrote:
> On Thu, Sep 25, 2025 at 04:00:58PM +0100, Jonathan Cameron wrote:
> > Now, if we can put this into a special pool where it is acceptable to drop the writes
> > and return poison (so the application crashes) then that may be fine.
> >
> > Or block writes. Running compressed memory as read only CoW is one way to
> > avoid this problem.
> >
>
> This is an interesting thought. If you drop a write and return poison,
> can you instead handle the poison message as a fault and promote on
> fault? Then you might just be able to turn this whole thing into a
> zswap backend that promotes on write.
Poison only comes on a subsequent read, so you don't see anything
at write time (writes are inherently asynchronous due to cache write-back).
There are only a few ways to do writes that are allowed to fail (the 64-byte
atomic deferrable write stuff) and I think on all architectures where
they can even be pointed at main memory, they only defer if on uncacheable
memory.
Seeing poison on a subsequent read is far too late to promote the page;
you've lost the data. The poison only works as an ultimate safety gate. Also,
once you've tripped it the device probably needs to drop all writes
and return poison on all reads, not just the problem one (otherwise
things might fail much later).
The CoW thing only works because it's a permissions fault at point of
asking for permission to write (so way before it goes into the cache).
Then you can check margins to make sure you can still sink all outstanding
writes if they become uncompressible and only let the write through if safe
- if not promote some stuff before letting it proceed.
Or you just promote on write and rely on the demotion path performing those
careful checks later.
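Roughly the following (everything here is hypothetical - compressed_margin_ok()
and promote_folio() are made-up helpers - just to show where the check sits):

/* Called from the write-protection fault path for a folio that lives on
 * the compressed node, before the PTE is made writable. */
static int compressed_write_fault(struct folio *folio, int nid)
{
	/* enough headroom even if this folio becomes incompressible? */
	if (compressed_margin_ok(nid, folio_size(folio)))
		return 0;		/* let the write proceed in place */

	/* no margin: promote first (or always promote on write and let the
	 * demotion path do the careful checks later) */
	return promote_folio(folio);
}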
Jonathan
>
> Then you don't particular care about stronger isolation controls
> (except maybe keeping kernel memory out of those regions).
>
> ~Gregory
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-09-25 15:24 ` Jonathan Cameron
@ 2025-09-25 16:06 ` Gregory Price
2025-09-25 17:23 ` Jonathan Cameron
0 siblings, 1 reply; 53+ messages in thread
From: Gregory Price @ 2025-09-25 16:06 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Yiannis Nikolakopoulos, Wei Xu, David Rientjes, Matthew Wilcox,
Bharata B Rao, linux-kernel, linux-mm, dave.hansen, hannes,
mgorman, mingo, peterz, raghavendra.kt, riel, sj, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, akpm, david, byungchul,
kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, yiannis,
Adam Manzanares
On Thu, Sep 25, 2025 at 04:24:26PM +0100, Jonathan Cameron wrote:
> The CoW thing only works because it's a permissions fault at point of
> asking for permission to write (so way before it goes into the cache).
> Then you can check margins to make sure you can still sink all outstanding
> writes if they become uncompressible and only let the write through if safe
> - if not promote some stuff before letting it proceed.
> Or you just promote on write and rely on the demotion path performing those
> careful checks later.
>
Agreed. The question is now whether you can actually enforce page table
bits not changing. I think you'd need your own fault handling
infrastructure / driver for these pages.
This does smell a lot like a kernel-internal dax allocation interface.
There was a bunch of talk about virtualizing zswap backends, so that
might be a nice place to look to insert this kind of hook.
Then the device driver (which it will definitely need) would have to
field page faults accordingly.
It feels much more natural to put this as a zswap/zram backend.
~Gregory
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-09-25 16:06 ` Gregory Price
@ 2025-09-25 17:23 ` Jonathan Cameron
2025-09-25 19:02 ` Gregory Price
0 siblings, 1 reply; 53+ messages in thread
From: Jonathan Cameron @ 2025-09-25 17:23 UTC (permalink / raw)
To: Gregory Price
Cc: Yiannis Nikolakopoulos, Wei Xu, David Rientjes, Matthew Wilcox,
Bharata B Rao, linux-kernel, linux-mm, dave.hansen, hannes,
mgorman, mingo, peterz, raghavendra.kt, riel, sj, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, akpm, david, byungchul,
kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, yiannis,
Adam Manzanares
On Thu, 25 Sep 2025 12:06:28 -0400
Gregory Price <gourry@gourry.net> wrote:
> On Thu, Sep 25, 2025 at 04:24:26PM +0100, Jonathan Cameron wrote:
> > The CoW thing only works because it's a permissions fault at point of
> > asking for permission to write (so way before it goes into the cache).
> > Then you can check margins to make sure you can still sink all outstanding
> > writes if they become uncompressible and only let the write through if safe
> > - if not promote some stuff before letting it proceed.
> > Or you just promote on write and rely on the demotion path performing those
> > careful checks later.
> >
>
> Agreed. The question is now whether you can actually enforce page table
> bits not changing. I think you'd need your own fault handling
> infrastructure / driver for these pages.
>
> This does smell a lot like a kernel-internal dax allocation interface.
> There was a bunch of talk about virtualizing zswap backends, so that
> might be a nice place to look to insert this kind of hook.
>
> Then the device driver (which it will definitely need) would have to
> field page faults accordingly.
>
> It feels much more natural to put this as a zswap/zram backend.
>
Agreed. I currently see two paths that are generic (ish).
1. zswap route - faulting as you describe on writes.
2. Fail-safe route - Map compressible memory into a VM (or application)
   you don't mind killing if we lose that promotion race due to a
   pathological application. The attacker only disturbs memory allocated
   to that application / VM, so the blast radius is contained.
Jonathan
> ~Gregory
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-09-25 17:23 ` Jonathan Cameron
@ 2025-09-25 19:02 ` Gregory Price
2025-10-01 7:22 ` Gregory Price
0 siblings, 1 reply; 53+ messages in thread
From: Gregory Price @ 2025-09-25 19:02 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Yiannis Nikolakopoulos, Wei Xu, David Rientjes, Matthew Wilcox,
Bharata B Rao, linux-kernel, linux-mm, dave.hansen, hannes,
mgorman, mingo, peterz, raghavendra.kt, riel, sj, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, akpm, david, byungchul,
kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, yiannis,
Adam Manzanares
On Thu, Sep 25, 2025 at 06:23:08PM +0100, Jonathan Cameron wrote:
> On Thu, 25 Sep 2025 12:06:28 -0400
> Gregory Price <gourry@gourry.net> wrote:
>
> > It feels much more natural to put this as a zswap/zram backend.
> >
> Agreed. I currently see two paths that are generic (ish).
>
> 1. zswap route - faulting as you describe on writes.
aaaaaaaaaaaaaaaaaaaaaaah but therein lies the rub
The interposition point for zswap/zram is the PTE present bit being
hacked off to generate access faults.
If you want any random VMA to be eligible for demotion into the
tier, then you need to override that VMA's fault/protect hooks in its
vm_area_struct. This idea is a non-starter.
What you'd have to do is have those particular vm_area_struct's be
provided by some special allocator that says the memory is eligible for
demotion to compressed memory, and to route all faults through it.
That looks a lot like hacking up mm internals to support a single
hardware case. Hard to justify.
This may quite literally only be possible to do for unmapped pages,
which would limit the application to things like mm/filemap.c and making
IO (read/write) calls faster.
which - hey - maybe that's the best use-case anyway. Have all the
read-only compressible filecache you want. At least you avoid touching
disk.
~Gregory
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-09-25 19:02 ` Gregory Price
@ 2025-10-01 7:22 ` Gregory Price
2025-10-17 9:53 ` Yiannis Nikolakopoulos
0 siblings, 1 reply; 53+ messages in thread
From: Gregory Price @ 2025-10-01 7:22 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Yiannis Nikolakopoulos, Wei Xu, David Rientjes, Matthew Wilcox,
Bharata B Rao, linux-kernel, linux-mm, dave.hansen, hannes,
mgorman, mingo, peterz, raghavendra.kt, riel, sj, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, akpm, david, byungchul,
kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, yiannis,
Adam Manzanares
On Thu, Sep 25, 2025 at 03:02:16PM -0400, Gregory Price wrote:
> On Thu, Sep 25, 2025 at 06:23:08PM +0100, Jonathan Cameron wrote:
> > On Thu, 25 Sep 2025 12:06:28 -0400
> > Gregory Price <gourry@gourry.net> wrote:
> >
> > > It feels much more natural to put this as a zswap/zram backend.
> > >
> > Agreed. I currently see two paths that are generic (ish).
> >
> > 1. zswap route - faulting as you describe on writes.
>
> aaaaaaaaaaaaaaaaaaaaaaah but therein lies the rub
>
> The interposition point for zswap/zram is the PTE present bit being
> hacked off to generate access faults.
>
I went digging around a bit.
Not only this, but the PTE is used to store the swap entry ID, so you
can't just use a swap backend and keep the mapping. It's just not a
compatible abstraction - so as a zswap-backend this is DOA.
Even if you could figure out a way to re-use the abstraction and just
take a hard-fault to fault it back in as read-only, you lose the swap
entry on fault. That just gets nasty trying to reconcile the
differences between this interface and swap at that point.
So here's a fun proposal. I'm not sure of how NUMA nodes for devices
get determined -
1. Carve out an explicit proximity domain (NUMA node) for the compressed
region via SRAT.
https://docs.kernel.org/driver-api/cxl/platform/acpi/srat.html
2. Make sure this proximity domain (NUMA node) has separate data in the
HMAT so it can be an explicit demotion target for higher tiers
https://docs.kernel.org/driver-api/cxl/platform/acpi/hmat.html
3. Create a node-to-zone-allocator registration and retrieval function
device_folio_alloc = nid_to_alloc(nid)
4. Create a DAX extension that registers the above allocator interface
5. in `alloc_migration_target()` mm/migrate.c
Since nid is not a valid buddy-allocator target, everything here
will fail. So we can simply append the following to the bottom
	device_folio_alloc = nid_to_alloc(nid, DEVICE_FOLIO_ALLOC);
	if (device_folio_alloc)
		folio = device_folio_alloc(...);
	return folio;
6. in `struct migration_target_control` add a new .no_writable value
- This will say the new mapping replacements should have the
writable bit chopped off.
7. On write-fault, extend mm/memory.c:do_numa_page to detect this
and simply promote the page to allow writes. Write faults will
be expensive, but you'll have pretty strong guarantees around
not unexpectedly running out of space.
You can then loosen the .no_writable restriction with settings if
you have high confidence that your system will outrun your ability
to promote/evict/whatever if device memory becomes hot.
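A sketch of what steps 3-5 could look like (all of this is hypothetical -
nid_to_alloc(), device_folio_alloc_t and the registration table don't exist
anywhere - it just shows the shape of the hook):

#include <linux/gfp.h>
#include <linux/nodemask.h>

typedef struct folio *(*device_folio_alloc_t)(gfp_t gfp, unsigned int order,
					      int nid);

/* step 3: per-node registration (would want locking/RCU in real life) */
static device_folio_alloc_t device_allocs[MAX_NUMNODES];

int register_device_folio_alloc(int nid, device_folio_alloc_t fn)
{
	if (nid < 0 || nid >= MAX_NUMNODES)
		return -EINVAL;
	device_allocs[nid] = fn;	/* step 4: called by the DAX extension */
	return 0;
}

static device_folio_alloc_t nid_to_alloc(int nid)
{
	return (nid >= 0 && nid < MAX_NUMNODES) ? device_allocs[nid] : NULL;
}

/* step 5: tail of alloc_migration_target(), reached only when the buddy
 * allocator had nothing for this nid */
static struct folio *alloc_from_device_node(gfp_t gfp, unsigned int order,
					    int nid)
{
	device_folio_alloc_t device_folio_alloc = nid_to_alloc(nid);

	return device_folio_alloc ? device_folio_alloc(gfp, order, nid) : NULL;
}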
The only thing I don't know off hand is how shared pages will work in
this setup. For VMAs with a mapping that exist at demotion time, this
all works wonderfully - less so if the mapping doesn't exist or a new
VMA is created after a demotion has occurred.
I don't know what will happen there.
I think this would also sate the desire for a "separate CXL allocator"
for integration into other paths as well.
~Gregory
^ permalink raw reply [flat|nested] 53+ messages in thread* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-10-01 7:22 ` Gregory Price
@ 2025-10-17 9:53 ` Yiannis Nikolakopoulos
2025-10-17 14:15 ` Gregory Price
0 siblings, 1 reply; 53+ messages in thread
From: Yiannis Nikolakopoulos @ 2025-10-17 9:53 UTC (permalink / raw)
To: Gregory Price
Cc: Jonathan Cameron, Wei Xu, David Rientjes, Matthew Wilcox,
Bharata B Rao, linux-kernel, linux-mm, dave.hansen, hannes,
mgorman, mingo, peterz, raghavendra.kt, riel, sj, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, akpm, david, byungchul,
kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, yiannis,
Adam Manzanares
On Wed, Oct 1, 2025 at 9:22 AM Gregory Price <gourry@gourry.net> wrote:
>
> On Thu, Sep 25, 2025 at 03:02:16PM -0400, Gregory Price wrote:
> > On Thu, Sep 25, 2025 at 06:23:08PM +0100, Jonathan Cameron wrote:
> > > On Thu, 25 Sep 2025 12:06:28 -0400
> > > Gregory Price <gourry@gourry.net> wrote:
> > >
> > > > It feels much more natural to put this as a zswap/zram backend.
> > > >
> > > Agreed. I currently see two paths that are generic (ish).
> > >
> > > 1. zswap route - faulting as you describe on writes.
> >
> > aaaaaaaaaaaaaaaaaaaaaaah but therein lies the rub
> >
> > The interposition point for zswap/zram is the PTE present bit being
> > hacked off to generate access faults.
> >
>
> I went digging around a bit.
>
> Not only this, but the PTE is used to store the swap entry ID, so you
> can't just use a swap backend and keep the mapping. It's just not a
> compatible abstraction - so as a zswap-backend this is DOA.
>
> Even if you could figure out a way to re-use the abstraction and just
> take a hard-fault to fault it back in as read-only, you lose the swap
> entry on fault. That just gets nasty trying to reconcile the
> differences between this interface and swap at that point.
>
> So here's a fun proposal. I'm not sure of how NUMA nodes for devices
> get determined -
>
> 1. Carve out an explicit proximity domain (NUMA node) for the compressed
> region via SRAT.
> https://docs.kernel.org/driver-api/cxl/platform/acpi/srat.html
>
> 2. Make sure this proximity domain (NUMA node) has separate data in the
> HMAT so it can be an explicit demotion target for higher tiers
> https://docs.kernel.org/driver-api/cxl/platform/acpi/hmat.html
This makes sense. I've done a dirty hardcoding trick in my prototype
so that my node is always the last target. I'll have a look on how to
make this right.
>
> 3. Create a node-to-zone-allocator registration and retrieval function
> device_folio_alloc = nid_to_alloc(nid)
>
> 4. Create a DAX extension that registers the above allocator interface
>
> 5. in `alloc_migration_target()` mm/migrate.c
> Since nid is not a valid buddy-allocator target, everything here
> will fail. So we can simply append the following to the bottom
>
> device_folio_alloc = nid_to_alloc(nid, DEVICE_FOLIO_ALLOC);
> if (device_folio_alloc)
> folio = device_folio_alloc(...)
> return folio;
In my current prototype alloc_migration_target was working (naively).
Steps 3, 4 and 5 seem like an interesting thing to try after all this
discussion.
>
> 6. in `struct migration_target_control` add a new .no_writable value
> - This will say the new mapping replacements should have the
> writable bit chopped off.
>
> 7. On write-fault, extent mm/memory.c:do_numa_page to detect this
> and simply promote the page to allow writes. Write faults will
> be expensive, but you'll have pretty strong guarantees around
> not unexpectedly running out of space.
>
> You can then loosen the .no_writable restriction with settings if
> you have high confidence that your system will outrun your ability
> to promote/evict/whatever if device memory becomes hot.
That looks modular enough to allow me to test both writable and
no_writable and to compare.
>
> The only thing I don't know off hand is how shared pages will work in
> this setup. For VMAs with a mapping that exist at demotion time, this
> all works wonderfully - less so if the mapping doesn't exist or a new
> VMA is created after a demotion has occurred.
I'll keep that in mind.
>
> I don't know what will happen there.
>
> I think this would also sate the desire for a "separate CXL allocator"
> for integration into other paths as well.
>
> ~Gregory
Thanks a lot for all the discussion and the input. I can move my
prototype in this direction and will get back with what I've
learned and an RFC if it makes sense. Please keep me in the loop in
any related discussions.
Best,
/Yiannis
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-10-17 9:53 ` Yiannis Nikolakopoulos
@ 2025-10-17 14:15 ` Gregory Price
2025-10-17 14:36 ` Jonathan Cameron
0 siblings, 1 reply; 53+ messages in thread
From: Gregory Price @ 2025-10-17 14:15 UTC (permalink / raw)
To: Yiannis Nikolakopoulos
Cc: Jonathan Cameron, Wei Xu, David Rientjes, Matthew Wilcox,
Bharata B Rao, linux-kernel, linux-mm, dave.hansen, hannes,
mgorman, mingo, peterz, raghavendra.kt, riel, sj, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, akpm, david, byungchul,
kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, yiannis,
Adam Manzanares
On Fri, Oct 17, 2025 at 11:53:31AM +0200, Yiannis Nikolakopoulos wrote:
> On Wed, Oct 1, 2025 at 9:22 AM Gregory Price <gourry@gourry.net> wrote:
> > 1. Carve out an explicit proximity domain (NUMA node) for the compressed
> > region via SRAT.
> > https://docs.kernel.org/driver-api/cxl/platform/acpi/srat.html
> >
> > 2. Make sure this proximity domain (NUMA node) has separate data in the
> > HMAT so it can be an explicit demotion target for higher tiers
> > https://docs.kernel.org/driver-api/cxl/platform/acpi/hmat.html
> This makes sense. I've done a dirty hardcoding trick in my prototype
> so that my node is always the last target. I'll have a look on how to
> make this right.
I think it's probably a CEDT/CDAT/HMAT/SRAT/etc negotiation.
Essentially the platform needs to allow a single device to expose
multiple numa nodes based on different expected performance from
those ranges. Then software needs to program the HDM decoders
appropriately.
> > 5. in `alloc_migration_target()` mm/migrate.c
> > Since nid is not a valid buddy-allocator target, everything here
> > will fail. So we can simply append the following to the bottom
> >
> > device_folio_alloc = nid_to_alloc(nid, DEVICE_FOLIO_ALLOC);
> > if (device_folio_alloc)
> > folio = device_folio_alloc(...)
> > return folio;
> In my current prototype alloc_migration_target was working (naively).
> Steps 3, 4 and 5 seem like an interesting thing to try after all this
> discussion.
> >
Right, because the memory is directly accessible to the buddy allocator.
What i'm proposing would remove this memory from the buddy allocator and
force more explicit integration (in this case with this function).
More explicitly: in this design __folio_alloc can never access this
memory.
~Gregory
^ permalink raw reply [flat|nested] 53+ messages in thread* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-10-17 14:15 ` Gregory Price
@ 2025-10-17 14:36 ` Jonathan Cameron
2025-10-17 14:59 ` Gregory Price
0 siblings, 1 reply; 53+ messages in thread
From: Jonathan Cameron @ 2025-10-17 14:36 UTC (permalink / raw)
To: Gregory Price
Cc: Yiannis Nikolakopoulos, Wei Xu, David Rientjes, Matthew Wilcox,
Bharata B Rao, linux-kernel, linux-mm, dave.hansen, hannes,
mgorman, mingo, peterz, raghavendra.kt, riel, sj, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, akpm, david, byungchul,
kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, yiannis,
Adam Manzanares
On Fri, 17 Oct 2025 10:15:57 -0400
Gregory Price <gourry@gourry.net> wrote:
> On Fri, Oct 17, 2025 at 11:53:31AM +0200, Yiannis Nikolakopoulos wrote:
> > On Wed, Oct 1, 2025 at 9:22 AM Gregory Price <gourry@gourry.net> wrote:
> > > 1. Carve out an explicit proximity domain (NUMA node) for the compressed
> > > region via SRAT.
> > > https://docs.kernel.org/driver-api/cxl/platform/acpi/srat.html
> > >
> > > 2. Make sure this proximity domain (NUMA node) has separate data in the
> > > HMAT so it can be an explicit demotion target for higher tiers
> > > https://docs.kernel.org/driver-api/cxl/platform/acpi/hmat.html
> > This makes sense. I've done a dirty hardcoding trick in my prototype
> > so that my node is always the last target. I'll have a look on how to
> > make this right.
>
> I think it's probably a CEDT/CDAT/HMAT/SRAT/etc negotiation.
>
> Essentially the platform needs to allow a single device to expose
> multiple numa nodes based on different expected performance. From
> those ranges. Then software needs to program the HDM decoders
> appropriately.
It's a bit 'fuzzy' to justify, but maybe (for CXL) a CFMWS flag (so CEDT
as you mention) to say this host memory region may be backed by
compressed memory?
Might be able to justify it from spec point of view by arguing that
compression is a QoS related characteristic. Always possible host
hardware will want to handle it differently before it even hits the
bus even if it's just a case throttling writing differently.
That then ends up in its own NUMA node. Whether we take on the
splitting CFMWS entries into multiple NUMA nodes depending on what
backing devices end up in them is something we kicked into the long
grass originally, but that can definitely be revisited. That
doesn't matter for initial support of compressed memory though if
we can do it via a separate CXL Fixed Memory Window Structure (CFMWS)
in CEDT.
>
> > > 5. in `alloc_migration_target()` mm/migrate.c
> > > Since nid is not a valid buddy-allocator target, everything here
> > > will fail. So we can simply append the following to the bottom
> > >
> > > device_folio_alloc = nid_to_alloc(nid, DEVICE_FOLIO_ALLOC);
> > > if (device_folio_alloc)
> > > folio = device_folio_alloc(...)
> > > return folio;
> > In my current prototype alloc_migration_target was working (naively).
> > Steps 3, 4 and 5 seem like an interesting thing to try after all this
> > discussion.
> > >
>
> Right because the memory is directly accessible to the buddy allocator.
> What i'm proposing would remove this memory from the buddy allocator and
> force more explicit integration (in this case with this function).
>
> more explicitly: in this design __folio_alloc can never access this
> memory.
>
> ~Gregory
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-10-17 14:36 ` Jonathan Cameron
@ 2025-10-17 14:59 ` Gregory Price
2025-10-20 14:05 ` Jonathan Cameron
0 siblings, 1 reply; 53+ messages in thread
From: Gregory Price @ 2025-10-17 14:59 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Yiannis Nikolakopoulos, Wei Xu, David Rientjes, Matthew Wilcox,
Bharata B Rao, linux-kernel, linux-mm, dave.hansen, hannes,
mgorman, mingo, peterz, raghavendra.kt, riel, sj, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, akpm, david, byungchul,
kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, yiannis,
Adam Manzanares
On Fri, Oct 17, 2025 at 03:36:13PM +0100, Jonathan Cameron wrote:
> On Fri, 17 Oct 2025 10:15:57 -0400
> Gregory Price <gourry@gourry.net> wrote:
> >
> > Essentially the platform needs to allow a single device to expose
> > multiple numa nodes based on different expected performance. From
> > those ranges. Then software needs to program the HDM decoders
> > appropriately.
>
> It's a bit 'fuzzy' to justify but maybe (for CXL) a CFWMS flag (so CEDT
> as you mention) to say this host memory region may be backed by
> compressed memory?
>
> Might be able to justify it from spec point of view by arguing that
> compression is a QoS related characteristic. Always possible host
> hardware will want to handle it differently before it even hits the
> bus even if it's just a case throttling writing differently.
>
That's a Consortium discussion to have (and I am not of the
consortium :P), but yeah you could do it that way.
More generally could have a "Not-for-general-consumption bit" instead
of specifically a compressed bit. Maybe both a "No-Consume" and a
"Special Node" bit would be useful separately.
Of course then platforms need to be made to understand all these:
"No-Consume" -> force EFI_MEMORY_SP or leave it reserved
"Special Node" -> allocate its own PXM / Provide discrete CFMWS
Naming obviously non-instructive here, may as well call them Nancy and
Bob bits.
> That then ends up in its own NUMA node. Whether we take on
> splitting CFMWS entries into multiple NUMA nodes depending on what
> backing devices end up in them is something we kicked into the long
> grass originally, but that can definitely be revisited. That
> doesn't matter for initial support of compressed memory though if
> we can do it via a separate CXL Fixed Memory Window Structure (CFMWS)
> in CEDT.
>
This is the way I would initially approach it tbh - but I'm also not a
hardware/firmware person, so I don't know exactly what bits a device
would set to tell BIOS/EFI "Hey, give this chunk its own CFMWS", or if
that lies solely with BIOS/EFI.
~Gregory
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-10-17 14:59 ` Gregory Price
@ 2025-10-20 14:05 ` Jonathan Cameron
2025-10-21 18:52 ` Gregory Price
0 siblings, 1 reply; 53+ messages in thread
From: Jonathan Cameron @ 2025-10-20 14:05 UTC (permalink / raw)
To: Gregory Price
Cc: Yiannis Nikolakopoulos, Wei Xu, David Rientjes, Matthew Wilcox,
Bharata B Rao, linux-kernel, linux-mm, dave.hansen, hannes,
mgorman, mingo, peterz, raghavendra.kt, riel, sj, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, akpm, david, byungchul,
kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, yiannis,
Adam Manzanares
On Fri, 17 Oct 2025 10:59:01 -0400
Gregory Price <gourry@gourry.net> wrote:
> On Fri, Oct 17, 2025 at 03:36:13PM +0100, Jonathan Cameron wrote:
> > On Fri, 17 Oct 2025 10:15:57 -0400
> > Gregory Price <gourry@gourry.net> wrote:
> > >
> > > Essentially the platform needs to allow a single device to expose
> > > multiple NUMA nodes based on different expected performance from
> > > those ranges. Then software needs to program the HDM decoders
> > > appropriately.
> >
> > It's a bit 'fuzzy' to justify but maybe (for CXL) a CFMWS flag (so CEDT
> > as you mention) to say this host memory region may be backed by
> > compressed memory?
> >
> > Might be able to justify it from a spec point of view by arguing that
> > compression is a QoS-related characteristic. It's always possible host
> > hardware will want to handle it differently before it even hits the
> > bus, even if it's just a case of throttling writes differently.
> >
>
> That's a Consortium discussion to have (and I am not part of the
> consortium :P), but yeah you could do it that way.
The moment I know it's raised there I (and others involved in the consortium)
can't talk about it in public. (I love standards org IP rules!)
So it's useful to have a pre-discussion before that happens. We've
done this before for other topics and it can be very productive.
>
> More generally could have a "Not-for-general-consumption bit" instead
> of specifically a compressed bit. Maybe both a "No-Consume" and a
> "Special Node" bit would be useful separately.
>
> Of course then platforms need to be made to understand all these:
>
> "No-Consume" -> force EFI_MEMORY_SP or leave it reserved
> "Special Node" -> allocate its own PXM / Provide discrete CFMWS
>
> Naming obviously non-instructive here, may as well call them Nancy and
> Bob bits.
For compression specifically I think there is value in making it
explicitly compression because the host hardware might handle that
differently. The other bits might be worth having as well
though. SPM was all about 'you could' use it as normal memory but
someone put it there for something else. This is more a case of
SPOM. Specific Purpose Only Memory - eats babies if you don't know
the extra rules for each instance of that.
>
> > That then ends up in its own NUMA node. Whether we take on
> > splitting CFMWS entries into multiple NUMA nodes depending on what
> > backing devices end up in them is something we kicked into the long
> > grass originally, but that can definitely be revisited. That
> > doesn't matter for initial support of compressed memory though if
> > we can do it via a separate CXL Fixed Memory Window Structure (CFMWS)
> > in CEDT.
> >
>
> This is the way I would initially approach it tbh - but I'm also not a
> hardware/firmware person, so I don't know exactly what bits a device
> would set to tell BIOS/EFI "Hey, give this chunk its own CFMWS", or if
> that lies solely with BIOS/EFI.
It's not a device thing wrt nodes today (and there are good reasons
why it should not be at that granularity, e.g. node explosion has costs).
The BIOS might set up the decoders in advance and even lock them, but I'd expect
we'll move away from that to fully OS-managed over time (to get flexibility)
- the exception being when confidential compute is making its
usual mess of things.
Maybe the BIOS would have a look at devices and decide to enable a
compressed memory CFMWS if it finds devices that need it and not do
so otherwise, though not doing so breaks hotplug of compressed memory devices.
So my guess is either we need to fix Linux to allow splitting a fixed
memory window up into multiple NUMA nodes, or platforms have to spin
extra fixed memory windows (host side PA ranges with a NUMA node for each).
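(Purely illustrative, not a real CEDT dump - the second option just
means firmware publishes one window per class of backing, e.g.

	CFMWS #0: host PA range A -> its own NUMA node, ordinary CXL DRAM
	CFMWS #1: host PA range B -> its own NUMA node, compressed-capable

and Linux's existing one-NUMA-node-per-CFMWS assignment does the rest.)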
Which option depends a bit on whether we expect host hardware to either
handle compressed differently from normal ram, or at least separate it
for QoS reasons.
What fun.
J
>
> ~Gregory
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-10-20 14:05 ` Jonathan Cameron
@ 2025-10-21 18:52 ` Gregory Price
2025-10-21 18:57 ` Gregory Price
0 siblings, 1 reply; 53+ messages in thread
From: Gregory Price @ 2025-10-21 18:52 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Yiannis Nikolakopoulos, Wei Xu, David Rientjes, Matthew Wilcox,
Bharata B Rao, linux-kernel, linux-mm, dave.hansen, hannes,
mgorman, mingo, peterz, raghavendra.kt, riel, sj, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, akpm, david, byungchul,
kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, yiannis,
Adam Manzanares
On Mon, Oct 20, 2025 at 03:05:26PM +0100, Jonathan Cameron wrote:
> > More generally could have a "Not-for-general-consumption bit" instead
> > of specifically a compressed bit. Maybe both a "No-Consume" and a
> > "Special Node" bit would be useful separately.
> >
> > Of course then platforms need to be made to understand all these:
> >
> > "No-Consume" -> force EFI_MEMORY_SP or leave it reserved
> > "Special Node" -> allocate its own PXM / Provide discrete CFMWS
> >
> > Naming obviously non-instructive here, may as well call them Nancy and
> > Bob bits.
>
> For compression specifically I think there is value in making it
> explicitly compression because the host hardware might handle that
> differently. The other bits might be worth having as well
> though. SPM was all about 'you could' use it as normal memory but
> someone put it there for something else. This is more a case of
> SPOM. Specific Purpose Only Memory - eats babies if you don't know
> the extra rules for each instance of that.
>
This is a fair point. Something like a SPOM bit that says some other
bit-field is valid and you get to add new extensions about how the
memory should be used? :shrug: probably sufficiently extensible but
maybe never used for anything more than compression.
> Maybe the BIOS would have a look at devices and decide to enable a
> compressed memory CFMWS if it finds devices that need it and not do
> so otherwise, though not doing so breaks hotplug of compressed memory devices.
>
> So my guess is either we need to fix Linux to allow splitting a fixed
> memory window up into multiple NUMA nodes, or platforms have to spin
> extra fixed memory windows (host side PA ranges with a NUMA node for each).
>
I don't think splitting a CFMW into multiple nodes is feasible, but I
also haven't looked at that region of ACPI code since I finished the
docs. I can look into that.
I would prefer the former, since this is already what's done for
hostbridge interleave vs non-interleave setups, where the host may
expose multiple CFMW for the same devices depending on how the OS.
> Which option depends a bit on whether we expect host hardware to either
> handle compressed differently from normal ram, or at least separate it
> for QoS reasons.
>
There's only a handful of folks discussing this at the moment, but so
far we've all been consistent in our gut telling us it should be handled
differently for reliability reasons. But also, more opinions always
welcome :]
~Gregory
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-10-21 18:52 ` Gregory Price
@ 2025-10-21 18:57 ` Gregory Price
2025-10-22 9:09 ` Jonathan Cameron
0 siblings, 1 reply; 53+ messages in thread
From: Gregory Price @ 2025-10-21 18:57 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Yiannis Nikolakopoulos, Wei Xu, David Rientjes, Matthew Wilcox,
Bharata B Rao, linux-kernel, linux-mm, dave.hansen, hannes,
mgorman, mingo, peterz, raghavendra.kt, riel, sj, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, akpm, david, byungchul,
kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, yiannis,
Adam Manzanares
On Tue, Oct 21, 2025 at 02:52:40PM -0400, Gregory Price wrote:
> I would prefer the former, since this is already what's done for
> hostbridge interleave vs non-interleave setups, where the host may
> expose multiple CFMW for the same devices depending on how the OS.
bah, got distracted
"Depending on how the OS may choose to configure things at some unknown
point in the future"
~Gregory
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-10-21 18:57 ` Gregory Price
@ 2025-10-22 9:09 ` Jonathan Cameron
2025-10-22 15:05 ` Gregory Price
0 siblings, 1 reply; 53+ messages in thread
From: Jonathan Cameron @ 2025-10-22 9:09 UTC (permalink / raw)
To: Gregory Price, hannes
Cc: Yiannis Nikolakopoulos, Wei Xu, David Rientjes, Matthew Wilcox,
Bharata B Rao, linux-kernel, linux-mm, dave.hansen, mgorman,
mingo, peterz, raghavendra.kt, riel, sj, ying.huang, ziy, dave,
nifan.cxl, xuezhengchu, akpm, david, byungchul, kinseyho,
joshua.hahnjy, yuanchu, balbirs, alok.rathore, yiannis,
Adam Manzanares
On Tue, 21 Oct 2025 14:57:26 -0400
Gregory Price <gourry@gourry.net> wrote:
> On Tue, Oct 21, 2025 at 02:52:40PM -0400, Gregory Price wrote:
> > I would prefer the former, since this is already what's done for
> > hostbridge interleave vs non-interleave setups, where the host may
> > expose multiple CFMW for the same devices depending on how the OS.
>
> bah, got distracted
>
> "Depending on how the OS may choose to configure things at some unknown
> point in the future"
My gut feeling is the need to do dynamic NUMA nodes will not be driven
by this but more by large-scale fabrics (if that ever happens)
and trade-offs of host PA space vs QoS in the hardware. Those
trade-offs might put memory with very different performance
characteristics behind one window.
Maybe it'll become a thing that can be used for compression.
Otherwise compression from the host hardware point of view might be
like the question of shared or separate fixed memory windows for
persistent / volatile. Ideally they'd be separate but if Host PA space
is limited, someone might build a system where a single fixed memory
window is used to support both.
It's possible that virtualization of some of this stuff will make it more complex
again. Any crazy mess can share a fake fixed memory window as the QoS is
all behind some page tables.
Meh. Let's suggest people burn host PA space for now. If anyone hits
that limit they can solve it (crosses fingers it's not my lot :)
Jonathan
>
>
> ~Gregory
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-10-22 9:09 ` Jonathan Cameron
@ 2025-10-22 15:05 ` Gregory Price
2025-10-23 15:29 ` Jonathan Cameron
0 siblings, 1 reply; 53+ messages in thread
From: Gregory Price @ 2025-10-22 15:05 UTC (permalink / raw)
To: Jonathan Cameron
Cc: hannes, Yiannis Nikolakopoulos, Wei Xu, David Rientjes,
Matthew Wilcox, Bharata B Rao, linux-kernel, linux-mm,
dave.hansen, mgorman, mingo, peterz, raghavendra.kt, riel, sj,
ying.huang, ziy, dave, nifan.cxl, xuezhengchu, akpm, david,
byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore, yiannis, Adam Manzanares
On Wed, Oct 22, 2025 at 10:09:50AM +0100, Jonathan Cameron wrote:
>
> My gut feeling is the need to do dynamic NUMA nodes will not be driven
> by this but more by large-scale fabrics (if that ever happens)
> and trade-offs of host PA space vs QoS in the hardware. Those
> trade-offs might put memory with very different performance
> characteristics behind one window.
>
I can't believe we live in a world where "We have to think about the
scenario where we actually need all 256 TB of 48-bit phys-addressing"
is not a tongue in cheek joke o_o
That's a paltry 2048 128GB DIMMs... and whatever monstrosity you have to
build to host it all but that's at least a fun engineering problem :V
Bring on the 128-bit CPUs!
What do we name those x86 registers though? Slap the E back on for ERAX?
> Meh. Let's suggest people burn host PA space for now. If anyone hits
> that limit they can solve it (crosses fingers it's not my lot :)
>
+1
~Gregory
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-10-22 15:05 ` Gregory Price
@ 2025-10-23 15:29 ` Jonathan Cameron
0 siblings, 0 replies; 53+ messages in thread
From: Jonathan Cameron @ 2025-10-23 15:29 UTC (permalink / raw)
To: Gregory Price
Cc: hannes, Yiannis Nikolakopoulos, Wei Xu, David Rientjes,
Matthew Wilcox, Bharata B Rao, linux-kernel, linux-mm,
dave.hansen, mgorman, mingo, peterz, raghavendra.kt, riel, sj,
ying.huang, ziy, dave, nifan.cxl, xuezhengchu, akpm, david,
byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
alok.rathore, yiannis, Adam Manzanares
On Wed, 22 Oct 2025 11:05:16 -0400
Gregory Price <gourry@gourry.net> wrote:
> On Wed, Oct 22, 2025 at 10:09:50AM +0100, Jonathan Cameron wrote:
> >
> > My gut feeling is the need to do dynamic NUMA nodes will not be driven
> > by this but more by large-scale fabrics (if that ever happens)
> > and trade-offs of host PA space vs QoS in the hardware. Those
> > trade-offs might put memory with very different performance
> > characteristics behind one window.
> >
>
> I can't believe we live in a world where "We have to think about the
> scenario where we actually need all 256 TB of 48-bit phys-addressing"
> is not a tongue in cheek joke o_o
You think everyone wires all the 48 bits? Certainly not everyone does.
>
> That's a paltry 2048 128GB DIMMs... and whatever monstrosity you have to
> build to host it all but that's at least a fun engineering problem :V
>
> Bring on the 128-bit CPUs!
>
> What do we name those x86 registers though? Slap the E back on for ERAX?
>
> > Meh. Let's suggest people burn host PA space for now. If anyone hits
> > that limit they can solve it (crosses fingers it's not my lot :)
> >
>
> +1
>
> ~Gregory
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-09-25 15:00 ` Jonathan Cameron
2025-09-25 15:08 ` Gregory Price
@ 2025-10-16 16:16 ` Yiannis Nikolakopoulos
2025-10-20 14:23 ` Jonathan Cameron
1 sibling, 1 reply; 53+ messages in thread
From: Yiannis Nikolakopoulos @ 2025-10-16 16:16 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Wei Xu, David Rientjes, Gregory Price, Matthew Wilcox,
Bharata B Rao, linux-kernel, linux-mm, dave.hansen, hannes,
mgorman, mingo, peterz, raghavendra.kt, riel, sj, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, akpm, david, byungchul,
kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, yiannis,
Adam Manzanares
On Thu, Sep 25, 2025 at 5:01 PM Jonathan Cameron
<jonathan.cameron@huawei.com> wrote:
>
> On Thu, 25 Sep 2025 16:03:46 +0200
> Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com> wrote:
>
> Hi Yiannis,
Hi Jonathan! Thanks for your response!
[snip]
> > There are several things that may be done on the device side. For now, I
> > think the kernel should be unaware of these. But with what I described
> > above, the goal is to have the capacity thresholds configured in a way
> > that we can absorb the occasional dirty cache lines that are written back.
>
> In the worst case they are far from occasional. It's not hard to imagine a malicious
This is correct. Any simplification on my end is mainly based on the
empirical evidence of the use cases we are testing for (tiering). But
I fully respect that we need to be proactive and assume the worst case
scenario.
> program that ensures that all L3 in a system (say 256MiB+) is full of cache lines
> from the far compressed memory, all of which are changed in a fashion that makes
> the allocation much less compressible. If you are doing compression at cache-line
> granularity that's not so bad because only a 256MiB margin would be needed.
> If the system in question is doing large block-size compression, say 4KiB,
> then we have a 64x write amplification multiplier. If the virus is streaming over
This is insightful indeed :). However, even in the case of the 64x
amplification, you implicitly assume that each of the cachelines in
the L3 belongs to a different page. But then one cache-line would not
deteriorate the compressed size of the entire page that much (the
bandwidth amplification on the device is a different -performance-
story). So even in the 4K case the two ends of the spectrum are to
either have big amplification with low compression ratio impact, or
small amplification with higher compression ratio impact.
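To put rough numbers on that (back-of-envelope, using the 256MiB L3
figure above): 256MiB / 64B is ~4M dirty lines. With cache-line
granularity compression the worst-case extra backing is ~4M * 64B =
256MiB. With 4KiB compression blocks, if every dirty line landed in a
different block and made that whole block incompressible, the bound
becomes ~4M * 4KiB = 16GiB (the 64x factor). If instead each line only
slightly degrades the ratio of its own page, the capacity hit is a small
fraction of that, even though the device still sees the 64x write
amplification in bandwidth terms.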
Another practical assumption here is that the different HMU
mechanisms would help promote the contended pages before this becomes
a big issue. Which of course might still not be enough on the
malicious streaming writes workload.
Overall, I understand these are heuristics and I do see your point
that this needs to be robust even for the maliciously behaving
programs.
> memory, the evictions we are seeing as a result of new lines being fetched
> will be much less compressible.
>
> Add an accelerator (say DPDK or other zero copy into userspace buffers) into the
> mix and you have a mess. You'll need to be extremely careful with what goes
Good point about the zero copy stuff.
> in this compressed memory or hold enormous buffer capacity against fast
> changes in compressibility.
In my experience the factor of buffer capacity would be closer to the
benefit that you get from the compression (e.g. 2x the cache size in
your example).
But I understand the burden of proof is on our end. As we move further
with this I will try to provide data as well.
>
> Key is that all software is potentially malicious (sometimes accidentally so ;)
>
> Now, if we can put this into a special pool where it is acceptable to drop the writes
> and return poison (so the application crashes) then that may be fine.
>
> Or block writes. Running compressed memory as read only CoW is one way to
> avoid this problem.
These could be good starting points, as I see in the rest of the thread.
Thanks,
Yiannis
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-10-16 16:16 ` Yiannis Nikolakopoulos
@ 2025-10-20 14:23 ` Jonathan Cameron
2025-10-20 15:05 ` Gregory Price
0 siblings, 1 reply; 53+ messages in thread
From: Jonathan Cameron @ 2025-10-20 14:23 UTC (permalink / raw)
To: Yiannis Nikolakopoulos
Cc: Wei Xu, David Rientjes, Gregory Price, Matthew Wilcox,
Bharata B Rao, linux-kernel, linux-mm, dave.hansen, hannes,
mgorman, mingo, peterz, raghavendra.kt, riel, sj, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, akpm, david, byungchul,
kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, yiannis,
Adam Manzanares
On Thu, 16 Oct 2025 18:16:31 +0200
Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com> wrote:
> On Thu, Sep 25, 2025 at 5:01 PM Jonathan Cameron
> <jonathan.cameron@huawei.com> wrote:
> >
> > On Thu, 25 Sep 2025 16:03:46 +0200
> > Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com> wrote:
> >
> > Hi Yiannis,
> Hi Jonathan! Thanks for your response!
>
Hi Yiannis,
This is way more fun than doing real work ;)
> [snip]
> > > There are several things that may be done on the device side. For now, I
> > > think the kernel should be unaware of these. But with what I described
> > > above, the goal is to have the capacity thresholds configured in a way
> > > that we can absorb the occasional dirty cache lines that are written back.
> >
> > In the worst case they are far from occasional. It's not hard to imagine a malicious
> This is correct. Any simplification on my end is mainly based on the
> empirical evidence of the use cases we are testing for (tiering). But
> I fully respect that we need to be proactive and assume the worst case
> scenario.
> > program that ensures that all L3 in a system (say 256MiB+) is full of cache lines
> > from the far compressed memory, all of which are changed in a fashion that makes
> > the allocation much less compressible. If you are doing compression at cache-line
> > granularity that's not so bad because only a 256MiB margin would be needed.
> > If the system in question is doing large block-size compression, say 4KiB,
> > then we have a 64x write amplification multiplier. If the virus is streaming over
> This is insightful indeed :). However, even in the case of the 64x
> amplification, you implicitly assume that each of the cachelines in
> the L3 belongs to a different page. But then one cache-line would not
> deteriorate the compressed size of the entire page that much (the
> bandwidth amplification on the device is a different -performance-
> story).
This is putting limits on what compression algorithm is used. We could do
that but then we'd have to never support anything different. Maybe if the
device itself provided the worst-case amplification numbers that would do.
Any device that gets this wrong is buggy - but it might be hard to detect
that if people don't publish their compression algs and the proofs of
worst-case blow-up of compression blocks.
I guess we could do the maths on what the device manufacturer says and
if we don't believe them or they haven't provided enough info to check,
double it :)
> So even in the 4K case the two ends of the spectrum are to
> either have big amplification with low compression ratio impact, or
> small amplification with higher compression ratio impact.
> Another practical assumption here is that the different HMU
> mechanisms would help promote the contended pages before this becomes
> a big issue. Which of course might still not be enough on the
> malicious streaming writes workload.
Using promotion to get you out of this is a non-starter unless you have
a backstop, because we'll have annoying things like pinning going on or
bandwidth bottlenecks at the promotion target.
Promotion might massively reduce the performance impact of course under
normal conditions.
> Overall, I understand these are heuristics and I do see your point
> that this needs to be robust even for the maliciously behaving
> programs.
> > memory, the evictions we are seeing as a result of new lines being fetched
> > will be much less compressible.
> >
> > Add an accelerator (say DPDK or other zero copy into userspace buffers) into the
> > mix and you have a mess. You'll need to be extremely careful with what goes
> Good point about the zero copy stuff.
> > in this compressed memory or hold enormous buffer capacity against fast
> > changes in compressibility.
> In my experience the factor of buffer capacity would be closer to the
> benefit that you get from the compression (e.g. 2x the cache size in
> your example).
> But I understand the burden of proof is on our end. As we move further
> with this I will try to provide data as well.
If we are aiming for generality the nasty problem is that either we have to
write rules on what Linux will cope with, or design it to cope with the
worst possible implementation :(
I can think of lots of plausible sounding cases that have horrendous
multiplication factors if done in a naive fashion.
* De-duplication
* Metadata flag for all 0s
* Some general purpose compression algs are very vulnerable to the tails
of the probability distributions. Some will flip between multiple modes
with very different characteristics, perhaps to meet latency guarantees.
Would be fun to ask an information theorist / compression expert to lay
out an algorithm with the worst possible tail performance but with good
average.
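(Toy example of the flavour I mean: a 4KiB all-zero block that a device
stores as a few bytes of metadata inflates by roughly three orders of
magnitude the moment a single byte of it is written.)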
> >
> > Key is that all software is potentially malicious (sometimes accidentally so ;)
> >
> > Now, if we can put this into a special pool where it is acceptable to drop the writes
> > and return poison (so the application crashes) then that may be fine.
> >
> > Or block writes. Running compressed memory as read only CoW is one way to
> > avoid this problem.
> These could be good starting points, as I see in the rest of the thread.
>
Fun problems. Maybe we start with very conservative handling and then
argue for relaxations later.
Jonathan
> Thanks,
> Yiannis
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-10-20 14:23 ` Jonathan Cameron
@ 2025-10-20 15:05 ` Gregory Price
0 siblings, 0 replies; 53+ messages in thread
From: Gregory Price @ 2025-10-20 15:05 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Yiannis Nikolakopoulos, Wei Xu, David Rientjes, Matthew Wilcox,
Bharata B Rao, linux-kernel, linux-mm, dave.hansen, hannes,
mgorman, mingo, peterz, raghavendra.kt, riel, sj, ying.huang,
ziy, dave, nifan.cxl, xuezhengchu, akpm, david, byungchul,
kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, yiannis,
Adam Manzanares
On Mon, Oct 20, 2025 at 03:23:45PM +0100, Jonathan Cameron wrote:
> On Thu, 16 Oct 2025 18:16:31 +0200
> Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com> wrote:
>
> > These could be good starting points, as I see in the rest of the thread.
> >
> Fun problems. Maybe we start with very conservative handling and then
> argue for relaxations later.
>
Not to pile on, but if we can't even manage the conservative handling
due to other design issues - then it doesn't bode well for the rest.
So getting that right should be the priority - not a maybe.
~Gregory
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
2025-09-16 19:45 ` David Rientjes
2025-09-16 22:02 ` Gregory Price
2025-09-17 0:30 ` Wei Xu
@ 2025-10-08 17:59 ` Vinicius Petrucci
2 siblings, 0 replies; 53+ messages in thread
From: Vinicius Petrucci @ 2025-10-08 17:59 UTC (permalink / raw)
To: David Rientjes
Cc: Gregory Price, Matthew Wilcox, Bharata B Rao, linux-kernel,
linux-mm, Jonathan.Cameron, dave.hansen, hannes, mgorman, mingo,
peterz, raghavendra.kt, riel, sj, weixugc, ying.huang, ziy, dave,
nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul,
kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore
Hi David,
On Tue, Sep 16, 2025 at 1:28 PM David Rientjes <rientjes@google.com> wrote:
>
> I've been pretty focused on the promotion story here rather than demotion
> because of how responsive it needs to be. Harvesting the page table
> accessed bits or waiting on a sliding window through NUMA Balancing (even
> NUMAB=2) is not as responsive as needed for very fast promotion to top
> tier memory, hence things like the CHMU (or PEBS or IBS etc).
First, thanks for sharing your thoughts on the promotion
responsiveness challenges, definitely a critical aspect for tiering
strategies.
We recently put together a preliminary report using our experimental
HW that I believe could be relevant to the ongoing discussions:
A Limits Study of Memory-side Tiering Telemetry:
https://arxiv.org/abs/2508.09351
It's essentially an initial step toward quantifying the benefits of
HMU on the memory side, aiming to compare promotion quality (e.g.,
hotness coverage and accuracy) across HMU, PEBS-based promotion, and
NUMA balancing (promotion path).
Hopefully, this kind of work can help us better understand some of the
trade-offs being discussed, support more data-driven comparisons, and
spark more fruitful discussions...
Best,
Vinicius
^ permalink raw reply [flat|nested] 53+ messages in thread