[RFC PATCH 0/4] Kernel daemon for detecting and promoting hot pages

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH 0/4] Kernel daemon for detecting and promoting hot pages
@ 2025-03-06  5:45 Bharata B Rao
  2025-03-06  5:45 ` [RFC PATCH 1/4] mm: migrate: Allow misplaced migration without VMA too Bharata B Rao
                   ` (6 more replies)
  0 siblings, 7 replies; 38+ messages in thread
From: Bharata B Rao @ 2025-03-06  5:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Jonathan.Cameron,
	Michael.Day, akpm, dave.hansen, david, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, k.shutemov, kbusch, kmanaouil.dev,
	leesuyeon0506, leillc, liam.howlett, mgorman, mingo, nadav.amit,
	nphamcs, peterz, raghavendra.kt, riel, rientjes, rppt, shivankg,
	shy828301, sj, vbabka, weixugc, willy, ying.huang, ziy, dave,
	yuanchu, hyeonggon.yoo, Bharata B Rao

Hi,

This is an attempt towards having a single subsystem that accumulates
hot page information from lower memory tiers and does hot page
promotion.

At the heart of this subsystem is a kernel daemon named kpromoted that
does the following:

1. Exposes an API that other subsystems which detect/generate memory
   access information can use to inform the daemon about memory
   accesses from lower memory tiers.
2. Maintains the list of hot pages and attempts to promote them to
   toptiers.

Currently I have added AMD IBS driver as one source that provides
page access information as an example. This driver feeds info to
kpromoted in this RFC patchset. More sources were discussed in a
similar context here at [1].

This is just an early attempt to check what it takes to maintain
a single source of page hotness info and also separate hot page
detection mechanisms from the promotion mechanism. There are too
many open ends right now and I have listed a few of them below.

- The API that is provided to register memory access expects
  the PFN, NID and time of access at the minimum. This is
  described more in patch 2/4. This API currently can be called
  only from contexts that allow sleeping and hence this rules
  out using it from PTE scanning paths. The API needs to be
  more flexible with respect to this.
- Some sources like PTE A bit scanning can't provide the precise
  time of access or the NID that is accessing the page. The latter
  has been an open problem to which I haven't come across a good
  and acceptable solution.
- The way the hot page information is maintained is pretty
  primitive right now. Ideally we would like to store hotness info
  in such a way that it should be easily possible to lookup say N
  most hot pages.
- If PTE A bit scanners are considered as hotness sources, we will
  be bombarded with accesses. Do we want to accomodate all those
  accesses or just go with hotness info for fixed number of pages
  (possibly as a ratio of lower tier memory capacity)?
- Undoubtedly the mechanism to classify a page as hot and subsequent
  promotion needs to be more sophisticated than what I have right now.

This is just an early RFC posted now to ignite some discussion
in the context of LSFMM [2].

I am also working with Raghu to integrate his kmmdscan [3] as the
hotness source and use kpromoted for migration.

Also, I had posted the IBS driver ealier as an alternative to
hint faults based NUMA Balancing [4]. However here I am using
it as generic page hotness source.

[1] https://lore.kernel.org/linux-mm/de31971e-98fc-4baf-8f4f-09d153902e2e@amd.com/
[2] https://lore.kernel.org/linux-mm/20250123105721.424117-1-raghavendra.kt@amd.com/
[3] https://lore.kernel.org/all/20241201153818.2633616-1-raghavendra.kt@amd.com/
[3] https://lore.kernel.org/lkml/20230208073533.715-2-bharata@amd.com/

Regards,
Bharata.

Bharata B Rao (4):
  mm: migrate: Allow misplaced migration without VMA too
  mm: kpromoted: Hot page info collection and promotion daemon
  x86: ibs: In-kernel IBS driver for memory access profiling
  x86: ibs: Enable IBS profiling for memory accesses

 arch/x86/events/amd/ibs.c           |  11 +
 arch/x86/include/asm/entry-common.h |   3 +
 arch/x86/include/asm/hardirq.h      |   2 +
 arch/x86/include/asm/ibs.h          |   9 +
 arch/x86/include/asm/msr-index.h    |  16 ++
 arch/x86/mm/Makefile                |   3 +-
 arch/x86/mm/ibs.c                   | 344 ++++++++++++++++++++++++++++
 include/linux/kpromoted.h           |  54 +++++
 include/linux/mmzone.h              |   4 +
 include/linux/vm_event_item.h       |  30 +++
 mm/Kconfig                          |   7 +
 mm/Makefile                         |   1 +
 mm/kpromoted.c                      | 305 ++++++++++++++++++++++++
 mm/migrate.c                        |   5 +-
 mm/mm_init.c                        |  10 +
 mm/vmstat.c                         |  30 +++
 16 files changed, 831 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/include/asm/ibs.h
 create mode 100644 arch/x86/mm/ibs.c
 create mode 100644 include/linux/kpromoted.h
 create mode 100644 mm/kpromoted.c

-- 
2.34.1



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [RFC PATCH 1/4] mm: migrate: Allow misplaced migration without VMA too
  2025-03-06  5:45 [RFC PATCH 0/4] Kernel daemon for detecting and promoting hot pages Bharata B Rao
@ 2025-03-06  5:45 ` Bharata B Rao
  2025-03-06 12:13   ` David Hildenbrand
                     ` (2 more replies)
  2025-03-06  5:45 ` [RFC PATCH 2/4] mm: kpromoted: Hot page info collection and promotion daemon Bharata B Rao
                   ` (5 subsequent siblings)
  6 siblings, 3 replies; 38+ messages in thread
From: Bharata B Rao @ 2025-03-06  5:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Jonathan.Cameron,
	Michael.Day, akpm, dave.hansen, david, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, k.shutemov, kbusch, kmanaouil.dev,
	leesuyeon0506, leillc, liam.howlett, mgorman, mingo, nadav.amit,
	nphamcs, peterz, raghavendra.kt, riel, rientjes, rppt, shivankg,
	shy828301, sj, vbabka, weixugc, willy, ying.huang, ziy, dave,
	yuanchu, hyeonggon.yoo, Bharata B Rao

migrate_misplaced_folio_prepare() can be called from a
context where VMA isn't available. Allow the migration
to work from such contexts too.

Signed-off-by: Bharata B Rao <bharata@amd.com>
---
 mm/migrate.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index fb19a18892c8..5b21856a0dd0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2639,7 +2639,8 @@ static struct folio *alloc_misplaced_dst_folio(struct folio *src,
 
 /*
  * Prepare for calling migrate_misplaced_folio() by isolating the folio if
- * permitted. Must be called with the PTL still held.
+ * permitted. Must be called with the PTL still held if called with a non-NULL
+ * vma.
  */
 int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node)
@@ -2656,7 +2657,7 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
 		 * See folio_likely_mapped_shared() on possible imprecision
 		 * when we cannot easily detect if a folio is shared.
 		 */
-		if ((vma->vm_flags & VM_EXEC) &&
+		if (vma && (vma->vm_flags & VM_EXEC) &&
 		    folio_likely_mapped_shared(folio))
 			return -EACCES;
 
-- 
2.34.1



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [RFC PATCH 2/4] mm: kpromoted: Hot page info collection and promotion daemon
  2025-03-06  5:45 [RFC PATCH 0/4] Kernel daemon for detecting and promoting hot pages Bharata B Rao
  2025-03-06  5:45 ` [RFC PATCH 1/4] mm: migrate: Allow misplaced migration without VMA too Bharata B Rao
@ 2025-03-06  5:45 ` Bharata B Rao
  2025-03-06 17:22   ` Mike Day
                     ` (5 more replies)
  2025-03-06  5:45 ` [RFC PATCH 3/4] x86: ibs: In-kernel IBS driver for memory access profiling Bharata B Rao
                   ` (4 subsequent siblings)
  6 siblings, 6 replies; 38+ messages in thread
From: Bharata B Rao @ 2025-03-06  5:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Jonathan.Cameron,
	Michael.Day, akpm, dave.hansen, david, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, k.shutemov, kbusch, kmanaouil.dev,
	leesuyeon0506, leillc, liam.howlett, mgorman, mingo, nadav.amit,
	nphamcs, peterz, raghavendra.kt, riel, rientjes, rppt, shivankg,
	shy828301, sj, vbabka, weixugc, willy, ying.huang, ziy, dave,
	yuanchu, hyeonggon.yoo, Bharata B Rao

kpromoted is a kernel daemon that accumulates hot page info
from different sources and tries to promote pages from slow
tiers to top tiers. One instance of this thread runs on each
node that has CPUs.

Subsystems that generate hot page access info can report that
to kpromoted via this API:

int kpromoted_record_access(u64 pfn, int nid, int src,
			    unsigned long time)

@pfn: The PFN of the memory accessed
@nid: The accessing NUMA node ID
@src: The temperature source (subsystem) that generated the
      access info
@time: The access time in jiffies

Some temperature sources may not provide the nid from which
the page was accessed. This is true for sources that use
page table scanning for PTE Accessed bit. Currently the toptier
node to which such pages should be promoted to is hard coded.

Also, the access time provided some sources may at best be
considered approximate. This is especially true for hot pages
detected by PTE A bit scanning.

kpromoted currently maintains the hot PFN records in hash lists
hashed by PFN value. Each record stores the following info:

struct page_hotness_info {
	unsigned long pfn;

	/* Time when this record was updated last */
	unsigned long last_update;

	/*
	 * Number of times this page was accessed in the
	 * current window
	 */
	int frequency;

	/* Most recent access time */
	unsigned long recency;

	/* Most recent access from this node */
	int hot_node;

	struct hlist_node hnode;
};

The way in which a page is categorized as hot enough to be
promoted is pretty primitive now.

Signed-off-by: Bharata B Rao <bharata@amd.com>
---
 include/linux/kpromoted.h     |  54 ++++++
 include/linux/mmzone.h        |   4 +
 include/linux/vm_event_item.h |  13 ++
 mm/Kconfig                    |   7 +
 mm/Makefile                   |   1 +
 mm/kpromoted.c                | 305 ++++++++++++++++++++++++++++++++++
 mm/mm_init.c                  |  10 ++
 mm/vmstat.c                   |  13 ++
 8 files changed, 407 insertions(+)
 create mode 100644 include/linux/kpromoted.h
 create mode 100644 mm/kpromoted.c

diff --git a/include/linux/kpromoted.h b/include/linux/kpromoted.h
new file mode 100644
index 000000000000..2bef3d74f03a
--- /dev/null
+++ b/include/linux/kpromoted.h
@@ -0,0 +1,54 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_KPROMOTED_H
+#define _LINUX_KPROMOTED_H
+
+#include <linux/types.h>
+#include <linux/init.h>
+#include <linux/workqueue_types.h>
+
+/* Page hotness temperature sources */
+enum kpromoted_src {
+	KPROMOTED_HW_HINTS,
+	KPROMOTED_PGTABLE_SCAN,
+};
+
+#ifdef CONFIG_KPROMOTED
+
+#define KPROMOTED_FREQ_WINDOW	(5 * MSEC_PER_SEC)
+
+/* 2 accesses within a window will make the page a promotion candidate */
+#define KPRMOTED_FREQ_THRESHOLD	2
+
+#define KPROMOTED_HASH_ORDER	16
+
+struct page_hotness_info {
+	unsigned long pfn;
+
+	/* Time when this record was updated last */
+	unsigned long last_update;
+
+	/*
+	 * Number of times this page was accessed in the
+	 * current window
+	 */
+	int frequency;
+
+	/* Most recent access time */
+	unsigned long recency;
+
+	/* Most recent access from this node */
+	int hot_node;
+	struct hlist_node hnode;
+};
+
+#define KPROMOTE_DELAY	MSEC_PER_SEC
+
+int kpromoted_record_access(u64 pfn, int nid, int src, unsigned long now);
+#else
+static inline int kpromoted_record_access(u64 pfn, int nid, int src,
+					  unsigned long now)
+{
+	return 0;
+}
+#endif /* CONFIG_KPROMOTED */
+#endif /* _LINUX_KPROMOTED_H */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9540b41894da..a5c4e789aa55 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1459,6 +1459,10 @@ typedef struct pglist_data {
 #ifdef CONFIG_MEMORY_FAILURE
 	struct memory_failure_stats mf_stats;
 #endif
+#ifdef CONFIG_KPROMOTED
+	struct task_struct *kpromoted;
+	wait_queue_head_t kpromoted_wait;
+#endif
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index f70d0958095c..b5823b037883 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -182,6 +182,19 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		KSTACK_REST,
 #endif
 #endif /* CONFIG_DEBUG_STACK_USAGE */
+		KPROMOTED_RECORDED_ACCESSES,
+		KPROMOTED_RECORD_HWHINTS,
+		KPROMOTED_RECORD_PGTSCANS,
+		KPROMOTED_RECORD_TOPTIER,
+		KPROMOTED_RECORD_ADDED,
+		KPROMOTED_RECORD_EXISTS,
+		KPROMOTED_MIG_RIGHT_NODE,
+		KPROMOTED_MIG_NON_LRU,
+		KPROMOTED_MIG_COLD_OLD,
+		KPROMOTED_MIG_COLD_NOT_ACCESSED,
+		KPROMOTED_MIG_CANDIDATE,
+		KPROMOTED_MIG_PROMOTED,
+		KPROMOTED_MIG_DROPPED,
 		NR_VM_EVENT_ITEMS
 };
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 1b501db06417..ceaa462a0ce6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1358,6 +1358,13 @@ config PT_RECLAIM
 
 	  Note: now only empty user PTE page table pages will be reclaimed.
 
+config KPROMOTED
+	bool "Kernel hot page promotion daemon"
+	def_bool y
+	depends on NUMA && MIGRATION && MMU
+	help
+	  Promote hot pages from lower tier to top tier by using the
+	  memory access information provided by various sources.
 
 source "mm/damon/Kconfig"
 
diff --git a/mm/Makefile b/mm/Makefile
index 850386a67b3e..bf4f5f18f1f9 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -147,3 +147,4 @@ obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
 obj-$(CONFIG_EXECMEM) += execmem.o
 obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
 obj-$(CONFIG_PT_RECLAIM) += pt_reclaim.o
+obj-$(CONFIG_KPROMOTED) += kpromoted.o
diff --git a/mm/kpromoted.c b/mm/kpromoted.c
new file mode 100644
index 000000000000..2a8b8495b6b3
--- /dev/null
+++ b/mm/kpromoted.c
@@ -0,0 +1,305 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * kpromoted is a kernel thread that runs on each node that has CPU i,e.,
+ * on regular nodes.
+ *
+ * Maintains list of hot pages from lower tiers and promotes them.
+ */
+#include <linux/kpromoted.h>
+#include <linux/kthread.h>
+#include <linux/mutex.h>
+#include <linux/mmzone.h>
+#include <linux/migrate.h>
+#include <linux/memory-tiers.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include <linux/cpuhotplug.h>
+#include <linux/hashtable.h>
+
+static DEFINE_HASHTABLE(page_hotness_hash, KPROMOTED_HASH_ORDER);
+static struct mutex page_hotness_lock[1UL << KPROMOTED_HASH_ORDER];
+
+static int kpromote_page(struct page_hotness_info *phi)
+{
+	struct page *page = pfn_to_page(phi->pfn);
+	struct folio *folio;
+	int ret;
+
+	if (!page)
+		return 1;
+
+	folio = page_folio(page);
+	ret = migrate_misplaced_folio_prepare(folio, NULL, phi->hot_node);
+	if (ret)
+		return 1;
+
+	return migrate_misplaced_folio(folio, phi->hot_node);
+}
+
+static int page_should_be_promoted(struct page_hotness_info *phi)
+{
+	struct page *page = pfn_to_online_page(phi->pfn);
+	unsigned long now = jiffies;
+	struct folio *folio;
+
+	if (!page || is_zone_device_page(page))
+		return false;
+
+	folio = page_folio(page);
+	if (!folio_test_lru(folio)) {
+		count_vm_event(KPROMOTED_MIG_NON_LRU);
+		return false;
+	}
+	if (folio_nid(folio) == phi->hot_node) {
+		count_vm_event(KPROMOTED_MIG_RIGHT_NODE);
+		return false;
+	}
+
+	/* If the page was hot a while ago, don't promote */
+	if ((now - phi->last_update) > 2 * msecs_to_jiffies(KPROMOTED_FREQ_WINDOW)) {
+		count_vm_event(KPROMOTED_MIG_COLD_OLD);
+		return false;
+	}
+
+	/* If the page hasn't been accessed enough number of times, don't promote */
+	if (phi->frequency < KPRMOTED_FREQ_THRESHOLD) {
+		count_vm_event(KPROMOTED_MIG_COLD_NOT_ACCESSED);
+		return false;
+	}
+	return true;
+}
+
+/*
+ * Go thro' page hotness information and migrate pages if required.
+ *
+ * Promoted pages are not longer tracked in the hot list.
+ * Cold pages are pruned from the list as well.
+ *
+ * TODO: Batching could be done
+ */
+static void kpromoted_migrate(pg_data_t *pgdat)
+{
+	int nid = pgdat->node_id;
+	struct page_hotness_info *phi;
+	struct hlist_node *tmp;
+	int nr_bkts = HASH_SIZE(page_hotness_hash);
+	int bkt;
+
+	for (bkt = 0; bkt < nr_bkts; bkt++) {
+		mutex_lock(&page_hotness_lock[bkt]);
+		hlist_for_each_entry_safe(phi, tmp, &page_hotness_hash[bkt], hnode) {
+			if (phi->hot_node != nid)
+				continue;
+
+			if (page_should_be_promoted(phi)) {
+				count_vm_event(KPROMOTED_MIG_CANDIDATE);
+				if (!kpromote_page(phi)) {
+					count_vm_event(KPROMOTED_MIG_PROMOTED);
+					hlist_del_init(&phi->hnode);
+					kfree(phi);
+				}
+			} else {
+				/*
+				 * Not a suitable page or cold page, stop tracking it.
+				 * TODO: Identify cold pages and drive demotion?
+				 */
+				count_vm_event(KPROMOTED_MIG_DROPPED);
+				hlist_del_init(&phi->hnode);
+				kfree(phi);
+			}
+		}
+		mutex_unlock(&page_hotness_lock[bkt]);
+	}
+}
+
+static struct page_hotness_info *__kpromoted_lookup(unsigned long pfn, int bkt)
+{
+	struct page_hotness_info *phi;
+
+	hlist_for_each_entry(phi, &page_hotness_hash[bkt], hnode) {
+		if (phi->pfn == pfn)
+			return phi;
+	}
+	return NULL;
+}
+
+static struct page_hotness_info *kpromoted_lookup(unsigned long pfn, int bkt, unsigned long now)
+{
+	struct page_hotness_info *phi;
+
+	phi = __kpromoted_lookup(pfn, bkt);
+	if (!phi) {
+		phi = kzalloc(sizeof(struct page_hotness_info), GFP_KERNEL);
+		if (!phi)
+			return ERR_PTR(-ENOMEM);
+
+		phi->pfn = pfn;
+		phi->frequency = 1;
+		phi->last_update = now;
+		phi->recency = now;
+		hlist_add_head(&phi->hnode, &page_hotness_hash[bkt]);
+		count_vm_event(KPROMOTED_RECORD_ADDED);
+	} else {
+		count_vm_event(KPROMOTED_RECORD_EXISTS);
+	}
+	return phi;
+}
+
+/*
+ * Called by subsystems that generate page hotness/access information.
+ *
+ * Records the memory access info for futher action by kpromoted.
+ */
+int kpromoted_record_access(u64 pfn, int nid, int src, unsigned long now)
+{
+	struct page_hotness_info *phi;
+	struct page *page;
+	struct folio *folio;
+	int ret, bkt;
+
+	count_vm_event(KPROMOTED_RECORDED_ACCESSES);
+
+	switch (src) {
+	case KPROMOTED_HW_HINTS:
+		count_vm_event(KPROMOTED_RECORD_HWHINTS);
+		break;
+	case KPROMOTED_PGTABLE_SCAN:
+		count_vm_event(KPROMOTED_RECORD_PGTSCANS);
+		break;
+	default:
+		break;
+	}
+
+	/*
+	 * Record only accesses from lower tiers.
+	 * Assuming node having CPUs as toptier for now.
+	 */
+	if (node_is_toptier(pfn_to_nid(pfn))) {
+		count_vm_event(KPROMOTED_RECORD_TOPTIER);
+		return 0;
+	}
+
+	page = pfn_to_online_page(pfn);
+	if (!page || is_zone_device_page(page))
+		return 0;
+
+	folio = page_folio(page);
+	if (!folio_test_lru(folio))
+		return 0;
+
+	bkt = hash_min(pfn, KPROMOTED_HASH_ORDER);
+	mutex_lock(&page_hotness_lock[bkt]);
+	phi = kpromoted_lookup(pfn, bkt, now);
+	if (!phi) {
+		ret = PTR_ERR(phi);
+		goto out;
+	}
+
+	if ((phi->last_update - now) > msecs_to_jiffies(KPROMOTED_FREQ_WINDOW)) {
+		/* New window */
+		phi->frequency = 1; /* TODO: Factor in the history */
+		phi->last_update = now;
+	} else {
+		phi->frequency++;
+	}
+	phi->recency = now;
+
+	/*
+	 * TODOs:
+	 * 1. Source nid is hard-coded for some temperature sources
+	 * 2. Take action if hot_node changes - may be a shared page?
+	 * 3. Maintain node info for every access within the window?
+	 */
+	phi->hot_node = (nid == NUMA_NO_NODE) ? 1 : nid;
+	mutex_unlock(&page_hotness_lock[bkt]);
+out:
+	return 0;
+}
+
+/*
+ * Go through the accumulated mem_access_info and migrate
+ * pages if required.
+ */
+static void kpromoted_do_work(pg_data_t *pgdat)
+{
+	kpromoted_migrate(pgdat);
+}
+
+static inline bool kpromoted_work_requested(pg_data_t *pgdat)
+{
+	return false;
+}
+
+static int kpromoted(void *p)
+{
+	pg_data_t *pgdat = (pg_data_t *)p;
+	struct task_struct *tsk = current;
+	long timeout = msecs_to_jiffies(KPROMOTE_DELAY);
+
+	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
+
+	if (!cpumask_empty(cpumask))
+		set_cpus_allowed_ptr(tsk, cpumask);
+
+	while (!kthread_should_stop()) {
+		wait_event_timeout(pgdat->kpromoted_wait,
+				   kpromoted_work_requested(pgdat), timeout);
+		kpromoted_do_work(pgdat);
+	}
+	return 0;
+}
+
+static void kpromoted_run(int nid)
+{
+	pg_data_t *pgdat = NODE_DATA(nid);
+
+	if (pgdat->kpromoted)
+		return;
+
+	pgdat->kpromoted = kthread_run(kpromoted, pgdat, "kpromoted%d", nid);
+	if (IS_ERR(pgdat->kpromoted)) {
+		pr_err("Failed to start kpromoted on node %d\n", nid);
+		pgdat->kpromoted = NULL;
+	}
+}
+
+static int kpromoted_cpu_online(unsigned int cpu)
+{
+	int nid;
+
+	for_each_node_state(nid, N_CPU) {
+		pg_data_t *pgdat = NODE_DATA(nid);
+		const struct cpumask *mask;
+
+		mask = cpumask_of_node(pgdat->node_id);
+
+		if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
+			/* One of our CPUs online: restore mask */
+			if (pgdat->kpromoted)
+				set_cpus_allowed_ptr(pgdat->kpromoted, mask);
+	}
+	return 0;
+}
+
+static int __init kpromoted_init(void)
+{
+	int nid, ret, i;
+
+	ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
+					"mm/promotion:online",
+					kpromoted_cpu_online, NULL);
+	if (ret < 0) {
+		pr_err("kpromoted: failed to register hotplug callbacks.\n");
+		return ret;
+	}
+
+	for (i = 0; i < (1UL << KPROMOTED_HASH_ORDER); i++)
+		mutex_init(&page_hotness_lock[i]);
+
+	for_each_node_state(nid, N_CPU)
+		kpromoted_run(nid);
+
+	return 0;
+}
+
+subsys_initcall(kpromoted_init)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 2630cc30147e..d212df24f89b 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1362,6 +1362,15 @@ static void pgdat_init_kcompactd(struct pglist_data *pgdat)
 static void pgdat_init_kcompactd(struct pglist_data *pgdat) {}
 #endif
 
+#ifdef CONFIG_KPROMOTED
+static void pgdat_init_kpromoted(struct pglist_data *pgdat)
+{
+	init_waitqueue_head(&pgdat->kpromoted_wait);
+}
+#else
+static void pgdat_init_kpromoted(struct pglist_data *pgdat) {}
+#endif
+
 static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 {
 	int i;
@@ -1371,6 +1380,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 
 	pgdat_init_split_queue(pgdat);
 	pgdat_init_kcompactd(pgdat);
+	pgdat_init_kpromoted(pgdat);
 
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 16bfe1c694dd..618f44bae5c8 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1466,6 +1466,19 @@ const char * const vmstat_text[] = {
 	"kstack_rest",
 #endif
 #endif
+	"kpromoted_recorded_accesses",
+	"kpromoted_recorded_hwhints",
+	"kpromoted_recorded_pgtscans",
+	"kpromoted_record_toptier",
+	"kpromoted_record_added",
+	"kpromoted_record_exists",
+	"kpromoted_mig_right_node",
+	"kpromoted_mig_non_lru",
+	"kpromoted_mig_cold_old",
+	"kpromoted_mig_cold_not_accessed",
+	"kpromoted_mig_candidate",
+	"kpromoted_mig_promoted",
+	"kpromoted_mig_dropped",
 #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
-- 
2.34.1



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [RFC PATCH 3/4] x86: ibs: In-kernel IBS driver for memory access profiling
  2025-03-06  5:45 [RFC PATCH 0/4] Kernel daemon for detecting and promoting hot pages Bharata B Rao
  2025-03-06  5:45 ` [RFC PATCH 1/4] mm: migrate: Allow misplaced migration without VMA too Bharata B Rao
  2025-03-06  5:45 ` [RFC PATCH 2/4] mm: kpromoted: Hot page info collection and promotion daemon Bharata B Rao
@ 2025-03-06  5:45 ` Bharata B Rao
  2025-03-14 15:38   ` Jonathan Cameron
  2025-03-06  5:45 ` [RFC PATCH 4/4] x86: ibs: Enable IBS profiling for memory accesses Bharata B Rao
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 38+ messages in thread
From: Bharata B Rao @ 2025-03-06  5:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Jonathan.Cameron,
	Michael.Day, akpm, dave.hansen, david, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, k.shutemov, kbusch, kmanaouil.dev,
	leesuyeon0506, leillc, liam.howlett, mgorman, mingo, nadav.amit,
	nphamcs, peterz, raghavendra.kt, riel, rientjes, rppt, shivankg,
	shy828301, sj, vbabka, weixugc, willy, ying.huang, ziy, dave,
	yuanchu, hyeonggon.yoo, Bharata B Rao

Use IBS (Instruction Based Sampling) feature present
in AMD processors for memory access tracking. The access
information obtained from IBS via NMI is fed to kpromoted
daemon for futher action.

In addition to many other information related to the memory
access, IBS provides physical (and virtual) address of the access
and indicates if the access came from slower tier. Only memory
accesses originating from slower tiers are further acted upon
by this driver.

The samples are initially accumulated in percpu buffers which
are flushed to kpromoted using irq_work.

About IBS
---------
IBS can be programmed to provide data about instruction
execution periodically. This is done by programming a desired
sample count (number of ops) in a control register. When the
programmed number of ops are dispatched, a micro-op gets tagged,
various information about the tagged micro-op's execution is
populated in IBS execution MSRs and an interrupt is raised.
While IBS provides a lot of data for each sample, for the
purpose of  memory access profiling, we are interested in
linear and physical address of the memory access that reached
DRAM. Recent AMD processors provide further filtering where
it is possible to limit the sampling to those ops that had
an L3 miss which greately reduces the non-useful samples.

While IBS provides capability to sample instruction fetch
and execution, only IBS execution sampling is used here
to collect data about memory accesses that occur during
the instruction execution.

More information about IBS is available in Sec 13.3 of
AMD64 Architecture Programmer's Manual, Volume 2:System
Programming which is present at:
https://bugzilla.kernel.org/attachment.cgi?id=288923

Information about MSRs used for programming IBS can be
found in Sec 2.1.14.4 of PPR Vol 1 for AMD Family 19h
Model 11h B1 which is currently present at:
https://www.amd.com/system/files/TechDocs/55901_0.25.zip

Signed-off-by: Bharata B Rao <bharata@amd.com>
---
 arch/x86/events/amd/ibs.c        |  11 ++
 arch/x86/include/asm/ibs.h       |   7 +
 arch/x86/include/asm/msr-index.h |  16 ++
 arch/x86/mm/Makefile             |   3 +-
 arch/x86/mm/ibs.c                | 312 +++++++++++++++++++++++++++++++
 include/linux/vm_event_item.h    |  17 ++
 mm/vmstat.c                      |  17 ++
 7 files changed, 382 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/include/asm/ibs.h
 create mode 100644 arch/x86/mm/ibs.c

diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c
index e7a8b8758e08..35497e8c0846 100644
--- a/arch/x86/events/amd/ibs.c
+++ b/arch/x86/events/amd/ibs.c
@@ -13,8 +13,10 @@
 #include <linux/ptrace.h>
 #include <linux/syscore_ops.h>
 #include <linux/sched/clock.h>
+#include <linux/kpromoted.h>
 
 #include <asm/apic.h>
+#include <asm/ibs.h>
 
 #include "../perf_event.h"
 
@@ -1539,6 +1541,15 @@ static __init int amd_ibs_init(void)
 {
 	u32 caps;
 
+	/*
+	 * TODO: Find a clean way to disable perf IBS so that IBS
+	 * can be used for memory access profiling.
+	 */
+	if (arch_hw_access_profiling) {
+		pr_info("IBS isn't available for perf use\n");
+		return 0;
+	}
+
 	caps = __get_ibs_caps();
 	if (!caps)
 		return -ENODEV;	/* ibs not supported by the cpu */
diff --git a/arch/x86/include/asm/ibs.h b/arch/x86/include/asm/ibs.h
new file mode 100644
index 000000000000..b5a4f2ca6330
--- /dev/null
+++ b/arch/x86/include/asm/ibs.h
@@ -0,0 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_IBS_H
+#define _ASM_X86_IBS_H
+
+extern bool arch_hw_access_profiling;
+
+#endif /* _ASM_X86_IBS_H */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 72765b2fe0d8..12291e362b01 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -719,6 +719,22 @@
 /* AMD Last Branch Record MSRs */
 #define MSR_AMD64_LBR_SELECT			0xc000010e
 
+/* AMD IBS MSR bits */
+#define MSR_AMD64_IBSOPDATA2_DATASRC			0x7
+#define MSR_AMD64_IBSOPDATA2_DATASRC_LCL_CACHE		0x1
+#define MSR_AMD64_IBSOPDATA2_DATASRC_PEER_CACHE_NEAR	0x2
+#define MSR_AMD64_IBSOPDATA2_DATASRC_DRAM		0x3
+#define MSR_AMD64_IBSOPDATA2_DATASRC_FAR_CCX_CACHE	0x5
+#define MSR_AMD64_IBSOPDATA2_DATASRC_EXT_MEM		0x8
+#define	MSR_AMD64_IBSOPDATA2_RMTNODE			0x10
+
+#define MSR_AMD64_IBSOPDATA3_LDOP		BIT_ULL(0)
+#define MSR_AMD64_IBSOPDATA3_STOP		BIT_ULL(1)
+#define MSR_AMD64_IBSOPDATA3_DCMISS		BIT_ULL(7)
+#define MSR_AMD64_IBSOPDATA3_LADDR_VALID	BIT_ULL(17)
+#define MSR_AMD64_IBSOPDATA3_PADDR_VALID	BIT_ULL(18)
+#define MSR_AMD64_IBSOPDATA3_L2MISS		BIT_ULL(20)
+
 /* Zen4 */
 #define MSR_ZEN4_BP_CFG                 0xc001102e
 #define MSR_ZEN4_BP_CFG_SHARED_BTB_FIX_BIT 5
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 690fbf48e853..3b1a5dbbac64 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -26,7 +26,8 @@ CFLAGS_REMOVE_pgprot.o			= -pg
 endif
 
 obj-y				:=  init.o init_$(BITS).o fault.o ioremap.o extable.o mmap.o \
-				    pgtable.o physaddr.o tlb.o cpu_entry_area.o maccess.o pgprot.o
+				    pgtable.o physaddr.o tlb.o cpu_entry_area.o maccess.o pgprot.o \
+				    ibs.o
 
 obj-y				+= pat/
 
diff --git a/arch/x86/mm/ibs.c b/arch/x86/mm/ibs.c
new file mode 100644
index 000000000000..5c966050ad86
--- /dev/null
+++ b/arch/x86/mm/ibs.c
@@ -0,0 +1,312 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/init.h>
+#include <linux/kpromoted.h>
+#include <linux/percpu.h>
+#include <linux/workqueue.h>
+#include <linux/irq_work.h>
+
+#include <asm/nmi.h>
+#include <asm/perf_event.h> /* TODO: Move defns like IBS_OP_ENABLE into non-perf header */
+#include <asm/apic.h>
+#include <asm/ibs.h>
+
+bool arch_hw_access_profiling;
+static u64 ibs_config __read_mostly;
+static u32 ibs_caps;
+
+#define IBS_NR_SAMPLES	50
+
+/*
+ * Basic access info captured for each memory access.
+ */
+struct ibs_sample {
+	unsigned long pfn;
+	unsigned long time;	/* jiffies when accessed */
+	int nid;		/* Accessing node ID, if known */
+};
+
+/*
+ * Percpu buffer of access samples. Samples are accumulated here
+ * before pushing them to kpromoted for further action.
+ */
+struct ibs_sample_pcpu {
+	struct ibs_sample samples[IBS_NR_SAMPLES];
+	int head, tail;
+};
+
+struct ibs_sample_pcpu __percpu *ibs_s;
+
+/*
+ * The workqueue for pushing the percpu access samples to kpromoted.
+ */
+static struct work_struct ibs_work;
+static struct irq_work ibs_irq_work;
+
+/*
+ * Record the IBS-reported access sample in percpu buffer.
+ * Called from IBS NMI handler.
+ */
+static int ibs_push_sample(unsigned long pfn, int nid, unsigned long time)
+{
+	struct ibs_sample_pcpu *ibs_pcpu = raw_cpu_ptr(ibs_s);
+	int next = ibs_pcpu->head + 1;
+
+	if (next >= IBS_NR_SAMPLES)
+		next = 0;
+
+	if (next == ibs_pcpu->tail)
+		return 0;
+
+	ibs_pcpu->samples[ibs_pcpu->head].pfn = pfn;
+	ibs_pcpu->samples[ibs_pcpu->head].time = time;
+	ibs_pcpu->head = next;
+	return 1;
+}
+
+static int ibs_pop_sample(struct ibs_sample *s)
+{
+	struct ibs_sample_pcpu *ibs_pcpu = raw_cpu_ptr(ibs_s);
+
+	int next = ibs_pcpu->tail + 1;
+
+	if (ibs_pcpu->head == ibs_pcpu->tail)
+		return 0;
+
+	if (next >= IBS_NR_SAMPLES)
+		next = 0;
+
+	*s = ibs_pcpu->samples[ibs_pcpu->tail];
+	ibs_pcpu->tail = next;
+	return 1;
+}
+
+/*
+ * Remove access samples from percpu buffer and send them
+ * to kpromoted for further action.
+ */
+static void ibs_work_handler(struct work_struct *work)
+{
+	struct ibs_sample s;
+
+	while (ibs_pop_sample(&s))
+		kpromoted_record_access(s.pfn, s.nid, KPROMOTED_HW_HINTS,
+					s.time);
+}
+
+static void ibs_irq_handler(struct irq_work *i)
+{
+	schedule_work_on(smp_processor_id(), &ibs_work);
+}
+
+/*
+ * IBS NMI handler: Process the memory access info reported by IBS.
+ *
+ * Reads the MSRs to collect all the information about the reported
+ * memory access, validates the access, stores the valid sample and
+ * schedules the work on this CPU to further process the sample.
+ */
+static int ibs_overflow_handler(unsigned int cmd, struct pt_regs *regs)
+{
+	struct mm_struct *mm = current->mm;
+	u64 ops_ctl, ops_data3, ops_data2;
+	u64 laddr = -1, paddr = -1;
+	u64 data_src, rmt_node;
+	struct page *page;
+	unsigned long pfn;
+
+	rdmsrl(MSR_AMD64_IBSOPCTL, ops_ctl);
+
+	/*
+	 * When IBS sampling period is reprogrammed via read-modify-update
+	 * of MSR_AMD64_IBSOPCTL, overflow NMIs could be generated with
+	 * IBS_OP_ENABLE not set. For such cases, return as HANDLED.
+	 *
+	 * With this, the handler will say "handled" for all NMIs that
+	 * aren't related to this NMI.  This stems from the limitation of
+	 * having both status and control bits in one MSR.
+	 */
+	if (!(ops_ctl & IBS_OP_VAL))
+		goto handled;
+
+	wrmsrl(MSR_AMD64_IBSOPCTL, ops_ctl & ~IBS_OP_VAL);
+
+	count_vm_event(HWHINT_NR_EVENTS);
+
+	if (!user_mode(regs)) {
+		count_vm_event(HWHINT_KERNEL);
+		goto handled;
+	}
+
+	if (!mm) {
+		count_vm_event(HWHINT_KTHREAD);
+		goto handled;
+	}
+
+	rdmsrl(MSR_AMD64_IBSOPDATA3, ops_data3);
+
+	/* Load/Store ops only */
+	/* TODO: DataSrc isn't valid for stores, so filter out stores? */
+	if (!(ops_data3 & (MSR_AMD64_IBSOPDATA3_LDOP |
+			   MSR_AMD64_IBSOPDATA3_STOP))) {
+		count_vm_event(HWHINT_NON_LOAD_STORES);
+		goto handled;
+	}
+
+	/* Discard the sample if it was L1 or L2 hit */
+	if (!(ops_data3 & (MSR_AMD64_IBSOPDATA3_DCMISS |
+			   MSR_AMD64_IBSOPDATA3_L2MISS))) {
+		count_vm_event(HWHINT_DC_L2_HITS);
+		goto handled;
+	}
+
+	rdmsrl(MSR_AMD64_IBSOPDATA2, ops_data2);
+	data_src = ops_data2 & MSR_AMD64_IBSOPDATA2_DATASRC;
+	if (ibs_caps & IBS_CAPS_ZEN4)
+		data_src |= ((ops_data2 & 0xC0) >> 3);
+
+	switch (data_src) {
+	case MSR_AMD64_IBSOPDATA2_DATASRC_LCL_CACHE:
+		count_vm_event(HWHINT_LOCAL_L3L1L2);
+		break;
+	case MSR_AMD64_IBSOPDATA2_DATASRC_PEER_CACHE_NEAR:
+		count_vm_event(HWHINT_LOCAL_PEER_CACHE_NEAR);
+		break;
+	case MSR_AMD64_IBSOPDATA2_DATASRC_DRAM:
+		count_vm_event(HWHINT_DRAM_ACCESSES);
+		break;
+	case MSR_AMD64_IBSOPDATA2_DATASRC_EXT_MEM:
+		count_vm_event(HWHINT_CXL_ACCESSES);
+		break;
+	case MSR_AMD64_IBSOPDATA2_DATASRC_FAR_CCX_CACHE:
+		count_vm_event(HWHINT_FAR_CACHE_HITS);
+		break;
+	}
+
+	rmt_node = ops_data2 & MSR_AMD64_IBSOPDATA2_RMTNODE;
+	if (rmt_node)
+		count_vm_event(HWHINT_REMOTE_NODE);
+
+	/* Is linear addr valid? */
+	if (ops_data3 & MSR_AMD64_IBSOPDATA3_LADDR_VALID)
+		rdmsrl(MSR_AMD64_IBSDCLINAD, laddr);
+	else {
+		count_vm_event(HWHINT_LADDR_INVALID);
+		goto handled;
+	}
+
+	/* Discard kernel address accesses */
+	if (laddr & (1UL << 63)) {
+		count_vm_event(HWHINT_KERNEL_ADDR);
+		goto handled;
+	}
+
+	/* Is phys addr valid? */
+	if (ops_data3 & MSR_AMD64_IBSOPDATA3_PADDR_VALID)
+		rdmsrl(MSR_AMD64_IBSDCPHYSAD, paddr);
+	else {
+		count_vm_event(HWHINT_PADDR_INVALID);
+		goto handled;
+	}
+
+	pfn = PHYS_PFN(paddr);
+	page = pfn_to_online_page(pfn);
+	if (!page)
+		goto handled;
+
+	if (!PageLRU(page)) {
+		count_vm_event(HWHINT_NON_LRU);
+		goto handled;
+	}
+
+	if (!ibs_push_sample(pfn, numa_node_id(), jiffies)) {
+		count_vm_event(HWHINT_BUFFER_FULL);
+		goto handled;
+	}
+
+	irq_work_queue(&ibs_irq_work);
+	count_vm_event(HWHINT_USEFUL_SAMPLES);
+
+handled:
+	return NMI_HANDLED;
+}
+
+static inline int get_ibs_lvt_offset(void)
+{
+	u64 val;
+
+	rdmsrl(MSR_AMD64_IBSCTL, val);
+	if (!(val & IBSCTL_LVT_OFFSET_VALID))
+		return -EINVAL;
+
+	return val & IBSCTL_LVT_OFFSET_MASK;
+}
+
+static void setup_APIC_ibs(void)
+{
+	int offset;
+
+	offset = get_ibs_lvt_offset();
+	if (offset < 0)
+		goto failed;
+
+	if (!setup_APIC_eilvt(offset, 0, APIC_EILVT_MSG_NMI, 0))
+		return;
+failed:
+	pr_warn("IBS APIC setup failed on cpu #%d\n",
+		smp_processor_id());
+}
+
+static void clear_APIC_ibs(void)
+{
+	int offset;
+
+	offset = get_ibs_lvt_offset();
+	if (offset >= 0)
+		setup_APIC_eilvt(offset, 0, APIC_EILVT_MSG_FIX, 1);
+}
+
+static int x86_amd_ibs_access_profile_startup(unsigned int cpu)
+{
+	setup_APIC_ibs();
+	return 0;
+}
+
+static int x86_amd_ibs_access_profile_teardown(unsigned int cpu)
+{
+	clear_APIC_ibs();
+	return 0;
+}
+
+static int __init ibs_access_profiling_init(void)
+{
+	if (!boot_cpu_has(X86_FEATURE_IBS)) {
+		pr_info("IBS capability is unavailable for access profiling\n");
+		return 0;
+	}
+
+	ibs_s = alloc_percpu_gfp(struct ibs_sample_pcpu, __GFP_ZERO);
+	if (!ibs_s)
+		return 0;
+
+	INIT_WORK(&ibs_work, ibs_work_handler);
+	init_irq_work(&ibs_irq_work, ibs_irq_handler);
+
+	/* Uses IBS Op sampling */
+	ibs_config = IBS_OP_CNT_CTL | IBS_OP_ENABLE;
+	ibs_caps = cpuid_eax(IBS_CPUID_FEATURES);
+	if (ibs_caps & IBS_CAPS_ZEN4)
+		ibs_config |= IBS_OP_L3MISSONLY;
+
+	register_nmi_handler(NMI_LOCAL, ibs_overflow_handler, 0, "ibs");
+
+	cpuhp_setup_state(CPUHP_AP_PERF_X86_AMD_IBS_STARTING,
+			  "x86/amd/ibs_access_profile:starting",
+			  x86_amd_ibs_access_profile_startup,
+			  x86_amd_ibs_access_profile_teardown);
+
+	pr_info("IBS setup for memory access profiling\n");
+	return 0;
+}
+
+arch_initcall(ibs_access_profiling_init);
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index b5823b037883..24279c46054c 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -195,6 +195,23 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		KPROMOTED_MIG_CANDIDATE,
 		KPROMOTED_MIG_PROMOTED,
 		KPROMOTED_MIG_DROPPED,
+		HWHINT_NR_EVENTS,
+		HWHINT_KERNEL,
+		HWHINT_KTHREAD,
+		HWHINT_NON_LOAD_STORES,
+		HWHINT_DC_L2_HITS,
+		HWHINT_LOCAL_L3L1L2,
+		HWHINT_LOCAL_PEER_CACHE_NEAR,
+		HWHINT_FAR_CACHE_HITS,
+		HWHINT_DRAM_ACCESSES,
+		HWHINT_CXL_ACCESSES,
+		HWHINT_REMOTE_NODE,
+		HWHINT_LADDR_INVALID,
+		HWHINT_KERNEL_ADDR,
+		HWHINT_PADDR_INVALID,
+		HWHINT_NON_LRU,
+		HWHINT_BUFFER_FULL,
+		HWHINT_USEFUL_SAMPLES,
 		NR_VM_EVENT_ITEMS
 };
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 618f44bae5c8..a21d3118d6f6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1479,6 +1479,23 @@ const char * const vmstat_text[] = {
 	"kpromoted_mig_candidate",
 	"kpromoted_mig_promoted",
 	"kpromoted_mig_dropped",
+	"hwhint_nr_events",
+	"hwhint_kernel",
+	"hwhint_kthread",
+	"hwhint_non_load_stores",
+	"hwhint_dc_l2_hits",
+	"hwhint_local_l3l1l2",
+	"hwhint_local_peer_cache_near",
+	"hwhint_far_cache_hits",
+	"hwhint_dram_accesses",
+	"hwhint_cxl_accesses",
+	"hwhint_remote_node",
+	"hwhint_invalid_laddr",
+	"hwhint_kernel_addr",
+	"hwhint_invalid_paddr",
+	"hwhint_non_lru",
+	"hwhint_buffer_full",
+	"hwhint_useful_samples",
 #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
-- 
2.34.1



^ permalink raw reply	[flat|nested] 38+ messages in thread

* [RFC PATCH 4/4] x86: ibs: Enable IBS profiling for memory accesses
  2025-03-06  5:45 [RFC PATCH 0/4] Kernel daemon for detecting and promoting hot pages Bharata B Rao
                   ` (2 preceding siblings ...)
  2025-03-06  5:45 ` [RFC PATCH 3/4] x86: ibs: In-kernel IBS driver for memory access profiling Bharata B Rao
@ 2025-03-06  5:45 ` Bharata B Rao
  2025-03-16 22:00 ` [RFC PATCH 0/4] Kernel daemon for detecting and promoting hot pages SeongJae Park
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 38+ messages in thread
From: Bharata B Rao @ 2025-03-06  5:45 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Jonathan.Cameron,
	Michael.Day, akpm, dave.hansen, david, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, k.shutemov, kbusch, kmanaouil.dev,
	leesuyeon0506, leillc, liam.howlett, mgorman, mingo, nadav.amit,
	nphamcs, peterz, raghavendra.kt, riel, rientjes, rppt, shivankg,
	shy828301, sj, vbabka, weixugc, willy, ying.huang, ziy, dave,
	yuanchu, hyeonggon.yoo, Bharata B Rao

Enable IBS memory access data collection for user memory
accesses by programming the required MSRs. The profiling
is turned ON only for user mode execution and turned OFF
for kernel mode execution. Profiling is explicitly disabled
for NMI handler too.

TODOs:

- IBS sampling rate is kept fixed for now.
- Arch/vendor separation/isolation of the code needs relook.

Signed-off-by: Bharata B Rao <bharata@amd.com>
---
 arch/x86/include/asm/entry-common.h |  3 +++
 arch/x86/include/asm/hardirq.h      |  2 ++
 arch/x86/include/asm/ibs.h          |  2 ++
 arch/x86/mm/ibs.c                   | 32 +++++++++++++++++++++++++++++
 4 files changed, 39 insertions(+)

diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index 77d20555e04d..8127111c6ad3 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -9,10 +9,12 @@
 #include <asm/io_bitmap.h>
 #include <asm/fpu/api.h>
 #include <asm/fred.h>
+#include <asm/ibs.h>
 
 /* Check that the stack and regs on entry from user mode are sane. */
 static __always_inline void arch_enter_from_user_mode(struct pt_regs *regs)
 {
+	hw_access_profiling_stop();
 	if (IS_ENABLED(CONFIG_DEBUG_ENTRY)) {
 		/*
 		 * Make sure that the entry code gave us a sensible EFLAGS
@@ -98,6 +100,7 @@ static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
 static __always_inline void arch_exit_to_user_mode(void)
 {
 	amd_clear_divider();
+	hw_access_profiling_start();
 }
 #define arch_exit_to_user_mode arch_exit_to_user_mode
 
diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 6ffa8b75f4cd..b928fbbcf3e5 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -91,4 +91,6 @@ static __always_inline bool kvm_get_cpu_l1tf_flush_l1d(void)
 static __always_inline void kvm_set_cpu_l1tf_flush_l1d(void) { }
 #endif /* IS_ENABLED(CONFIG_KVM_INTEL) */
 
+#define arch_nmi_enter()	hw_access_profiling_stop()
+#define arch_nmi_exit()		hw_access_profiling_start()
 #endif /* _ASM_X86_HARDIRQ_H */
diff --git a/arch/x86/include/asm/ibs.h b/arch/x86/include/asm/ibs.h
index b5a4f2ca6330..6b480958534e 100644
--- a/arch/x86/include/asm/ibs.h
+++ b/arch/x86/include/asm/ibs.h
@@ -2,6 +2,8 @@
 #ifndef _ASM_X86_IBS_H
 #define _ASM_X86_IBS_H
 
+void hw_access_profiling_start(void);
+void hw_access_profiling_stop(void);
 extern bool arch_hw_access_profiling;
 
 #endif /* _ASM_X86_IBS_H */
diff --git a/arch/x86/mm/ibs.c b/arch/x86/mm/ibs.c
index 5c966050ad86..961d0c67ca50 100644
--- a/arch/x86/mm/ibs.c
+++ b/arch/x86/mm/ibs.c
@@ -15,6 +15,7 @@ bool arch_hw_access_profiling;
 static u64 ibs_config __read_mostly;
 static u32 ibs_caps;
 
+#define IBS_SAMPLE_PERIOD      10000
 #define IBS_NR_SAMPLES	50
 
 /*
@@ -99,6 +100,36 @@ static void ibs_irq_handler(struct irq_work *i)
 	schedule_work_on(smp_processor_id(), &ibs_work);
 }
 
+void hw_access_profiling_stop(void)
+{
+	u64 ops_ctl;
+
+	if (!arch_hw_access_profiling)
+		return;
+
+	rdmsrl(MSR_AMD64_IBSOPCTL, ops_ctl);
+	wrmsrl(MSR_AMD64_IBSOPCTL, ops_ctl & ~IBS_OP_ENABLE);
+}
+
+void hw_access_profiling_start(void)
+{
+	u64 config = 0;
+	unsigned int period = IBS_SAMPLE_PERIOD;
+
+	if (!arch_hw_access_profiling)
+		return;
+
+	/* Disable IBS for kernel thread */
+	if (!current->mm)
+		goto out;
+
+	config = (period >> 4)  & IBS_OP_MAX_CNT;
+	config |= (period & IBS_OP_MAX_CNT_EXT_MASK);
+	config |= ibs_config;
+out:
+	wrmsrl(MSR_AMD64_IBSOPCTL, config);
+}
+
 /*
  * IBS NMI handler: Process the memory access info reported by IBS.
  *
@@ -305,6 +336,7 @@ static int __init ibs_access_profiling_init(void)
 			  x86_amd_ibs_access_profile_startup,
 			  x86_amd_ibs_access_profile_teardown);
 
+	arch_hw_access_profiling = true;
 	pr_info("IBS setup for memory access profiling\n");
 	return 0;
 }
-- 
2.34.1



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 1/4] mm: migrate: Allow misplaced migration without VMA too
  2025-03-06  5:45 ` [RFC PATCH 1/4] mm: migrate: Allow misplaced migration without VMA too Bharata B Rao
@ 2025-03-06 12:13   ` David Hildenbrand
  2025-03-07  3:00     ` Bharata B Rao
  2025-03-06 17:24   ` Gregory Price
  2025-03-24  2:55   ` Balbir Singh
  2 siblings, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2025-03-06 12:13 UTC (permalink / raw)
  To: Bharata B Rao, linux-kernel, linux-mm
  Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Jonathan.Cameron,
	Michael.Day, akpm, dave.hansen, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, k.shutemov, kbusch, kmanaouil.dev,
	leesuyeon0506, leillc, liam.howlett, mgorman, mingo, nadav.amit,
	nphamcs, peterz, raghavendra.kt, riel, rientjes, rppt, shivankg,
	shy828301, sj, vbabka, weixugc, willy, ying.huang, ziy, dave,
	yuanchu, hyeonggon.yoo

On 06.03.25 06:45, Bharata B Rao wrote:
> migrate_misplaced_folio_prepare() can be called from a
> context where VMA isn't available. Allow the migration
> to work from such contexts too.

I was initially confused about "can be called", because it can't

Consider phrasing it as "We want to make use of 
alloc_misplaced_dst_folio() in context where we don't have VMA 
information available. To prepare for that ..."

> 
> Signed-off-by: Bharata B Rao <bharata@amd.com>
> ---
>   mm/migrate.c | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index fb19a18892c8..5b21856a0dd0 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2639,7 +2639,8 @@ static struct folio *alloc_misplaced_dst_folio(struct folio *src,
>   
>   /*
>    * Prepare for calling migrate_misplaced_folio() by isolating the folio if
> - * permitted. Must be called with the PTL still held.
> + * permitted. Must be called with the PTL still held if called with a non-NULL
> + * vma.
>    */
>   int migrate_misplaced_folio_prepare(struct folio *folio,
>   		struct vm_area_struct *vma, int node)
> @@ -2656,7 +2657,7 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
>   		 * See folio_likely_mapped_shared() on possible imprecision
>   		 * when we cannot easily detect if a folio is shared.
>   		 */
> -		if ((vma->vm_flags & VM_EXEC) &&
> +		if (vma && (vma->vm_flags & VM_EXEC) &&
>   		    folio_likely_mapped_shared(folio))
>   			return -EACCES;
>   


-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 2/4] mm: kpromoted: Hot page info collection and promotion daemon
  2025-03-06  5:45 ` [RFC PATCH 2/4] mm: kpromoted: Hot page info collection and promotion daemon Bharata B Rao
@ 2025-03-06 17:22   ` Mike Day
  2025-03-07  3:27     ` Bharata B Rao
  2025-03-13 16:44   ` Davidlohr Bueso
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 38+ messages in thread
From: Mike Day @ 2025-03-06 17:22 UTC (permalink / raw)
  To: Bharata B Rao, linux-kernel, linux-mm
  Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Jonathan.Cameron, akpm,
	dave.hansen, david, feng.tang, gourry, hannes, honggyu.kim,
	hughd, jhubbard, k.shutemov, kbusch, kmanaouil.dev,
	leesuyeon0506, leillc, liam.howlett, mgorman, mingo, nadav.amit,
	nphamcs, peterz, raghavendra.kt, riel, rientjes, rppt, shivankg,
	shy828301, sj, vbabka, weixugc, willy, ying.huang, ziy, dave,
	yuanchu, hyeonggon.yoo



On 3/5/25 23:45, Bharata B Rao wrote:
> +static void kpromoted_migrate(pg_data_t *pgdat)
> +{
> +	int nid = pgdat->node_id;
> +	struct page_hotness_info *phi;
> +	struct hlist_node *tmp;
> +	int nr_bkts = HASH_SIZE(page_hotness_hash);
> +	int bkt;
> +
> +	for (bkt = 0; bkt < nr_bkts; bkt++) {
> +		mutex_lock(&page_hotness_lock[bkt]);
> +		hlist_for_each_entry_safe(phi, tmp, &page_hotness_hash[bkt], hnode) {
> +			if (phi->hot_node != nid)
> +				continue;
> +
> +			if (page_should_be_promoted(phi)) {
> +				count_vm_event(KPROMOTED_MIG_CANDIDATE);
> +				if (!kpromote_page(phi)) {
> +					count_vm_event(KPROMOTED_MIG_PROMOTED);
> +					hlist_del_init(&phi->hnode);
> +					kfree(phi);
> +				}
> +			} else {
> +				/*
> +				 * Not a suitable page or cold page, stop tracking it.
> +				 * TODO: Identify cold pages and drive demotion?
> +				 */
> +				count_vm_event(KPROMOTED_MIG_DROPPED);
> +				hlist_del_init(&phi->hnode);
> +				kfree(phi);
> +			}
> +		}
> +		mutex_unlock(&page_hotness_lock[bkt]);
> +	}
> +}
> +
> +static struct page_hotness_info *__kpromoted_lookup(unsigned long pfn, int bkt)
> +{
> +	struct page_hotness_info *phi;
> +
> +	hlist_for_each_entry(phi, &page_hotness_hash[bkt], hnode) {

Should this be hlist_for_each_entry_safe(), given that kpromoted_migrate() may be
running concurrently?

Mike
> +		if (phi->pfn == pfn)
> +			return phi;
> +	}
> +	return NULL;
> +}


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 1/4] mm: migrate: Allow misplaced migration without VMA too
  2025-03-06  5:45 ` [RFC PATCH 1/4] mm: migrate: Allow misplaced migration without VMA too Bharata B Rao
  2025-03-06 12:13   ` David Hildenbrand
@ 2025-03-06 17:24   ` Gregory Price
  2025-03-06 17:45     ` Matthew Wilcox
  2025-03-24  2:55   ` Balbir Singh
  2 siblings, 1 reply; 38+ messages in thread
From: Gregory Price @ 2025-03-06 17:24 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-kernel, linux-mm, AneeshKumar.KizhakeVeetil, Hasan.Maruf,
	Jonathan.Cameron, Michael.Day, akpm, dave.hansen, david,
	feng.tang, hannes, honggyu.kim, hughd, jhubbard, k.shutemov,
	kbusch, kmanaouil.dev, leesuyeon0506, leillc, liam.howlett,
	mgorman, mingo, nadav.amit, nphamcs, peterz, raghavendra.kt,
	riel, rientjes, rppt, shivankg, shy828301, sj, vbabka, weixugc,
	willy, ying.huang, ziy, dave, yuanchu, hyeonggon.yoo

On Thu, Mar 06, 2025 at 11:15:29AM +0530, Bharata B Rao wrote:
> migrate_misplaced_folio_prepare() can be called from a
> context where VMA isn't available. Allow the migration
> to work from such contexts too.
> 
> Signed-off-by: Bharata B Rao <bharata@amd.com>

I have a similar patch in the unmapped pagecache RFC

we may also need this:
https://lore.kernel.org/linux-mm/20250107000346.1338481-4-gourry@gourry.net/

May be worth just pulling these ahead to avoid conflict.

~Gregory


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 1/4] mm: migrate: Allow misplaced migration without VMA too
  2025-03-06 17:24   ` Gregory Price
@ 2025-03-06 17:45     ` Matthew Wilcox
  2025-03-06 18:19       ` Gregory Price
  0 siblings, 1 reply; 38+ messages in thread
From: Matthew Wilcox @ 2025-03-06 17:45 UTC (permalink / raw)
  To: Gregory Price
  Cc: Bharata B Rao, linux-kernel, linux-mm, AneeshKumar.KizhakeVeetil,
	Hasan.Maruf, Jonathan.Cameron, Michael.Day, akpm, dave.hansen,
	david, feng.tang, hannes, honggyu.kim, hughd, jhubbard,
	k.shutemov, kbusch, kmanaouil.dev, leesuyeon0506, leillc,
	liam.howlett, mgorman, mingo, nadav.amit, nphamcs, peterz,
	raghavendra.kt, riel, rientjes, rppt, shivankg, shy828301, sj,
	vbabka, weixugc, ying.huang, ziy, dave, yuanchu, hyeonggon.yoo

On Thu, Mar 06, 2025 at 12:24:16PM -0500, Gregory Price wrote:
> On Thu, Mar 06, 2025 at 11:15:29AM +0530, Bharata B Rao wrote:
> > migrate_misplaced_folio_prepare() can be called from a
> > context where VMA isn't available. Allow the migration
> > to work from such contexts too.
> > 
> > Signed-off-by: Bharata B Rao <bharata@amd.com>
> 
> I have a similar patch in the unmapped pagecache RFC
> 
> we may also need this:
> https://lore.kernel.org/linux-mm/20250107000346.1338481-4-gourry@gourry.net/
> 
> May be worth just pulling these ahead to avoid conflict.

Or not putting them in at all because this whole thing is a magnificent
waste of time?


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 1/4] mm: migrate: Allow misplaced migration without VMA too
  2025-03-06 17:45     ` Matthew Wilcox
@ 2025-03-06 18:19       ` Gregory Price
  2025-03-06 18:42         ` Matthew Wilcox
  0 siblings, 1 reply; 38+ messages in thread
From: Gregory Price @ 2025-03-06 18:19 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Bharata B Rao, linux-kernel, linux-mm, AneeshKumar.KizhakeVeetil,
	Hasan.Maruf, Jonathan.Cameron, Michael.Day, akpm, dave.hansen,
	david, feng.tang, hannes, honggyu.kim, hughd, jhubbard,
	k.shutemov, kbusch, kmanaouil.dev, leesuyeon0506, leillc,
	liam.howlett, mgorman, mingo, nadav.amit, nphamcs, peterz,
	raghavendra.kt, riel, rientjes, rppt, shivankg, shy828301, sj,
	vbabka, weixugc, ying.huang, ziy, dave, yuanchu, hyeonggon.yoo

On Thu, Mar 06, 2025 at 05:45:34PM +0000, Matthew Wilcox wrote:
> On Thu, Mar 06, 2025 at 12:24:16PM -0500, Gregory Price wrote:
> > we may also need this:
> > https://lore.kernel.org/linux-mm/20250107000346.1338481-4-gourry@gourry.net/
> > 
> > May be worth just pulling these ahead to avoid conflict.
> 
> Or not putting them in at all because this whole thing is a magnificent
> waste of time?

Divorced from the tiering mechanisms, is making misplaced migration able
to migrate unmapped pages not generally useful?

~Gregory


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 1/4] mm: migrate: Allow misplaced migration without VMA too
  2025-03-06 18:19       ` Gregory Price
@ 2025-03-06 18:42         ` Matthew Wilcox
  2025-03-06 20:03           ` Gregory Price
  0 siblings, 1 reply; 38+ messages in thread
From: Matthew Wilcox @ 2025-03-06 18:42 UTC (permalink / raw)
  To: Gregory Price
  Cc: Bharata B Rao, linux-kernel, linux-mm, AneeshKumar.KizhakeVeetil,
	Hasan.Maruf, Jonathan.Cameron, Michael.Day, akpm, dave.hansen,
	david, feng.tang, hannes, honggyu.kim, hughd, jhubbard,
	k.shutemov, kbusch, kmanaouil.dev, leesuyeon0506, leillc,
	liam.howlett, mgorman, mingo, nadav.amit, nphamcs, peterz,
	raghavendra.kt, riel, rientjes, rppt, shivankg, shy828301, sj,
	vbabka, weixugc, ying.huang, ziy, dave, yuanchu, hyeonggon.yoo

On Thu, Mar 06, 2025 at 01:19:41PM -0500, Gregory Price wrote:
> On Thu, Mar 06, 2025 at 05:45:34PM +0000, Matthew Wilcox wrote:
> > On Thu, Mar 06, 2025 at 12:24:16PM -0500, Gregory Price wrote:
> > > we may also need this:
> > > https://lore.kernel.org/linux-mm/20250107000346.1338481-4-gourry@gourry.net/
> > > 
> > > May be worth just pulling these ahead to avoid conflict.
> > 
> > Or not putting them in at all because this whole thing is a magnificent
> > waste of time?
> 
> Divorced from the tiering mechanisms, is making misplaced migration able
> to migrate unmapped pages not generally useful?

The only thing I can think of is if you have a process or set of
processes on node A calling read() and the file is cached on node B.
But in order to decide if the page is on the wrong node, you'd need
to track a lot of information about which nodes the page is being
accessed from.  Which is probably why we've never bothered to do it.

This is not a large patch for you to carry as part of your patchset.
There's nothing intrinsically wrong with it; it just has no users in
mainline and no real prospect of any being added soon.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 1/4] mm: migrate: Allow misplaced migration without VMA too
  2025-03-06 18:42         ` Matthew Wilcox
@ 2025-03-06 20:03           ` Gregory Price
  0 siblings, 0 replies; 38+ messages in thread
From: Gregory Price @ 2025-03-06 20:03 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Bharata B Rao, linux-kernel, linux-mm, AneeshKumar.KizhakeVeetil,
	Hasan.Maruf, Jonathan.Cameron, Michael.Day, akpm, dave.hansen,
	david, feng.tang, hannes, honggyu.kim, hughd, jhubbard,
	k.shutemov, kbusch, kmanaouil.dev, leesuyeon0506, leillc,
	liam.howlett, mgorman, mingo, nadav.amit, nphamcs, peterz,
	raghavendra.kt, riel, rientjes, rppt, shivankg, shy828301, sj,
	vbabka, weixugc, ying.huang, ziy, dave, yuanchu, hyeonggon.yoo

On Thu, Mar 06, 2025 at 06:42:10PM +0000, Matthew Wilcox wrote:
> On Thu, Mar 06, 2025 at 01:19:41PM -0500, Gregory Price wrote:
> > Divorced from the tiering mechanisms, is making misplaced migration able
> > to migrate unmapped pages not generally useful?
> 
> The only thing I can think of is if you have a process or set of
> processes on node A calling read() and the file is cached on node B.
> But in order to decide if the page is on the wrong node, you'd need
> to track a lot of information about which nodes the page is being
> accessed from.  Which is probably why we've never bothered to do it.
> 
> This is not a large patch for you to carry as part of your patchset.
> There's nothing intrinsically wrong with it; it just has no users in
> mainline and no real prospect of any being added soon.

That's fair, I'm just tracking 3-4 different RFCs that are going to butt
up against this, so wanted to assess whether getting the patches out
ahead would save some strife.

~Gregory


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 1/4] mm: migrate: Allow misplaced migration without VMA too
  2025-03-06 12:13   ` David Hildenbrand
@ 2025-03-07  3:00     ` Bharata B Rao
  0 siblings, 0 replies; 38+ messages in thread
From: Bharata B Rao @ 2025-03-07  3:00 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel, linux-mm
  Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Jonathan.Cameron,
	Michael.Day, akpm, dave.hansen, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, k.shutemov, kbusch, kmanaouil.dev,
	leesuyeon0506, leillc, liam.howlett, mgorman, mingo, nadav.amit,
	nphamcs, peterz, raghavendra.kt, riel, rientjes, rppt, shivankg,
	shy828301, sj, vbabka, weixugc, willy, ying.huang, ziy, dave,
	yuanchu, hyeonggon.yoo

On 06-Mar-25 5:43 PM, David Hildenbrand wrote:
> On 06.03.25 06:45, Bharata B Rao wrote:
>> migrate_misplaced_folio_prepare() can be called from a
>> context where VMA isn't available. Allow the migration
>> to work from such contexts too.
> 
> I was initially confused about "can be called", because it can't
> 
> Consider phrasing it as "We want to make use of 
> alloc_misplaced_dst_folio() in context where we don't have VMA 
> information available. To prepare for that ..."

Yes, that would be the right wording.

Thanks,
Bharata.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 2/4] mm: kpromoted: Hot page info collection and promotion daemon
  2025-03-06 17:22   ` Mike Day
@ 2025-03-07  3:27     ` Bharata B Rao
  0 siblings, 0 replies; 38+ messages in thread
From: Bharata B Rao @ 2025-03-07  3:27 UTC (permalink / raw)
  To: michael.day, linux-kernel, linux-mm
  Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Jonathan.Cameron, akpm,
	dave.hansen, david, feng.tang, gourry, hannes, honggyu.kim,
	hughd, jhubbard, k.shutemov, kbusch, kmanaouil.dev,
	leesuyeon0506, leillc, liam.howlett, mgorman, mingo, nadav.amit,
	nphamcs, peterz, raghavendra.kt, riel, rientjes, rppt, shivankg,
	shy828301, sj, vbabka, weixugc, willy, ying.huang, ziy, dave,
	yuanchu

On 06-Mar-25 10:52 PM, Mike Day wrote:
> 
> 
> On 3/5/25 23:45, Bharata B Rao wrote:
>> +static void kpromoted_migrate(pg_data_t *pgdat)
>> +{
>> +    int nid = pgdat->node_id;
>> +    struct page_hotness_info *phi;
>> +    struct hlist_node *tmp;
>> +    int nr_bkts = HASH_SIZE(page_hotness_hash);
>> +    int bkt;
>> +
>> +    for (bkt = 0; bkt < nr_bkts; bkt++) {
>> +        mutex_lock(&page_hotness_lock[bkt]);
>> +        hlist_for_each_entry_safe(phi, tmp, &page_hotness_hash[bkt], 
>> hnode) {
>> +            if (phi->hot_node != nid)
>> +                continue;
>> +
>> +            if (page_should_be_promoted(phi)) {
>> +                count_vm_event(KPROMOTED_MIG_CANDIDATE);
>> +                if (!kpromote_page(phi)) {
>> +                    count_vm_event(KPROMOTED_MIG_PROMOTED);
>> +                    hlist_del_init(&phi->hnode);
>> +                    kfree(phi);
>> +                }
>> +            } else {
>> +                /*
>> +                 * Not a suitable page or cold page, stop tracking it.
>> +                 * TODO: Identify cold pages and drive demotion?
>> +                 */
>> +                count_vm_event(KPROMOTED_MIG_DROPPED);
>> +                hlist_del_init(&phi->hnode);
>> +                kfree(phi);
>> +            }
>> +        }
>> +        mutex_unlock(&page_hotness_lock[bkt]);
>> +    }
>> +}
>> +
>> +static struct page_hotness_info *__kpromoted_lookup(unsigned long 
>> pfn, int bkt)
>> +{
>> +    struct page_hotness_info *phi;
>> +
>> +    hlist_for_each_entry(phi, &page_hotness_hash[bkt], hnode) {
> 
> Should this be hlist_for_each_entry_safe(), given that 
> kpromoted_migrate() may be
> running concurrently?

I don't think so because the migration path can't walk the list 
concurrently as the lists are protected by mutex.

Regards,
Bharata.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 2/4] mm: kpromoted: Hot page info collection and promotion daemon
  2025-03-06  5:45 ` [RFC PATCH 2/4] mm: kpromoted: Hot page info collection and promotion daemon Bharata B Rao
  2025-03-06 17:22   ` Mike Day
@ 2025-03-13 16:44   ` Davidlohr Bueso
  2025-03-17  3:39     ` Bharata B Rao
  2025-03-13 20:36   ` Davidlohr Bueso
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 38+ messages in thread
From: Davidlohr Bueso @ 2025-03-13 16:44 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-kernel, linux-mm, AneeshKumar.KizhakeVeetil, Hasan.Maruf,
	Jonathan.Cameron, Michael.Day, akpm, dave.hansen, david,
	feng.tang, gourry, hannes, honggyu.kim, hughd, jhubbard,
	k.shutemov, kbusch, kmanaouil.dev, leesuyeon0506, leillc,
	liam.howlett, mgorman, mingo, nadav.amit, nphamcs, peterz,
	raghavendra.kt, riel, rientjes, rppt, shivankg, shy828301, sj,
	vbabka, weixugc, willy, ying.huang, ziy, yuanchu, hyeonggon.yoo

On Thu, 06 Mar 2025, Bharata B Rao wrote:

>+static int page_should_be_promoted(struct page_hotness_info *phi)
>+{
>+	struct page *page = pfn_to_online_page(phi->pfn);
>+	unsigned long now = jiffies;
>+	struct folio *folio;
>+
>+	if (!page || is_zone_device_page(page))
>+		return false;
>+
>+	folio = page_folio(page);
>+	if (!folio_test_lru(folio)) {
>+		count_vm_event(KPROMOTED_MIG_NON_LRU);
>+		return false;
>+	}
>+	if (folio_nid(folio) == phi->hot_node) {
>+		count_vm_event(KPROMOTED_MIG_RIGHT_NODE);
>+		return false;
>+	}

How about using the LRU age itself:

if (folio_test_active())
    return true;

>+
>+	/* If the page was hot a while ago, don't promote */
>+	if ((now - phi->last_update) > 2 * msecs_to_jiffies(KPROMOTED_FREQ_WINDOW)) {
>+		count_vm_event(KPROMOTED_MIG_COLD_OLD);
>+		return false;
>+	}
>+
>+	/* If the page hasn't been accessed enough number of times, don't promote */
>+	if (phi->frequency < KPRMOTED_FREQ_THRESHOLD) {
>+		count_vm_event(KPROMOTED_MIG_COLD_NOT_ACCESSED);
>+		return false;
>+	}
>+	return true;
>+}

...

>+static int kpromoted(void *p)
>+{
>+	pg_data_t *pgdat = (pg_data_t *)p;
>+	struct task_struct *tsk = current;
>+	long timeout = msecs_to_jiffies(KPROMOTE_DELAY);
>+
>+	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
>+
>+	if (!cpumask_empty(cpumask))
>+		set_cpus_allowed_ptr(tsk, cpumask);

Explicit cpumasks are not needed if you use kthread_create_on_node().

See https://web.git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=c6a566f6c1b4d5dff659acd221f95a72923f4085

>+
>+	while (!kthread_should_stop()) {
>+		wait_event_timeout(pgdat->kpromoted_wait,
>+				   kpromoted_work_requested(pgdat), timeout);
>+		kpromoted_do_work(pgdat);
>+	}
>+	return 0;
>+}
>+
>+static void kpromoted_run(int nid)
>+{
>+	pg_data_t *pgdat = NODE_DATA(nid);
>+
>+	if (pgdat->kpromoted)
>+		return;
>+
>+	pgdat->kpromoted = kthread_run(kpromoted, pgdat, "kpromoted%d", nid);
>+	if (IS_ERR(pgdat->kpromoted)) {
>+		pr_err("Failed to start kpromoted on node %d\n", nid);
>+		pgdat->kpromoted = NULL;
>+	}
>+}
>+


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 2/4] mm: kpromoted: Hot page info collection and promotion daemon
  2025-03-06  5:45 ` [RFC PATCH 2/4] mm: kpromoted: Hot page info collection and promotion daemon Bharata B Rao
  2025-03-06 17:22   ` Mike Day
  2025-03-13 16:44   ` Davidlohr Bueso
@ 2025-03-13 20:36   ` Davidlohr Bueso
  2025-03-17  3:49     ` Bharata B Rao
  2025-03-14 15:28   ` Jonathan Cameron
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 38+ messages in thread
From: Davidlohr Bueso @ 2025-03-13 20:36 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-kernel, linux-mm, AneeshKumar.KizhakeVeetil, Hasan.Maruf,
	Jonathan.Cameron, Michael.Day, akpm, dave.hansen, david,
	feng.tang, gourry, hannes, honggyu.kim, hughd, jhubbard,
	k.shutemov, kbusch, kmanaouil.dev, leesuyeon0506, leillc,
	liam.howlett, mgorman, mingo, nadav.amit, nphamcs, peterz,
	raghavendra.kt, riel, rientjes, rppt, shivankg, shy828301, sj,
	vbabka, weixugc, willy, ying.huang, ziy, yuanchu, hyeonggon.yoo

On Thu, 06 Mar 2025, Bharata B Rao wrote:

>+/*
>+ * Go thro' page hotness information and migrate pages if required.
>+ *
>+ * Promoted pages are not longer tracked in the hot list.
>+ * Cold pages are pruned from the list as well.
>+ *
>+ * TODO: Batching could be done
>+ */
>+static void kpromoted_migrate(pg_data_t *pgdat)
>+{
>+	int nid = pgdat->node_id;
>+	struct page_hotness_info *phi;
>+	struct hlist_node *tmp;
>+	int nr_bkts = HASH_SIZE(page_hotness_hash);
>+	int bkt;
>+
>+	for (bkt = 0; bkt < nr_bkts; bkt++) {
>+		mutex_lock(&page_hotness_lock[bkt]);
>+		hlist_for_each_entry_safe(phi, tmp, &page_hotness_hash[bkt], hnode) {
>+			if (phi->hot_node != nid)
>+				continue;
>+
>+			if (page_should_be_promoted(phi)) {
>+				count_vm_event(KPROMOTED_MIG_CANDIDATE);
>+				if (!kpromote_page(phi)) {
>+					count_vm_event(KPROMOTED_MIG_PROMOTED);
>+					hlist_del_init(&phi->hnode);
>+					kfree(phi);
>+				}
>+			} else {
>+				/*
>+				 * Not a suitable page or cold page, stop tracking it.
>+				 * TODO: Identify cold pages and drive demotion?
>+				 */

I don't think kpromoted should drive demotion at all. No one is complaining about migrate
in lieu of discard, and there is also proactive reclaim which users can trigger. All the
in-kernel problems are wrt promotion. The simpler any of these kthreads are the better.

>+				count_vm_event(KPROMOTED_MIG_DROPPED);
>+				hlist_del_init(&phi->hnode);
>+				kfree(phi);
>+			}
>+		}
>+		mutex_unlock(&page_hotness_lock[bkt]);
>+	}
>+}


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 2/4] mm: kpromoted: Hot page info collection and promotion daemon
  2025-03-06  5:45 ` [RFC PATCH 2/4] mm: kpromoted: Hot page info collection and promotion daemon Bharata B Rao
                     ` (2 preceding siblings ...)
  2025-03-13 20:36   ` Davidlohr Bueso
@ 2025-03-14 15:28   ` Jonathan Cameron
  2025-03-18  4:09     ` Bharata B Rao
  2025-03-24  3:35   ` Balbir Singh
  2025-03-24 13:43   ` Gregory Price
  5 siblings, 1 reply; 38+ messages in thread
From: Jonathan Cameron @ 2025-03-14 15:28 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-kernel, linux-mm, AneeshKumar.KizhakeVeetil, Hasan.Maruf,
	Michael.Day, akpm, dave.hansen, david, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, k.shutemov, kbusch, kmanaouil.dev,
	leesuyeon0506, leillc, liam.howlett, mgorman, mingo, nadav.amit,
	nphamcs, peterz, raghavendra.kt, riel, rientjes, rppt, shivankg,
	shy828301, sj, vbabka, weixugc, willy, ying.huang, ziy, dave,
	yuanchu, hyeonggon.yoo

On Thu, 6 Mar 2025 11:15:30 +0530
Bharata B Rao <bharata@amd.com> wrote:

> kpromoted is a kernel daemon that accumulates hot page info
> from different sources and tries to promote pages from slow
> tiers to top tiers. One instance of this thread runs on each
> node that has CPUs.
> 

Firstly, nice work. Much easier to discuss things with an
implementation to look at.

I'm looking at this with my hardware hotness unit "hammer" in hand :)

> Subsystems that generate hot page access info can report that
> to kpromoted via this API:
> 
> int kpromoted_record_access(u64 pfn, int nid, int src,
> 			    unsigned long time)

This perhaps works as an interface for aggregating methods
that produce per access events.  Any hardware counter solution
is going to give you data that is closer to what you used for
the promotion decision.

We might need to aggregate at different levels.  So access
counting promotes to a hot list and we can inject other events
at that level.  The data I have from the CXL HMU is typically
after an epoch (period of time) these N pages were accessed more
than M times.  I can sort of map that to the internal storage
you have.

Would be good to evaluate approximate trackers on top of access
counts. I've no idea if sketches or similar would be efficient
enough (they have a bit of a write amplification problem) but
they may give good answers with much lower storage cost at the
risk of occasionally saying something is hot when it's not.

> 
> @pfn: The PFN of the memory accessed
> @nid: The accessing NUMA node ID
> @src: The temperature source (subsystem) that generated the
>       access info
> @time: The access time in jiffies
> 
> Some temperature sources may not provide the nid from which
> the page was accessed. This is true for sources that use
> page table scanning for PTE Accessed bit. Currently the toptier
> node to which such pages should be promoted to is hard coded.

For those cases (CXL HMU included) maybe we need to
consider how to fill in missing node info with at least a vague chance
of getting a reasonable target for migration.  We can always fall
back to random top tier node, or nearest one to where we are coming
from (on basis we maybe landed in this node based on a fallback
list when the top tier was under memory pressure).

From an interface point of view is that a problem for this layer,
or for the underlying tracking mechanism? (maybe with some helpers)
Also, see later discussion of consistency of hotness tracking and
that the best solution for that differs from that to get
potential targets.  The answer to Is this page consistently hot?" can be
approximated with "Was this page once hot and is it not now cold?"

Access time is something some measurement techniques will only
give you wrt to a measurement was in a window (potentially a long
one if you are looking for consistent hotness over minutes).

> 
> Also, the access time provided some sources may at best be
> considered approximate. This is especially true for hot pages
> detected by PTE A bit scanning.
> 
> kpromoted currently maintains the hot PFN records in hash lists
> hashed by PFN value. Each record stores the following info:
> 
> struct page_hotness_info {
> 	unsigned long pfn;
> 
> 	/* Time when this record was updated last */
> 	unsigned long last_update;
> 
> 	/*
> 	 * Number of times this page was accessed in the
> 	 * current window
I'd express here how that window was defined (I read on
to answer the question I had here at first!)

> 	 */
> 	int frequency;
> 
> 	/* Most recent access time */
> 	unsigned long recency;

Put next to the last_update so all the times are together

> 
> 	/* Most recent access from this node */
> 	int hot_node;

Probably want to relax the most recent part.  I'd guess
the ideal here would be if this is the node accessing it the most
'recently'.

> 
> 	struct hlist_node hnode;
> };
> 
> The way in which a page is categorized as hot enough to be
> promoted is pretty primitive now.

That bit is very hard even if we solve everything else and heavily dependent
on workload access pattern stability and migration impact.  Maybe for
'very hot' pages a fairly short consistency of hotness period is
good enough, but it gets much messier if we care about warm pages.
I guess we solve the 'very hot' first though and maybe avoid the phase
transition from an application starting to when it is at steady state
by considering a wait time for any new userspace process before we
consider moving anything?

Also worth noting that the mechanism that makes sense to check if a
detected hot page is 'stable hot' might use entirely different tracking
approach to that used to find it as a candidate.

Whether that requires passing data between hotness trackers is an
interesting question, or whether there is a natural ordering to trackers.

> diff --git a/mm/kpromoted.c b/mm/kpromoted.c
> new file mode 100644
> index 000000000000..2a8b8495b6b3
> --- /dev/null
> +++ b/mm/kpromoted.c

> +static int page_should_be_promoted(struct page_hotness_info *phi)
> +{
> +	struct page *page = pfn_to_online_page(phi->pfn);
> +	unsigned long now = jiffies;
> +	struct folio *folio;
> +
> +	if (!page || is_zone_device_page(page))
> +		return false;
> +
> +	folio = page_folio(page);
> +	if (!folio_test_lru(folio)) {
> +		count_vm_event(KPROMOTED_MIG_NON_LRU);
> +		return false;
> +	}
> +	if (folio_nid(folio) == phi->hot_node) {
> +		count_vm_event(KPROMOTED_MIG_RIGHT_NODE);
> +		return false;
> +	}
> +
> +	/* If the page was hot a while ago, don't promote */

	/* If the known record of hotness is old, don't promote */ ?

Otherwise this says don't move a page just because it was hot a long time
back. Maybe it is still hot and we just don't have an update yet?

> +	if ((now - phi->last_update) > 2 * msecs_to_jiffies(KPROMOTED_FREQ_WINDOW)) {
> +		count_vm_event(KPROMOTED_MIG_COLD_OLD);
> +		return false;
> +	}
> +
> +	/* If the page hasn't been accessed enough number of times, don't promote */
> +	if (phi->frequency < KPRMOTED_FREQ_THRESHOLD) {
> +		count_vm_event(KPROMOTED_MIG_COLD_NOT_ACCESSED);
> +		return false;
> +	}
> +	return true;
> +}
> +
> +/*
> + * Go thro' page hotness information and migrate pages if required.
> + *
> + * Promoted pages are not longer tracked in the hot list.
> + * Cold pages are pruned from the list as well.

When we say cold here why did we ever see them?

> + *
> + * TODO: Batching could be done
> + */
> +static void kpromoted_migrate(pg_data_t *pgdat)
> +{
> +	int nid = pgdat->node_id;
> +	struct page_hotness_info *phi;
> +	struct hlist_node *tmp;
> +	int nr_bkts = HASH_SIZE(page_hotness_hash);
> +	int bkt;
> +
> +	for (bkt = 0; bkt < nr_bkts; bkt++) {
> +		mutex_lock(&page_hotness_lock[bkt]);
> +		hlist_for_each_entry_safe(phi, tmp, &page_hotness_hash[bkt], hnode) {
> +			if (phi->hot_node != nid)
> +				continue;
> +
> +			if (page_should_be_promoted(phi)) {
> +				count_vm_event(KPROMOTED_MIG_CANDIDATE);
> +				if (!kpromote_page(phi)) {
> +					count_vm_event(KPROMOTED_MIG_PROMOTED);
> +					hlist_del_init(&phi->hnode);
> +					kfree(phi);
> +				}
> +			} else {
> +				/*
> +				 * Not a suitable page or cold page, stop tracking it.
> +				 * TODO: Identify cold pages and drive demotion?

Coldness tracking is really different from hotness as we need to track what we
didn't see to get the really cold pages. Maybe there is some hint to be had
from the exit of this tracker but I'd definitely not try to tackle both ends
with one approach!

> +				 */
> +				count_vm_event(KPROMOTED_MIG_DROPPED);
> +				hlist_del_init(&phi->hnode);
> +				kfree(phi);
> +			}
> +		}
> +		mutex_unlock(&page_hotness_lock[bkt]);
> +	}
> +}

> +/*
> + * Called by subsystems that generate page hotness/access information.
> + *
> + * Records the memory access info for futher action by kpromoted.
> + */
> +int kpromoted_record_access(u64 pfn, int nid, int src, unsigned long now)
> +{

> +	bkt = hash_min(pfn, KPROMOTED_HASH_ORDER);
> +	mutex_lock(&page_hotness_lock[bkt]);
> +	phi = kpromoted_lookup(pfn, bkt, now);
> +	if (!phi) {
> +		ret = PTR_ERR(phi);
> +		goto out;
> +	}
> +
> +	if ((phi->last_update - now) > msecs_to_jiffies(KPROMOTED_FREQ_WINDOW)) {
> +		/* New window */
> +		phi->frequency = 1; /* TODO: Factor in the history */
> +		phi->last_update = now;
> +	} else {
> +		phi->frequency++;
> +	}
> +	phi->recency = now;
> +
> +	/*
> +	 * TODOs:
> +	 * 1. Source nid is hard-coded for some temperature sources

Hard coded rather than unknown? I'm curious, what source has that issue?

> +	 * 2. Take action if hot_node changes - may be a shared page?
> +	 * 3. Maintain node info for every access within the window?

I guess some sort of saturating counter set might not be too bad.

> +	 */
> +	phi->hot_node = (nid == NUMA_NO_NODE) ? 1 : nid;
> +	mutex_unlock(&page_hotness_lock[bkt]);
> +out:
> +	return 0;

why store ret and not return it?

> +}
> +

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 3/4] x86: ibs: In-kernel IBS driver for memory access profiling
  2025-03-06  5:45 ` [RFC PATCH 3/4] x86: ibs: In-kernel IBS driver for memory access profiling Bharata B Rao
@ 2025-03-14 15:38   ` Jonathan Cameron
  0 siblings, 0 replies; 38+ messages in thread
From: Jonathan Cameron @ 2025-03-14 15:38 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-kernel, linux-mm, AneeshKumar.KizhakeVeetil, Hasan.Maruf,
	Michael.Day, akpm, dave.hansen, david, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, k.shutemov, kbusch, kmanaouil.dev,
	leesuyeon0506, leillc, liam.howlett, mgorman, mingo, nadav.amit,
	nphamcs, peterz, raghavendra.kt, riel, rientjes, rppt, shivankg,
	shy828301, sj, vbabka, weixugc, willy, ying.huang, ziy, dave,
	yuanchu, hyeonggon.yoo

On Thu, 6 Mar 2025 11:15:31 +0530
Bharata B Rao <bharata@amd.com> wrote:

> Use IBS (Instruction Based Sampling) feature present
> in AMD processors for memory access tracking. The access
> information obtained from IBS via NMI is fed to kpromoted
> daemon for futher action.
> 
> In addition to many other information related to the memory
> access, IBS provides physical (and virtual) address of the access
> and indicates if the access came from slower tier. Only memory
> accesses originating from slower tiers are further acted upon
> by this driver.
> 
> The samples are initially accumulated in percpu buffers which
> are flushed to kpromoted using irq_work.
> 
> About IBS
> ---------
> IBS can be programmed to provide data about instruction
> execution periodically. This is done by programming a desired
> sample count (number of ops) in a control register. When the
> programmed number of ops are dispatched, a micro-op gets tagged,
> various information about the tagged micro-op's execution is
> populated in IBS execution MSRs and an interrupt is raised.
> While IBS provides a lot of data for each sample, for the
> purpose of  memory access profiling, we are interested in
> linear and physical address of the memory access that reached
> DRAM. Recent AMD processors provide further filtering where
> it is possible to limit the sampling to those ops that had
> an L3 miss which greately reduces the non-useful samples.
> 
> While IBS provides capability to sample instruction fetch
> and execution, only IBS execution sampling is used here
> to collect data about memory accesses that occur during
> the instruction execution.
> 
> More information about IBS is available in Sec 13.3 of
> AMD64 Architecture Programmer's Manual, Volume 2:System
> Programming which is present at:
> https://bugzilla.kernel.org/attachment.cgi?id=288923
> 
> Information about MSRs used for programming IBS can be
> found in Sec 2.1.14.4 of PPR Vol 1 for AMD Family 19h
> Model 11h B1 which is currently present at:
> https://www.amd.com/system/files/TechDocs/55901_0.25.zip
> 
> Signed-off-by: Bharata B Rao <bharata@amd.com>
> ---

Trivial comments inline. I'd love to find a clean way to steal stuff
perf is using though.

>  arch/x86/events/amd/ibs.c        |  11 ++
>  arch/x86/include/asm/ibs.h       |   7 +
>  arch/x86/include/asm/msr-index.h |  16 ++
>  arch/x86/mm/Makefile             |   3 +-
>  arch/x86/mm/ibs.c                | 312 +++++++++++++++++++++++++++++++
>  include/linux/vm_event_item.h    |  17 ++
>  mm/vmstat.c                      |  17 ++
>  7 files changed, 382 insertions(+), 1 deletion(-)
>  create mode 100644 arch/x86/include/asm/ibs.h
>  create mode 100644 arch/x86/mm/ibs.c
> 
> diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c
> index e7a8b8758e08..35497e8c0846 100644
> --- a/arch/x86/events/amd/ibs.c
> +++ b/arch/x86/events/amd/ibs.c
> @@ -13,8 +13,10 @@
>  #include <linux/ptrace.h>
>  #include <linux/syscore_ops.h>
>  #include <linux/sched/clock.h>
> +#include <linux/kpromoted.h>
>  
>  #include <asm/apic.h>
> +#include <asm/ibs.h>
>  
>  #include "../perf_event.h"
>  
> @@ -1539,6 +1541,15 @@ static __init int amd_ibs_init(void)
>  {
>  	u32 caps;
>  
> +	/*
> +	 * TODO: Find a clean way to disable perf IBS so that IBS
> +	 * can be used for memory access profiling.

Yeah.  That bit us in a number of similar cases.  Does anyone
have a good solution for this?  For my hammer (CXL HMU) the
perf case is probably the niche one so I'm less worried, but for
SPE, IBS, PEBS etc we need to figure out how to elegantly back off
on promotion if a user wants to use tracing.

> +	 */
> +	if (arch_hw_access_profiling) {
> +		pr_info("IBS isn't available for perf use\n");
> +		return 0;
> +	}
> +
>  	caps = __get_ibs_caps();
>  	if (!caps)
>  		return -ENODEV;	/* ibs not supported by the cpu */


> +
> +static void clear_APIC_ibs(void)
> +{
> +	int offset;
> +
> +	offset = get_ibs_lvt_offset();

Trivial but I'd flip condition and deal with the error
out of line.  Ah I see this is cut and paste from existing
code I'll stop pointing this stuff out!

	if (offset < 0)
		return;

	setup_APIC_eivt();

> +	if (offset >= 0)
> +		setup_APIC_eilvt(offset, 0, APIC_EILVT_MSG_FIX, 1);
> +}

> +
> +static int __init ibs_access_profiling_init(void)
> +{
> +	if (!boot_cpu_has(X86_FEATURE_IBS)) {
> +		pr_info("IBS capability is unavailable for access profiling\n");

Probably worth saying that is because the chip doesn't have it!
This reads to similar to the perf case above where we just pinched it
for other usecases.

> +		return 0;
> +	}
> +
> +	ibs_s = alloc_percpu_gfp(struct ibs_sample_pcpu, __GFP_ZERO);
> +	if (!ibs_s)
> +		return 0;
> +
> +	INIT_WORK(&ibs_work, ibs_work_handler);
> +	init_irq_work(&ibs_irq_work, ibs_irq_handler);
> +
> +	/* Uses IBS Op sampling */
> +	ibs_config = IBS_OP_CNT_CTL | IBS_OP_ENABLE;
> +	ibs_caps = cpuid_eax(IBS_CPUID_FEATURES);
> +	if (ibs_caps & IBS_CAPS_ZEN4)
> +		ibs_config |= IBS_OP_L3MISSONLY;
> +
> +	register_nmi_handler(NMI_LOCAL, ibs_overflow_handler, 0, "ibs");
> +
> +	cpuhp_setup_state(CPUHP_AP_PERF_X86_AMD_IBS_STARTING,
> +			  "x86/amd/ibs_access_profile:starting",
> +			  x86_amd_ibs_access_profile_startup,
> +			  x86_amd_ibs_access_profile_teardown);
> +
> +	pr_info("IBS setup for memory access profiling\n");
> +	return 0;
> +}
> +
> +arch_initcall(ibs_access_profiling_init);



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/4] Kernel daemon for detecting and promoting hot pages
  2025-03-06  5:45 [RFC PATCH 0/4] Kernel daemon for detecting and promoting hot pages Bharata B Rao
                   ` (3 preceding siblings ...)
  2025-03-06  5:45 ` [RFC PATCH 4/4] x86: ibs: Enable IBS profiling for memory accesses Bharata B Rao
@ 2025-03-16 22:00 ` SeongJae Park
  2025-03-18  6:33   ` Raghavendra K T
  2025-03-18 10:45   ` Bharata B Rao
  2025-03-18  5:28 ` Balbir Singh
  2025-03-25  8:18 ` Bharata B Rao
  6 siblings, 2 replies; 38+ messages in thread
From: SeongJae Park @ 2025-03-16 22:00 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: SeongJae Park, linux-kernel, linux-mm, AneeshKumar.KizhakeVeetil,
	Hasan.Maruf, Jonathan.Cameron, Michael.Day, akpm, dave.hansen,
	david, feng.tang, gourry, hannes, honggyu.kim, hughd, jhubbard,
	k.shutemov, kbusch, kmanaouil.dev, leesuyeon0506, leillc,
	liam.howlett, mgorman, mingo, nadav.amit, nphamcs, peterz,
	raghavendra.kt, riel, rientjes, rppt, shivankg, shy828301,
	vbabka, weixugc, willy, ying.huang, ziy, dave, yuanchu,
	hyeonggon.yoo, Harry Yoo

+ Harry, who was called Hyeonggon before.

Hello,

Thank you very much for sharing this great patchset.

On Thu, 6 Mar 2025 11:15:28 +0530 Bharata B Rao <bharata@amd.com> wrote:

> Hi,
> 
> This is an attempt towards having a single subsystem that accumulates
> hot page information from lower memory tiers and does hot page
> promotion.

That is one of DAMON's goal, too.  DAMON aims to be a kernel subsystem that can
provide access information that accumulated from multiple sources and can be
useful for multiple use cases including profiling and access aware system
operations.

Hot pages information and promotioning those are one of such information and
operations.  SK hynix developed their CXL memory tiering solution[1] using
DAMON.  I also shared auto-tuning based memory tiering solution idea[2] before.
On LSFMMBPF 2025, I may share its prototype implementation and evaluation
results on CXL memory devices that I recentily gained access.

Of course, DAMON is still in the middle of its journey towards the northern
star.  I'm looking for what are really required to DAMON for the goal, what are
[not] available with today's DAMON, and what should be the good future plans.
My LSFMMBPF 2025 topic proposals are for those.

Hence, this patchset is very helpful to me at showing what can be added and
improved on DAMON.  I specifically understand support of access information
sources other than Page tables' accessed bits such as AMD IBS as the main
thing.  I admit the fact that DAMON of today is supporting only page tables'
accessed bit as the primary source of the information.  But DAMON of future
would be different.  Let me share more thoughts below.

> 
> At the heart of this subsystem is a kernel daemon named kpromoted that
> does the following:
> 
> 1. Exposes an API that other subsystems which detect/generate memory
>    access information can use to inform the daemon about memory
>    accesses from lower memory tiers.

DAMON also provides such API, namely, its monitoring operations set layer
interface[3].  Nevertheless, only page tables accessed bit use cases exist
today.  Hence the interface may have have hidden problems at extending for
other sources.

> 2. Maintains the list of hot pages and attempts to promote them to
>    toptiers.

DAMON provides its another half, DAMOS[4], for this kind of usages.

> 
> Currently I have added AMD IBS driver as one source that provides
> page access information as an example. This driver feeds info to
> kpromoted in this RFC patchset. More sources were discussed in a
> similar context here at [1].

I was imagining how I would be able to do this with DAMON via operations set
layer interface.  And I find thee current interface is not very optimized for
AMD IBS like sources that catches the access on the line.  That is, in a way,
we could say AMD IBS like primitives as push-oriented, while page tables'
accessed bits information are pull-oriented.  DAMON operations set layer
interface is easier to be used in pull-oriented case.  I don't think it cannot
be used for push-oriented case, but definitely the interface would better to be
more optimized for the use case.

I'm curious if you also tried doing this by extending DAMON, and if some hidden
problems you found.

> 
> This is just an early attempt to check what it takes to maintain
> a single source of page hotness info and also separate hot page
> detection mechanisms from the promotion mechanism. There are too
> many open ends right now and I have listed a few of them below.
> 
> - The API that is provided to register memory access expects
>   the PFN, NID and time of access at the minimum. This is
>   described more in patch 2/4. This API currently can be called
>   only from contexts that allow sleeping and hence this rules
>   out using it from PTE scanning paths. The API needs to be
>   more flexible with respect to this.
> - Some sources like PTE A bit scanning can't provide the precise
>   time of access or the NID that is accessing the page. The latter
>   has been an open problem to which I haven't come across a good
>   and acceptable solution.

Agree.  PTE A bit scanning could be useful in many cases, but not every case.
There was an RFC patchset[7] that extends DAMON for NID.  I'm planning to do
that again using DAMON operations layer interface.  My current plan is to
implement the prototype using prot_none page faults, and later extend for AMD
IBS like h/w features.  Hopefully I will share a prototype or at least more
detailed idea on LSFMMBPF 2025.

> - The way the hot page information is maintained is pretty
>   primitive right now. Ideally we would like to store hotness info
>   in such a way that it should be easily possible to lookup say N
>   most hot pages.

DAMON provides a feature for lookup of N most hotpages, namely DAMOS quotas'
access pattern based regions prioritization[5].

> - If PTE A bit scanners are considered as hotness sources, we will
>   be bombarded with accesses. Do we want to accomodate all those
>   accesses or just go with hotness info for fixed number of pages
>   (possibly as a ratio of lower tier memory capacity)?

I understand you're saying about memory space overhead.  Correct me if I'm
wrong, please.

Isn't same issue exists for current implementation of the sampling frequency is
high, and/or aggregation window is long?

To me, hence, this looks like not a problem of the information source, but how
to maintain the information.  Current implementation maintains it per page, so
I think the problem is inherent.

DAMON maintains the information in region abstraction that can save multiple
pages with one data structure.  The maximum number of regions can be set by
users, so the space overhead can be controlled.

> - Undoubtedly the mechanism to classify a page as hot and subsequent
>   promotion needs to be more sophisticated than what I have right now.

DAMON provides aim-based DAMOS aggressiveness auto-tuning[6] and monitoring
intervals auto-tuning[8] for this purpose.

> 
> This is just an early RFC posted now to ignite some discussion
> in the context of LSFMM [2].

This is really helpful.  Appreciate, and looking forward to more discussions on
LSFMM and mailing lists.

> 
> I am also working with Raghu to integrate his kmmdscan [3] as the
> hotness source and use kpromoted for migration.

Raghu also mentioned he would try to take a time to look into DAMON if there is
anything that he could reuse for the purpose.  I'm curious if he was able to
find something there.

> 
> Also, I had posted the IBS driver ealier as an alternative to
> hint faults based NUMA Balancing [4]. However here I am using
> it as generic page hotness source.

This will also be very helpful for understanding how IBS can be used.
Appreciate!

> 
> [1] https://lore.kernel.org/linux-mm/de31971e-98fc-4baf-8f4f-09d153902e2e@amd.com/
> [2] https://lore.kernel.org/linux-mm/20250123105721.424117-1-raghavendra.kt@amd.com/
> [3] https://lore.kernel.org/all/20241201153818.2633616-1-raghavendra.kt@amd.com/
> [3] https://lore.kernel.org/lkml/20230208073533.715-2-bharata@amd.com/

[1] https://github.com/skhynix/hmsdk/wiki/Capacity-Expansion
[2] https://lore.kernel.org/all/20231112195602.61525-1-sj@kernel.org/
[3] https://origin.kernel.org/doc/html/latest/mm/damon/design.html#operations-set-layer
[4] https://origin.kernel.org/doc/html/latest/mm/damon/design.html#operation-schemes
[5] https://origin.kernel.org/doc/html/latest/mm/damon/design.html#prioritization
[6] https://origin.kernel.org/doc/html/latest/mm/damon/design.html#aim-oriented-feedback-driven-auto-tuning
[7] https://lore.kernel.org/linux-mm/cover.1645024354.git.xhao@linux.alibaba.com/
[8] https://origin.kernel.org/doc/html/next/mm/damon/design.html#monitoring-intervals-auto-tuning

Thank,
SJ

> 
> Regards,
> Bharata.
> 
> Bharata B Rao (4):
>   mm: migrate: Allow misplaced migration without VMA too
>   mm: kpromoted: Hot page info collection and promotion daemon
>   x86: ibs: In-kernel IBS driver for memory access profiling
>   x86: ibs: Enable IBS profiling for memory accesses
> 
>  arch/x86/events/amd/ibs.c           |  11 +
>  arch/x86/include/asm/entry-common.h |   3 +
>  arch/x86/include/asm/hardirq.h      |   2 +
>  arch/x86/include/asm/ibs.h          |   9 +
>  arch/x86/include/asm/msr-index.h    |  16 ++
>  arch/x86/mm/Makefile                |   3 +-
>  arch/x86/mm/ibs.c                   | 344 ++++++++++++++++++++++++++++
>  include/linux/kpromoted.h           |  54 +++++
>  include/linux/mmzone.h              |   4 +
>  include/linux/vm_event_item.h       |  30 +++
>  mm/Kconfig                          |   7 +
>  mm/Makefile                         |   1 +
>  mm/kpromoted.c                      | 305 ++++++++++++++++++++++++
>  mm/migrate.c                        |   5 +-
>  mm/mm_init.c                        |  10 +
>  mm/vmstat.c                         |  30 +++
>  16 files changed, 831 insertions(+), 3 deletions(-)
>  create mode 100644 arch/x86/include/asm/ibs.h
>  create mode 100644 arch/x86/mm/ibs.c
>  create mode 100644 include/linux/kpromoted.h
>  create mode 100644 mm/kpromoted.c
> 
> -- 
> 2.34.1

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 2/4] mm: kpromoted: Hot page info collection and promotion daemon
  2025-03-13 16:44   ` Davidlohr Bueso
@ 2025-03-17  3:39     ` Bharata B Rao
  2025-03-17 15:05       ` Gregory Price
  0 siblings, 1 reply; 38+ messages in thread
From: Bharata B Rao @ 2025-03-17  3:39 UTC (permalink / raw)
  To: linux-kernel, linux-mm, AneeshKumar.KizhakeVeetil, Hasan.Maruf,
	Jonathan.Cameron, Michael.Day, akpm, dave.hansen, david,
	feng.tang, gourry, hannes, honggyu.kim, hughd, jhubbard,
	k.shutemov, kbusch, kmanaouil.dev, leesuyeon0506, leillc,
	liam.howlett, mgorman, mingo, nadav.amit, nphamcs, peterz,
	raghavendra.kt, riel, rientjes, rppt, shivankg, shy828301, sj,
	vbabka, weixugc, willy, ying.huang, ziy, yuanchu

On 13-Mar-25 10:14 PM, Davidlohr Bueso wrote:
> On Thu, 06 Mar 2025, Bharata B Rao wrote:
> 
>> +static int page_should_be_promoted(struct page_hotness_info *phi)
>> +{
>> +    struct page *page = pfn_to_online_page(phi->pfn);
>> +    unsigned long now = jiffies;
>> +    struct folio *folio;
>> +
>> +    if (!page || is_zone_device_page(page))
>> +        return false;
>> +
>> +    folio = page_folio(page);
>> +    if (!folio_test_lru(folio)) {
>> +        count_vm_event(KPROMOTED_MIG_NON_LRU);
>> +        return false;
>> +    }
>> +    if (folio_nid(folio) == phi->hot_node) {
>> +        count_vm_event(KPROMOTED_MIG_RIGHT_NODE);
>> +        return false;
>> +    }
> 
> How about using the LRU age itself:

Sounds like a good check for page hotness.

> 
> if (folio_test_active())
>     return true;

But the numbers I obtained with this check added, didn't really hit this 
condition all that much. I was running a multi-threaded application that 
allocates enough memory such that the allocation spills over from DRAM 
node to the CXL node. Threads keep touching the memory pages in random 
order.

kpromoted_recorded_accesses 960620 /* Number of recorded accesses */
kpromoted_recorded_hwhints 960620  /* Nr accesses via HW hints, IBS in 
this case */
kpromoted_recorded_pgtscans 0
kpromoted_record_toptier 638006 /* Nr toptier accesses */
kpromoted_record_added 321234 /* Nr (CXL) accesses that are tracked */
kpromoted_record_exists 1380
kpromoted_mig_right_node 0
kpromoted_mig_non_lru 226
kpromoted_mig_lru_active 47 /* Number of accesses considered for 
promotion as determined by folio_test_active() check */
kpromoted_mig_cold_old 0
kpromoted_mig_cold_not_accessed 1373
kpromoted_mig_candidate 319635
kpromoted_mig_promoted 319635
kpromoted_mig_dropped 1599

Need to check why is this the case.

> 
>> +
>> +    /* If the page was hot a while ago, don't promote */
>> +    if ((now - phi->last_update) > 2 * 
>> msecs_to_jiffies(KPROMOTED_FREQ_WINDOW)) {
>> +        count_vm_event(KPROMOTED_MIG_COLD_OLD);
>> +        return false;
>> +    }
>> +
>> +    /* If the page hasn't been accessed enough number of times, don't 
>> promote */
>> +    if (phi->frequency < KPRMOTED_FREQ_THRESHOLD) {
>> +        count_vm_event(KPROMOTED_MIG_COLD_NOT_ACCESSED);
>> +        return false;
>> +    }
>> +    return true;
>> +}
> 
> ...
> 
>> +static int kpromoted(void *p)
>> +{
>> +    pg_data_t *pgdat = (pg_data_t *)p;
>> +    struct task_struct *tsk = current;
>> +    long timeout = msecs_to_jiffies(KPROMOTE_DELAY);
>> +
>> +    const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
>> +
>> +    if (!cpumask_empty(cpumask))
>> +        set_cpus_allowed_ptr(tsk, cpumask);
> 
> Explicit cpumasks are not needed if you use kthread_create_on_node().

Thanks, will incorporate.

Regards,
Bharata.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 2/4] mm: kpromoted: Hot page info collection and promotion daemon
  2025-03-13 20:36   ` Davidlohr Bueso
@ 2025-03-17  3:49     ` Bharata B Rao
  0 siblings, 0 replies; 38+ messages in thread
From: Bharata B Rao @ 2025-03-17  3:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm, AneeshKumar.KizhakeVeetil, Hasan.Maruf,
	Jonathan.Cameron, Michael.Day, akpm, dave.hansen, david,
	feng.tang, gourry, hannes, honggyu.kim, hughd, jhubbard,
	k.shutemov, kbusch, kmanaouil.dev, leesuyeon0506, leillc,
	liam.howlett, mgorman, mingo, nadav.amit, nphamcs, peterz,
	raghavendra.kt, riel, rientjes, rppt, shivankg, shy828301, sj,
	vbabka, weixugc, willy, ying.huang, ziy, yuanchu

On 14-Mar-25 2:06 AM, Davidlohr Bueso wrote:
> On Thu, 06 Mar 2025, Bharata B Rao wrote:
> 
>> +/*
>> + * Go thro' page hotness information and migrate pages if required.
>> + *
>> + * Promoted pages are not longer tracked in the hot list.
>> + * Cold pages are pruned from the list as well.
>> + *
>> + * TODO: Batching could be done
>> + */
>> +static void kpromoted_migrate(pg_data_t *pgdat)
>> +{
>> +    int nid = pgdat->node_id;
>> +    struct page_hotness_info *phi;
>> +    struct hlist_node *tmp;
>> +    int nr_bkts = HASH_SIZE(page_hotness_hash);
>> +    int bkt;
>> +
>> +    for (bkt = 0; bkt < nr_bkts; bkt++) {
>> +        mutex_lock(&page_hotness_lock[bkt]);
>> +        hlist_for_each_entry_safe(phi, tmp, &page_hotness_hash[bkt], 
>> hnode) {
>> +            if (phi->hot_node != nid)
>> +                continue;
>> +
>> +            if (page_should_be_promoted(phi)) {
>> +                count_vm_event(KPROMOTED_MIG_CANDIDATE);
>> +                if (!kpromote_page(phi)) {
>> +                    count_vm_event(KPROMOTED_MIG_PROMOTED);
>> +                    hlist_del_init(&phi->hnode);
>> +                    kfree(phi);
>> +                }
>> +            } else {
>> +                /*
>> +                 * Not a suitable page or cold page, stop tracking it.
>> +                 * TODO: Identify cold pages and drive demotion?
>> +                 */
> 
> I don't think kpromoted should drive demotion at all. No one is 
> complaining about migrate
> in lieu of discard, and there is also proactive reclaim which users can 
> trigger. All the
> in-kernel problems are wrt promotion. The simpler any of these kthreads 
> are the better.

I was testing on default kernel with NUMA balancing mode 2.

The multi-threaded application allocates memory on DRAM and the 
allocation spills over to CXL node. The threads keep accessing allocated 
memory pages in random order.

pgpromote_success 6
pgpromote_candidate 745387
pgdemote_kswapd 51085
pgdemote_direct 10481
pgdemote_khugepaged 0
numa_pte_updates 27249625
numa_huge_pte_updates 0
numa_hint_faults 9660745
numa_hint_faults_local 0
numa_pages_migrated 6
numa_node_full 745438
pgmigrate_success 2225458
pgmigrate_fail 1187349

I hardly see any promotion happening.

In order to check the number of times the toptier node was found to be 
full when attempting to promote, I added numa_node_full counter like below:

diff --git a/mm/migrate.c b/mm/migrate.c
index fb19a18892c8..4d049d896589 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2673,6 +2673,7 @@ int migrate_misplaced_folio_prepare(struct folio 
*folio,
         if (!migrate_balanced_pgdat(pgdat, nr_pages)) {
                 int z;

+               count_vm_event(NUMA_NODE_FULL);
                 if (!(sysctl_numa_balancing_mode & 
NUMA_BALANCING_MEMORY_TIERING))
                         return -EAGAIN;
                 for (z = pgdat->nr_zones - 1; z >= 0; z--) {


As seen above, numa_node_full 745438. This matches pgpromote_candidate 
numbers.

I do see counters reporting kswapd-driven and direct demotion as well 
but does this mean that demotion isn't happening fast enough to cope up 
with promotion requirement in this high toptier memory pressure situation?

Regards,
Bharata.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 2/4] mm: kpromoted: Hot page info collection and promotion daemon
  2025-03-17  3:39     ` Bharata B Rao
@ 2025-03-17 15:05       ` Gregory Price
  2025-03-17 16:22         ` Bharata B Rao
  0 siblings, 1 reply; 38+ messages in thread
From: Gregory Price @ 2025-03-17 15:05 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-kernel, linux-mm, AneeshKumar.KizhakeVeetil, Hasan.Maruf,
	Jonathan.Cameron, Michael.Day, akpm, dave.hansen, david,
	feng.tang, hannes, honggyu.kim, hughd, jhubbard, k.shutemov,
	kbusch, kmanaouil.dev, leesuyeon0506, leillc, liam.howlett,
	mgorman, mingo, nadav.amit, nphamcs, peterz, raghavendra.kt,
	riel, rientjes, rppt, shivankg, shy828301, sj, vbabka, weixugc,
	willy, ying.huang, ziy, yuanchu

On Mon, Mar 17, 2025 at 09:09:18AM +0530, Bharata B Rao wrote:
> On 13-Mar-25 10:14 PM, Davidlohr Bueso wrote:
> > On Thu, 06 Mar 2025, Bharata B Rao wrote:
> > 
> > > +static int page_should_be_promoted(struct page_hotness_info *phi)
> > > +{
> > > +    struct page *page = pfn_to_online_page(phi->pfn);
> > > +    unsigned long now = jiffies;
> > > +    struct folio *folio;
> > > +
> > > +    if (!page || is_zone_device_page(page))
> > > +        return false;
> > > +
> > > +    folio = page_folio(page);
> > > +    if (!folio_test_lru(folio)) {
> > > +        count_vm_event(KPROMOTED_MIG_NON_LRU);
> > > +        return false;
> > > +    }
> > > +    if (folio_nid(folio) == phi->hot_node) {
> > > +        count_vm_event(KPROMOTED_MIG_RIGHT_NODE);
> > > +        return false;
> > > +    }
> > 
> > How about using the LRU age itself:
> 
> Sounds like a good check for page hotness.
> 
> > 
> > if (folio_test_active())
> >     return true;
> 
> But the numbers I obtained with this check added, didn't really hit this
> condition all that much. I was running a multi-threaded application that
> allocates enough memory such that the allocation spills over from DRAM node
> to the CXL node. Threads keep touching the memory pages in random order.
> 

Is demotion enabled by any chance?

i.e. are you sure it's actually allocating from CXL and not demoting
cold stuff to CXL?

> kpromoted_recorded_accesses 960620 /* Number of recorded accesses */
> kpromoted_recorded_hwhints 960620  /* Nr accesses via HW hints, IBS in this
> case */
> kpromoted_recorded_pgtscans 0
> kpromoted_record_toptier 638006 /* Nr toptier accesses */
> kpromoted_record_added 321234 /* Nr (CXL) accesses that are tracked */
> kpromoted_record_exists 1380
> kpromoted_mig_right_node 0
> kpromoted_mig_non_lru 226
> kpromoted_mig_lru_active 47 /* Number of accesses considered for promotion
> as determined by folio_test_active() check */
> kpromoted_mig_cold_old 0
> kpromoted_mig_cold_not_accessed 1373
> kpromoted_mig_candidate 319635
> kpromoted_mig_promoted 319635
> kpromoted_mig_dropped 1599
> 
> Need to check why is this the case.
> 
> > 
> > > +
> > > +    /* If the page was hot a while ago, don't promote */
> > > +    if ((now - phi->last_update) > 2 *
> > > msecs_to_jiffies(KPROMOTED_FREQ_WINDOW)) {
> > > +        count_vm_event(KPROMOTED_MIG_COLD_OLD);
> > > +        return false;
> > > +    }
> > > +
> > > +    /* If the page hasn't been accessed enough number of times,
> > > don't promote */
> > > +    if (phi->frequency < KPRMOTED_FREQ_THRESHOLD) {
> > > +        count_vm_event(KPROMOTED_MIG_COLD_NOT_ACCESSED);
> > > +        return false;
> > > +    }
> > > +    return true;
> > > +}
> > 
> > ...
> > 
> > > +static int kpromoted(void *p)
> > > +{
> > > +    pg_data_t *pgdat = (pg_data_t *)p;
> > > +    struct task_struct *tsk = current;
> > > +    long timeout = msecs_to_jiffies(KPROMOTE_DELAY);
> > > +
> > > +    const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> > > +
> > > +    if (!cpumask_empty(cpumask))
> > > +        set_cpus_allowed_ptr(tsk, cpumask);
> > 
> > Explicit cpumasks are not needed if you use kthread_create_on_node().
> 
> Thanks, will incorporate.
> 
> Regards,
> Bharata.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 2/4] mm: kpromoted: Hot page info collection and promotion daemon
  2025-03-17 15:05       ` Gregory Price
@ 2025-03-17 16:22         ` Bharata B Rao
  2025-03-17 18:24           ` Gregory Price
  0 siblings, 1 reply; 38+ messages in thread
From: Bharata B Rao @ 2025-03-17 16:22 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-kernel, linux-mm, AneeshKumar.KizhakeVeetil, Hasan.Maruf,
	Jonathan.Cameron, Michael.Day, akpm, dave.hansen, david,
	feng.tang, hannes, honggyu.kim, hughd, jhubbard, k.shutemov,
	kbusch, kmanaouil.dev, leesuyeon0506, leillc, liam.howlett,
	mgorman, mingo, nadav.amit, nphamcs, peterz, raghavendra.kt,
	riel, rientjes, rppt, shivankg, shy828301, sj, vbabka, weixugc,
	willy, ying.huang, ziy, yuanchu

On 17-Mar-25 8:35 PM, Gregory Price wrote:
> On Mon, Mar 17, 2025 at 09:09:18AM +0530, Bharata B Rao wrote:
>> On 13-Mar-25 10:14 PM, Davidlohr Bueso wrote:
>>> On Thu, 06 Mar 2025, Bharata B Rao wrote:
>>>
>>>> +static int page_should_be_promoted(struct page_hotness_info *phi)
>>>> +{
>>>> +    struct page *page = pfn_to_online_page(phi->pfn);
>>>> +    unsigned long now = jiffies;
>>>> +    struct folio *folio;
>>>> +
>>>> +    if (!page || is_zone_device_page(page))
>>>> +        return false;
>>>> +
>>>> +    folio = page_folio(page);
>>>> +    if (!folio_test_lru(folio)) {
>>>> +        count_vm_event(KPROMOTED_MIG_NON_LRU);
>>>> +        return false;
>>>> +    }
>>>> +    if (folio_nid(folio) == phi->hot_node) {
>>>> +        count_vm_event(KPROMOTED_MIG_RIGHT_NODE);
>>>> +        return false;
>>>> +    }
>>>
>>> How about using the LRU age itself:
>>
>> Sounds like a good check for page hotness.
>>
>>>
>>> if (folio_test_active())
>>>      return true;
>>
>> But the numbers I obtained with this check added, didn't really hit this
>> condition all that much. I was running a multi-threaded application that
>> allocates enough memory such that the allocation spills over from DRAM node
>> to the CXL node. Threads keep touching the memory pages in random order.
>>
> 
> Is demotion enabled by any chance?

Yes, I thought enabling demotion is required to create enough room in 
the toptier to handle promotion.

> 
> i.e. are you sure it's actually allocating from CXL and not demoting
> cold stuff to CXL?

But then I realized that spill over was caused by demotion rather than 
initial allocation even when I used MPOL_BIND | MPOL_F_NUMA_BALANCING 
policy with both toptier and CXL node in the nodemask.

> 
>> kpromoted_recorded_accesses 960620 /* Number of recorded accesses */
>> kpromoted_recorded_hwhints 960620  /* Nr accesses via HW hints, IBS in this
>> case */
>> kpromoted_recorded_pgtscans 0
>> kpromoted_record_toptier 638006 /* Nr toptier accesses */
>> kpromoted_record_added 321234 /* Nr (CXL) accesses that are tracked */
>> kpromoted_record_exists 1380
>> kpromoted_mig_right_node 0
>> kpromoted_mig_non_lru 226
>> kpromoted_mig_lru_active 47 /* Number of accesses considered for promotion
>> as determined by folio_test_active() check */

However disabling demotion has no impact on this number (and hence the 
folio_test_active() check)

Regards,
Bharata.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 2/4] mm: kpromoted: Hot page info collection and promotion daemon
  2025-03-17 16:22         ` Bharata B Rao
@ 2025-03-17 18:24           ` Gregory Price
  0 siblings, 0 replies; 38+ messages in thread
From: Gregory Price @ 2025-03-17 18:24 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-kernel, linux-mm, AneeshKumar.KizhakeVeetil, Hasan.Maruf,
	Jonathan.Cameron, Michael.Day, akpm, dave.hansen, david,
	feng.tang, hannes, honggyu.kim, hughd, jhubbard, k.shutemov,
	kbusch, kmanaouil.dev, leesuyeon0506, leillc, liam.howlett,
	mgorman, mingo, nadav.amit, nphamcs, peterz, raghavendra.kt,
	riel, rientjes, rppt, shivankg, shy828301, sj, vbabka, weixugc,
	willy, ying.huang, ziy, yuanchu

On Mon, Mar 17, 2025 at 09:52:29PM +0530, Bharata B Rao wrote:
> > 
> > > kpromoted_recorded_accesses 960620 /* Number of recorded accesses */
> > > kpromoted_recorded_hwhints 960620  /* Nr accesses via HW hints, IBS in this
> > > case */
> > > kpromoted_recorded_pgtscans 0
> > > kpromoted_record_toptier 638006 /* Nr toptier accesses */
> > > kpromoted_record_added 321234 /* Nr (CXL) accesses that are tracked */
> > > kpromoted_record_exists 1380
> > > kpromoted_mig_right_node 0
> > > kpromoted_mig_non_lru 226
> > > kpromoted_mig_lru_active 47 /* Number of accesses considered for promotion
> > > as determined by folio_test_active() check */
> 
> However disabling demotion has no impact on this number (and hence the
> folio_test_active() check)
>

I've been mulling over what's likely to occur when the Low but not Min
watermark is hit and reclaim is invoked but without demotion enabled.

I'm wonder if kswapd pushes things like r/o pagecache out, only to have
them faulted back into CXL later, while new allocations stick on the
main memory.

You might try MPOL_PREFERRED with CXL node as the target instead of bind
w/ the local node to at least make sure the system is actually
identifying hotness correctly.

~Gregory


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 2/4] mm: kpromoted: Hot page info collection and promotion daemon
  2025-03-14 15:28   ` Jonathan Cameron
@ 2025-03-18  4:09     ` Bharata B Rao
  2025-03-18 14:17       ` Jonathan Cameron
  0 siblings, 1 reply; 38+ messages in thread
From: Bharata B Rao @ 2025-03-18  4:09 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-kernel, linux-mm, AneeshKumar.KizhakeVeetil, Hasan.Maruf,
	Michael.Day, akpm, dave.hansen, david, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, k.shutemov, kbusch, kmanaouil.dev,
	leesuyeon0506, leillc, liam.howlett, mgorman, mingo, nadav.amit,
	nphamcs, peterz, raghavendra.kt, riel, rientjes, rppt, shivankg,
	shy828301, sj, vbabka, weixugc, willy, ying.huang, ziy, dave,
	yuanchu, hyeonggon.yoo

On 14-Mar-25 8:58 PM, Jonathan Cameron wrote:
> On Thu, 6 Mar 2025 11:15:30 +0530
> Bharata B Rao <bharata@amd.com> wrote:
> 
>> Subsystems that generate hot page access info can report that
>> to kpromoted via this API:
>>
>> int kpromoted_record_access(u64 pfn, int nid, int src,
>> 			    unsigned long time)
> 
> This perhaps works as an interface for aggregating methods
> that produce per access events.  Any hardware counter solution
> is going to give you data that is closer to what you used for
> the promotion decision.

Right.

> 
> We might need to aggregate at different levels.  So access
> counting promotes to a hot list and we can inject other events
> at that level.  The data I have from the CXL HMU is typically
> after an epoch (period of time) these N pages were accessed more
> than M times.  I can sort of map that to the internal storage
> you have.

Even for IBS source, I am aggregating data in per-cpu buffers before 
presenting them one by one to kpromoted. Guess CXL HMU aggregated data 
could be presented in a similar manner.

> 
> Would be good to evaluate approximate trackers on top of access
> counts. I've no idea if sketches or similar would be efficient
> enough (they have a bit of a write amplification problem) but
> they may give good answers with much lower storage cost at the
> risk of occasionally saying something is hot when it's not.

Could me point me to some information about sketches??

> 
>>
>> @pfn: The PFN of the memory accessed
>> @nid: The accessing NUMA node ID
>> @src: The temperature source (subsystem) that generated the
>>        access info
>> @time: The access time in jiffies
>>
>> Some temperature sources may not provide the nid from which
>> the page was accessed. This is true for sources that use
>> page table scanning for PTE Accessed bit. Currently the toptier
>> node to which such pages should be promoted to is hard coded.
> 
> For those cases (CXL HMU included) maybe we need to
> consider how to fill in missing node info with at least a vague chance
> of getting a reasonable target for migration.  We can always fall
> back to random top tier node, or nearest one to where we are coming
> from (on basis we maybe landed in this node based on a fallback
> list when the top tier was under memory pressure).

Yes. For A-bit scanners, Raghu has devised a scheme to obtain the best 
possible list of target nodes for promotion. He should be sharing more 
about it soon.

> 
>  From an interface point of view is that a problem for this layer,
> or for the underlying tracking mechanism? (maybe with some helpers)

It is not a problem from this interface point of view as this interface 
expects a nid(or default value) and would use that for promotion. It is 
up to the underlying tracking mechanism to provide the most appropriate 
target nid.

> Also, see later discussion of consistency of hotness tracking and
> that the best solution for that differs from that to get
> potential targets.  The answer to Is this page consistently hot?" can be
> approximated with "Was this page once hot and is it not now cold?"
> 
> Access time is something some measurement techniques will only
> give you wrt to a measurement was in a window (potentially a long
> one if you are looking for consistent hotness over minutes).
> 
>>
>> Also, the access time provided some sources may at best be
>> considered approximate. This is especially true for hot pages
>> detected by PTE A bit scanning.
>>
>> kpromoted currently maintains the hot PFN records in hash lists
>> hashed by PFN value. Each record stores the following info:
>>
>> struct page_hotness_info {
>> 	unsigned long pfn;
>>
>> 	/* Time when this record was updated last */
>> 	unsigned long last_update;
>>
>> 	/*
>> 	 * Number of times this page was accessed in the
>> 	 * current window
> I'd express here how that window was defined (I read on
> to answer the question I had here at first!)

Currently the number of accesses that occur within an observation window 
of 5s are considered for hotness calculation and access count is reset 
when the window elapses. This needs to factor in history etc.

> 
>> 	 */
>> 	int frequency;
>>
>> 	/* Most recent access time */
>> 	unsigned long recency;
> 
> Put next to the last_update so all the times are together

Sure.

> 
>>
>> 	/* Most recent access from this node */
>> 	int hot_node;
> 
> Probably want to relax the most recent part.  I'd guess
> the ideal here would be if this is the node accessing it the most
> 'recently'.

You mean the node that did most number of accesses in the given 
observation window and not necessarily the last (or most recent) 
accessed node.

> 
>>
>> 	struct hlist_node hnode;
>> };
>>
>> The way in which a page is categorized as hot enough to be
>> promoted is pretty primitive now.
> 
> That bit is very hard even if we solve everything else and heavily dependent
> on workload access pattern stability and migration impact.  Maybe for
> 'very hot' pages a fairly short consistency of hotness period is
> good enough, but it gets much messier if we care about warm pages.
> I guess we solve the 'very hot' first though and maybe avoid the phase
> transition from an application starting to when it is at steady state
> by considering a wait time for any new userspace process before we
> consider moving anything?
> 
> Also worth noting that the mechanism that makes sense to check if a
> detected hot page is 'stable hot' might use entirely different tracking
> approach to that used to find it as a candidate.
> 
> Whether that requires passing data between hotness trackers is an
> interesting question, or whether there is a natural ordering to trackers.

I was envisioning that different hotness trackers would reinforce the 
page hotness by reporting the same to kpromoted and there would be no 
need to again pass data between hotness trackers.

> 
> 
> 
>> diff --git a/mm/kpromoted.c b/mm/kpromoted.c
>> new file mode 100644
>> index 000000000000..2a8b8495b6b3
>> --- /dev/null
>> +++ b/mm/kpromoted.c
> 
>> +static int page_should_be_promoted(struct page_hotness_info *phi)
>> +{
>> +	struct page *page = pfn_to_online_page(phi->pfn);
>> +	unsigned long now = jiffies;
>> +	struct folio *folio;
>> +
>> +	if (!page || is_zone_device_page(page))
>> +		return false;
>> +
>> +	folio = page_folio(page);
>> +	if (!folio_test_lru(folio)) {
>> +		count_vm_event(KPROMOTED_MIG_NON_LRU);
>> +		return false;
>> +	}
>> +	if (folio_nid(folio) == phi->hot_node) {
>> +		count_vm_event(KPROMOTED_MIG_RIGHT_NODE);
>> +		return false;
>> +	}
>> +
>> +	/* If the page was hot a while ago, don't promote */
> 
> 	/* If the known record of hotness is old, don't promote */ ?
> 
> Otherwise this says don't move a page just because it was hot a long time
> back. Maybe it is still hot and we just don't have an update yet?

Agreed.

> 
>> +	if ((now - phi->last_update) > 2 * msecs_to_jiffies(KPROMOTED_FREQ_WINDOW)) {
>> +		count_vm_event(KPROMOTED_MIG_COLD_OLD);
>> +		return false;
>> +	}
>> +
>> +	/* If the page hasn't been accessed enough number of times, don't promote */
>> +	if (phi->frequency < KPRMOTED_FREQ_THRESHOLD) {
>> +		count_vm_event(KPROMOTED_MIG_COLD_NOT_ACCESSED);
>> +		return false;
>> +	}
>> +	return true;
>> +}
>> +
>> +/*
>> + * Go thro' page hotness information and migrate pages if required.
>> + *
>> + * Promoted pages are not longer tracked in the hot list.
>> + * Cold pages are pruned from the list as well.
> 
> When we say cold here why did we ever see them?

Those hot pages that couldn't be migrated for different reasons are no 
longer tracked by kpromoted and I called such pages as "cold". Guess not 
the right nomenclature to represent them.

> 
>> + *
>> + * TODO: Batching could be done
>> + */
>> +static void kpromoted_migrate(pg_data_t *pgdat)
>> +{
>> +	int nid = pgdat->node_id;
>> +	struct page_hotness_info *phi;
>> +	struct hlist_node *tmp;
>> +	int nr_bkts = HASH_SIZE(page_hotness_hash);
>> +	int bkt;
>> +
>> +	for (bkt = 0; bkt < nr_bkts; bkt++) {
>> +		mutex_lock(&page_hotness_lock[bkt]);
>> +		hlist_for_each_entry_safe(phi, tmp, &page_hotness_hash[bkt], hnode) {
>> +			if (phi->hot_node != nid)
>> +				continue;
>> +
>> +			if (page_should_be_promoted(phi)) {
>> +				count_vm_event(KPROMOTED_MIG_CANDIDATE);
>> +				if (!kpromote_page(phi)) {
>> +					count_vm_event(KPROMOTED_MIG_PROMOTED);
>> +					hlist_del_init(&phi->hnode);
>> +					kfree(phi);
>> +				}
>> +			} else {
>> +				/*
>> +				 * Not a suitable page or cold page, stop tracking it.
>> +				 * TODO: Identify cold pages and drive demotion?
> 
> Coldness tracking is really different from hotness as we need to track what we
> didn't see to get the really cold pages. Maybe there is some hint to be had
> from the exit of this tracker but I'd definitely not try to tackle both ends
> with one approach!

Okay.

> 
>> +				 */
>> +				count_vm_event(KPROMOTED_MIG_DROPPED);
>> +				hlist_del_init(&phi->hnode);
>> +				kfree(phi);
>> +			}
>> +		}
>> +		mutex_unlock(&page_hotness_lock[bkt]);
>> +	}
>> +}
> 
> 
>> +/*
>> + * Called by subsystems that generate page hotness/access information.
>> + *
>> + * Records the memory access info for futher action by kpromoted.
>> + */
>> +int kpromoted_record_access(u64 pfn, int nid, int src, unsigned long now)
>> +{
> 
>> +	bkt = hash_min(pfn, KPROMOTED_HASH_ORDER);
>> +	mutex_lock(&page_hotness_lock[bkt]);
>> +	phi = kpromoted_lookup(pfn, bkt, now);
>> +	if (!phi) {
>> +		ret = PTR_ERR(phi);
>> +		goto out;
>> +	}
>> +
>> +	if ((phi->last_update - now) > msecs_to_jiffies(KPROMOTED_FREQ_WINDOW)) {
>> +		/* New window */
>> +		phi->frequency = 1; /* TODO: Factor in the history */
>> +		phi->last_update = now;
>> +	} else {
>> +		phi->frequency++;
>> +	}
>> +	phi->recency = now;
>> +
>> +	/*
>> +	 * TODOs:
>> +	 * 1. Source nid is hard-coded for some temperature sources
> 
> Hard coded rather than unknown? I'm curious, what source has that issue?

I meant that source didn't provide a nid and hence kpromoted ended up 
promoting to a fixed (hard-coded for now) toptier node.

> 
>> +	 * 2. Take action if hot_node changes - may be a shared page?
>> +	 * 3. Maintain node info for every access within the window?
> 
> I guess some sort of saturating counter set might not be too bad.

Yes.

> 
>> +	 */
>> +	phi->hot_node = (nid == NUMA_NO_NODE) ? 1 : nid;
>> +	mutex_unlock(&page_hotness_lock[bkt]);
>> +out:
>> +	return 0;
> 
> why store ret and not return it?

Will fix.

Thanks for your review!

Regards,
Bharata.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/4] Kernel daemon for detecting and promoting hot pages
  2025-03-06  5:45 [RFC PATCH 0/4] Kernel daemon for detecting and promoting hot pages Bharata B Rao
                   ` (4 preceding siblings ...)
  2025-03-16 22:00 ` [RFC PATCH 0/4] Kernel daemon for detecting and promoting hot pages SeongJae Park
@ 2025-03-18  5:28 ` Balbir Singh
  2025-03-20  9:07   ` Bharata B Rao
  2025-03-25  8:18 ` Bharata B Rao
  6 siblings, 1 reply; 38+ messages in thread
From: Balbir Singh @ 2025-03-18  5:28 UTC (permalink / raw)
  To: Bharata B Rao, linux-kernel, linux-mm
  Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Jonathan.Cameron,
	Michael.Day, akpm, dave.hansen, david, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, k.shutemov, kbusch, kmanaouil.dev,
	leesuyeon0506, leillc, liam.howlett, mgorman, mingo, nadav.amit,
	nphamcs, peterz, raghavendra.kt, riel, rientjes, rppt, shivankg,
	shy828301, sj, vbabka, weixugc, willy, ying.huang, ziy, dave,
	yuanchu, hyeonggon.yoo

On 3/6/25 16:45, Bharata B Rao wrote:
> Hi,
> 
> This is an attempt towards having a single subsystem that accumulates
> hot page information from lower memory tiers and does hot page
> promotion.
> 
> At the heart of this subsystem is a kernel daemon named kpromoted that
> does the following:
> 
> 1. Exposes an API that other subsystems which detect/generate memory
>    access information can use to inform the daemon about memory
>    accesses from lower memory tiers.
> 2. Maintains the list of hot pages and attempts to promote them to
>    toptiers.
> 
> Currently I have added AMD IBS driver as one source that provides
> page access information as an example. This driver feeds info to
> kpromoted in this RFC patchset. More sources were discussed in a
> similar context here at [1].
> 

Is hot page promotion mandated or good to have? Memory tiers today
are a function of latency and bandwidth, specifically in 
mt_aperf_to_distance() 

adist ~ k * R(B)/R(L) where R(x) is relatively performance of the
memory w.r.t DRAM. Do we want hot pages in the top tier all the time?
Are we optimizing for bandwidth or latency?

> This is just an early attempt to check what it takes to maintain
> a single source of page hotness info and also separate hot page
> detection mechanisms from the promotion mechanism. There are too
> many open ends right now and I have listed a few of them below.
> 


<snip>

> This is just an early RFC posted now to ignite some discussion
> in the context of LSFMM [2].
> 

I look forward to any summary of the discussions

Balbir Singh


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/4] Kernel daemon for detecting and promoting hot pages
  2025-03-16 22:00 ` [RFC PATCH 0/4] Kernel daemon for detecting and promoting hot pages SeongJae Park
@ 2025-03-18  6:33   ` Raghavendra K T
  2025-03-18 10:45   ` Bharata B Rao
  1 sibling, 0 replies; 38+ messages in thread
From: Raghavendra K T @ 2025-03-18  6:33 UTC (permalink / raw)
  To: SeongJae Park, Bharata B Rao
  Cc: linux-kernel, linux-mm, AneeshKumar.KizhakeVeetil, Hasan.Maruf,
	Jonathan.Cameron, Michael.Day, akpm, dave.hansen, david,
	feng.tang, gourry, hannes, honggyu.kim, hughd, jhubbard,
	k.shutemov, kbusch, kmanaouil.dev, leesuyeon0506, leillc,
	liam.howlett, mgorman, mingo, nadav.amit, nphamcs, peterz, riel,
	rientjes, rppt, shivankg, shy828301, vbabka, weixugc, willy,
	ying.huang, ziy, dave, yuanchu, hyeonggon.yoo, Harry Yoo



On 3/17/2025 3:30 AM, SeongJae Park wrote:
> + Harry, who was called Hyeonggon before.
>>
>> I am also working with Raghu to integrate his kmmdscan [3] as the
>> hotness source and use kpromoted for migration.
> 
> Raghu also mentioned he would try to take a time to look into DAMON if there is
> anything that he could reuse for the purpose.  I'm curious if he was able to
> find something there.
> 
[...]
Hello SJ,

I did take a look at DAMON vaddr and paddr implementation. Also
wondering how can I optimize hotness data collected by kmmscand.

DAMON regions should be very helpful here, But I am not there yet.

will surely need help brainstorming session post my next RFC.

Thanks and Regards
- Raghu


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/4] Kernel daemon for detecting and promoting hot pages
  2025-03-16 22:00 ` [RFC PATCH 0/4] Kernel daemon for detecting and promoting hot pages SeongJae Park
  2025-03-18  6:33   ` Raghavendra K T
@ 2025-03-18 10:45   ` Bharata B Rao
  1 sibling, 0 replies; 38+ messages in thread
From: Bharata B Rao @ 2025-03-18 10:45 UTC (permalink / raw)
  To: SeongJae Park
  Cc: linux-kernel, linux-mm, AneeshKumar.KizhakeVeetil, Hasan.Maruf,
	Jonathan.Cameron, Michael.Day, akpm, dave.hansen, david,
	feng.tang, gourry, hannes, honggyu.kim, hughd, jhubbard,
	k.shutemov, kbusch, kmanaouil.dev, leesuyeon0506, leillc,
	liam.howlett, mgorman, mingo, nadav.amit, nphamcs, peterz,
	raghavendra.kt, riel, rientjes, rppt, shivankg, shy828301,
	vbabka, weixugc, willy, ying.huang, ziy, dave, yuanchu,
	hyeonggon.yoo, Harry Yoo

Hi SJ,

Thanks for your detailed points and this surely sets up a good context 
for discussion in LSFMM.

Please see my replies to a few of your questions below:

On 17-Mar-25 3:30 AM, SeongJae Park wrote:
>>
>> Currently I have added AMD IBS driver as one source that provides
>> page access information as an example. This driver feeds info to
>> kpromoted in this RFC patchset. More sources were discussed in a
>> similar context here at [1].
> 
> I was imagining how I would be able to do this with DAMON via operations set
> layer interface.  And I find thee current interface is not very optimized for
> AMD IBS like sources that catches the access on the line.  That is, in a way,
> we could say AMD IBS like primitives as push-oriented, while page tables'
> accessed bits information are pull-oriented.  DAMON operations set layer
> interface is easier to be used in pull-oriented case.  I don't think it cannot
> be used for push-oriented case, but definitely the interface would better to be
> more optimized for the use case.
> 
> I'm curious if you also tried doing this by extending DAMON, and if some hidden
> problems you found.

I remember discussing this with you during DAMON BoF in one of the 
earlier LPC events, but I didn't get to try it. Guess now is the time :-)

I see the challenge with the current DAMON interfaces to integrate IBS 
provided access info. If you check my IBS driver, I store the incoming 
access info from IBS into per-cpu buffers before pushing them on to the 
subsystem that act on them. I would think pull-based DAMON interfaces 
can consume those buffered samples rather than IBS pushing samples into 
DAMON. But I am yet to get clarity on how to honor the region based 
sampling that is inherent to DAMON's functioning. May be only using 
samples that are of interest to the region being tracked could be one way.

> 
>>
>> This is just an early attempt to check what it takes to maintain
>> a single source of page hotness info and also separate hot page
>> detection mechanisms from the promotion mechanism. There are too
>> many open ends right now and I have listed a few of them below.
>>
>> - The API that is provided to register memory access expects
>>    the PFN, NID and time of access at the minimum. This is
>>    described more in patch 2/4. This API currently can be called
>>    only from contexts that allow sleeping and hence this rules
>>    out using it from PTE scanning paths. The API needs to be
>>    more flexible with respect to this.
>> - Some sources like PTE A bit scanning can't provide the precise
>>    time of access or the NID that is accessing the page. The latter
>>    has been an open problem to which I haven't come across a good
>>    and acceptable solution.
> 
> Agree.  PTE A bit scanning could be useful in many cases, but not every case.
> There was an RFC patchset[7] that extends DAMON for NID.  I'm planning to do
> that again using DAMON operations layer interface.  My current plan is to
> implement the prototype using prot_none page faults, and later extend for AMD
> IBS like h/w features.  Hopefully I will share a prototype or at least more
> detailed idea on LSFMMBPF 2025.
> 
>> - The way the hot page information is maintained is pretty
>>    primitive right now. Ideally we would like to store hotness info
>>    in such a way that it should be easily possible to lookup say N
>>    most hot pages.
> 
> DAMON provides a feature for lookup of N most hotpages, namely DAMOS quotas'
> access pattern based regions prioritization[5].
> 
>> - If PTE A bit scanners are considered as hotness sources, we will
>>    be bombarded with accesses. Do we want to accomodate all those
>>    accesses or just go with hotness info for fixed number of pages
>>    (possibly as a ratio of lower tier memory capacity)?
> 
> I understand you're saying about memory space overhead.  Correct me if I'm
> wrong, please.

Correct and also the overhead of managing so much data. What I see is 
that if I start pushing all the access info obtained from LRU pgtable 
scanning, kpromoted would end up spending a lot of time in operations 
like lookup, walking the list of hot pages etc.

So may be it would be better to do some sort of early processing and/or 
filtering at the hotness source level itself before letting 
kpromoted-like subsystems to do further tracking and action.

> 
> Isn't same issue exists for current implementation of the sampling frequency is
> high, and/or aggregation window is long?
> 
> To me, hence, this looks like not a problem of the information source, but how
> to maintain the information.  Current implementation maintains it per page, so
> I think the problem is inherent.

Well yes, but we the goal could be do better than NUMAB=2 which does 
per-page level tracking.

> 
> DAMON maintains the information in region abstraction that can save multiple
> pages with one data structure.  The maximum number of regions can be set by
> users, so the space overhead can be controlled.

The granularity of tracking - per-page vs range/region is a topic of 
discussion I suppose.

Regards,
Bharata.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 2/4] mm: kpromoted: Hot page info collection and promotion daemon
  2025-03-18  4:09     ` Bharata B Rao
@ 2025-03-18 14:17       ` Jonathan Cameron
  0 siblings, 0 replies; 38+ messages in thread
From: Jonathan Cameron @ 2025-03-18 14:17 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-kernel, linux-mm, AneeshKumar.KizhakeVeetil, Hasan.Maruf,
	Michael.Day, akpm, dave.hansen, david, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, k.shutemov, kbusch, kmanaouil.dev,
	leesuyeon0506, leillc, liam.howlett, mgorman, mingo, nadav.amit,
	nphamcs, peterz, raghavendra.kt, riel, rientjes, rppt, shivankg,
	shy828301, sj, vbabka, weixugc, willy, ying.huang, ziy, dave,
	yuanchu, hyeonggon.yoo

On Tue, 18 Mar 2025 09:39:17 +0530
Bharata B Rao <bharata@amd.com> wrote:

> On 14-Mar-25 8:58 PM, Jonathan Cameron wrote:
> > On Thu, 6 Mar 2025 11:15:30 +0530
> > Bharata B Rao <bharata@amd.com> wrote:
> >   
> >> Subsystems that generate hot page access info can report that
> >> to kpromoted via this API:
> >>
> >> int kpromoted_record_access(u64 pfn, int nid, int src,
> >> 			    unsigned long time)  
> > 
> > This perhaps works as an interface for aggregating methods
> > that produce per access events.  Any hardware counter solution
> > is going to give you data that is closer to what you used for
> > the promotion decision.  
> 
> Right.
> 
> > 
> > We might need to aggregate at different levels.  So access
> > counting promotes to a hot list and we can inject other events
> > at that level.  The data I have from the CXL HMU is typically
> > after an epoch (period of time) these N pages were accessed more
> > than M times.  I can sort of map that to the internal storage
> > you have.  
> 
> Even for IBS source, I am aggregating data in per-cpu buffers before 
> presenting them one by one to kpromoted. Guess CXL HMU aggregated data 
> could be presented in a similar manner.

The nature of the data maybe a bit different but certainly should be
able to find somewhere in the stack!

> 
> > 
> > Would be good to evaluate approximate trackers on top of access
> > counts. I've no idea if sketches or similar would be efficient
> > enough (they have a bit of a write amplification problem) but
> > they may give good answers with much lower storage cost at the
> > risk of occasionally saying something is hot when it's not.  
> 
> Could me point me to some information about sketches??

https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch
Is a good starting point but there are lots of related techniques
that trade off good statistical properties against complexity etc.

Roughly speaking you combine a sorted list of the very hottest
with a small number of different hash tables (the sketch) that let you
get an estimate of how hot things are that have dropped off your
hottest list (or not yet gotten hot enough to get into it). 

Looking for literature for top-k algorithms will find you
more references though not all are light weight enough to be
of interest here.

> 
> >   
> >>
> >> @pfn: The PFN of the memory accessed
> >> @nid: The accessing NUMA node ID
> >> @src: The temperature source (subsystem) that generated the
> >>        access info
> >> @time: The access time in jiffies
> >>
> >> Some temperature sources may not provide the nid from which
> >> the page was accessed. This is true for sources that use
> >> page table scanning for PTE Accessed bit. Currently the toptier
> >> node to which such pages should be promoted to is hard coded.  
> > 
> > For those cases (CXL HMU included) maybe we need to
> > consider how to fill in missing node info with at least a vague chance
> > of getting a reasonable target for migration.  We can always fall
> > back to random top tier node, or nearest one to where we are coming
> > from (on basis we maybe landed in this node based on a fallback
> > list when the top tier was under memory pressure).  
> 
> Yes. For A-bit scanners, Raghu has devised a scheme to obtain the best 
> possible list of target nodes for promotion. He should be sharing more 
> about it soon.

Excellent - look forward to seeing that.  Can think of a few possibilities
on how to get that data efficiently so I'm curious what Raghu has chosen.

> 
> > 
> >  From an interface point of view is that a problem for this layer,
> > or for the underlying tracking mechanism? (maybe with some helpers)  
> 
> It is not a problem from this interface point of view as this interface 
> expects a nid(or default value) and would use that for promotion. It is 
> up to the underlying tracking mechanism to provide the most appropriate 
> target nid.

I was wondering if there is some sharing to do, so whether we push Raghu's
means of getting a target node down into the tracker implementation or use
it to fill in messing info at this layer.  Will depend a bit on how
that technique works perhaps.

> 
> > Also, see later discussion of consistency of hotness tracking and
> > that the best solution for that differs from that to get
> > potential targets.  The answer to Is this page consistently hot?" can be
> > approximated with "Was this page once hot and is it not now cold?"
> > 
> > Access time is something some measurement techniques will only
> > give you wrt to a measurement was in a window (potentially a long
> > one if you are looking for consistent hotness over minutes).
> >   
> >>
> >> Also, the access time provided some sources may at best be
> >> considered approximate. This is especially true for hot pages
> >> detected by PTE A bit scanning.
> >>
> >> kpromoted currently maintains the hot PFN records in hash lists
> >> hashed by PFN value. Each record stores the following info:
> >>
> >> struct page_hotness_info {
> >> 	unsigned long pfn;
> >>
> >> 	/* Time when this record was updated last */
> >> 	unsigned long last_update;
> >>
> >> 	/*
> >> 	 * Number of times this page was accessed in the
> >> 	 * current window  
> > I'd express here how that window was defined (I read on
> > to answer the question I had here at first!)  
> 
> Currently the number of accesses that occur within an observation window 
> of 5s are considered for hotness calculation and access count is reset 
> when the window elapses. This needs to factor in history etc.

Just add that to the comment here perhaps.


> 
> >   
> >>
> >> 	/* Most recent access from this node */
> >> 	int hot_node;  
> > 
> > Probably want to relax the most recent part.  I'd guess
> > the ideal here would be if this is the node accessing it the most
> > 'recently'.  
> 
> You mean the node that did most number of accesses in the given 
> observation window and not necessarily the last (or most recent) 
> accessed node.

yes. Though maybe weighted in some fashion for recency?  Something
cheap to do that approximates that such as small saturating counters
with aging..

> 
> >   
> >>
> >> 	struct hlist_node hnode;
> >> };
> >>
> >> The way in which a page is categorized as hot enough to be
> >> promoted is pretty primitive now.  
> > 
> > That bit is very hard even if we solve everything else and heavily dependent
> > on workload access pattern stability and migration impact.  Maybe for
> > 'very hot' pages a fairly short consistency of hotness period is
> > good enough, but it gets much messier if we care about warm pages.
> > I guess we solve the 'very hot' first though and maybe avoid the phase
> > transition from an application starting to when it is at steady state
> > by considering a wait time for any new userspace process before we
> > consider moving anything?
> > 
> > Also worth noting that the mechanism that makes sense to check if a
> > detected hot page is 'stable hot' might use entirely different tracking
> > approach to that used to find it as a candidate.
> > 
> > Whether that requires passing data between hotness trackers is an
> > interesting question, or whether there is a natural ordering to trackers.  
> 
> I was envisioning that different hotness trackers would reinforce the 
> page hotness by reporting the same to kpromoted and there would be no 
> need to again pass data between hotness trackers.

What makes me wonder about that is the question of stability of hotness.
It is a really bad idea to move data based on a short sample - cost is huge
and quite a bit of data is only briefly hot - moving it to fast memory too
early just results in bouncing.  There are probably heuristics we can apply
on process age etc that will help, but generally we can't assume programs
don't have multiple phases with very different access characteristics.

The different tracking approaches have different sweet spots for short vs long
tracking. So it might be a case of one method, e.g. a hotness tracker
is only suitable for monitoring a short time period
(in the simplest sense, counters saturate if you run too long).
Don't read that to generally though as it's not a universal characteristic
and depends on the implementation used, but it is definitely true of
some potential implementations.

Having gotten a list of 1000+ candidate pages that might be worth moving,
we could use access bits to check it's still accessed reasonsably frequently
over the next minute. That can be much lower cost than an access tracker
that is looking for 'hottest'.

Where all these trade offs with timing occur is tricky and workload
dependent.  So figuring out how to autotune will be a challenging.

> 
> > 
> > 
> >   
> >> diff --git a/mm/kpromoted.c b/mm/kpromoted.c
> >> new file mode 100644
> >> index 000000000000..2a8b8495b6b3
> >> --- /dev/null
> >> +++ b/mm/kpromoted.c  


> >> +	bkt = hash_min(pfn, KPROMOTED_HASH_ORDER);
> >> +	mutex_lock(&page_hotness_lock[bkt]);
> >> +	phi = kpromoted_lookup(pfn, bkt, now);
> >> +	if (!phi) {
> >> +		ret = PTR_ERR(phi);
> >> +		goto out;
> >> +	}
> >> +
> >> +	if ((phi->last_update - now) > msecs_to_jiffies(KPROMOTED_FREQ_WINDOW)) {
> >> +		/* New window */
> >> +		phi->frequency = 1; /* TODO: Factor in the history */
> >> +		phi->last_update = now;
> >> +	} else {
> >> +		phi->frequency++;
> >> +	}
> >> +	phi->recency = now;
> >> +
> >> +	/*
> >> +	 * TODOs:
> >> +	 * 1. Source nid is hard-coded for some temperature sources  
> > 
> > Hard coded rather than unknown? I'm curious, what source has that issue?  
> 
> I meant that source didn't provide a nid and hence kpromoted ended up 
> promoting to a fixed (hard-coded for now) toptier node.

Sure. Unknown nid makes sense here.


Thanks,

Jonathan




^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/4] Kernel daemon for detecting and promoting hot pages
  2025-03-18  5:28 ` Balbir Singh
@ 2025-03-20  9:07   ` Bharata B Rao
  2025-03-21  6:19     ` Balbir Singh
  0 siblings, 1 reply; 38+ messages in thread
From: Bharata B Rao @ 2025-03-20  9:07 UTC (permalink / raw)
  To: Balbir Singh, linux-kernel, linux-mm
  Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Jonathan.Cameron,
	Michael.Day, akpm, dave.hansen, david, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, k.shutemov, kbusch, kmanaouil.dev,
	leesuyeon0506, leillc, liam.howlett, mgorman, mingo, nadav.amit,
	nphamcs, peterz, raghavendra.kt, riel, rientjes, rppt, shivankg,
	shy828301, sj, vbabka, weixugc, willy, ying.huang, ziy, dave,
	yuanchu

Hi Balbir,

On 18-Mar-25 10:58 AM, Balbir Singh wrote:
> On 3/6/25 16:45, Bharata B Rao wrote:
>> Hi,
>>
>> This is an attempt towards having a single subsystem that accumulates
>> hot page information from lower memory tiers and does hot page
>> promotion.
>>
>> At the heart of this subsystem is a kernel daemon named kpromoted that
>> does the following:
>>
>> 1. Exposes an API that other subsystems which detect/generate memory
>>     access information can use to inform the daemon about memory
>>     accesses from lower memory tiers.
>> 2. Maintains the list of hot pages and attempts to promote them to
>>     toptiers.
>>
>> Currently I have added AMD IBS driver as one source that provides
>> page access information as an example. This driver feeds info to
>> kpromoted in this RFC patchset. More sources were discussed in a
>> similar context here at [1].
>>
> 
> Is hot page promotion mandated or good to have?

If you look at the current hot page promotion (NUMAB=2) logic, IIUC an 
accessed lower tier page is directly promoted to toptier if enough space 
exists in the toptier node. In such cases, it doesn't even bother about 
the hot threshold (measure of how recently it was accessed) or migration 
rate limiting. This tells me that it in a tiered memory setup, having an 
accessed page in toptier is preferrable.

> Memory tiers today
> are a function of latency and bandwidth, specifically in
> mt_aperf_to_distance()
> 
> adist ~ k * R(B)/R(L) where R(x) is relatively performance of the
> memory w.r.t DRAM. Do we want hot pages in the top tier all the time?
> Are we optimizing for bandwidth or latency?

When memory tiering code converts BW and latency numbers into an opaque 
metric adistance based on which the node gets placed at an appropriate 
position in the tiering hierarchy, I wonder if it is still possible to 
say if we are optimizing for bandwidth or latency separately?

>> This is just an early attempt to check what it takes to maintain
>> a single source of page hotness info and also separate hot page
>> detection mechanisms from the promotion mechanism. There are too
>> many open ends right now and I have listed a few of them below.
>>
> 
> 
> <snip>
> 
>> This is just an early RFC posted now to ignite some discussion
>> in the context of LSFMM [2].
>>
> 
> I look forward to any summary of the discussions

Sure. Thanks,
Bharata.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/4] Kernel daemon for detecting and promoting hot pages
  2025-03-20  9:07   ` Bharata B Rao
@ 2025-03-21  6:19     ` Balbir Singh
  0 siblings, 0 replies; 38+ messages in thread
From: Balbir Singh @ 2025-03-21  6:19 UTC (permalink / raw)
  To: Bharata B Rao, linux-kernel, linux-mm
  Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Jonathan.Cameron,
	Michael.Day, akpm, dave.hansen, david, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, k.shutemov, kbusch, kmanaouil.dev,
	leesuyeon0506, leillc, liam.howlett, mgorman, mingo, nadav.amit,
	nphamcs, peterz, raghavendra.kt, riel, rientjes, rppt, shivankg,
	shy828301, sj, vbabka, weixugc, willy, ying.huang, ziy, dave,
	yuanchu

On 3/20/25 20:07, Bharata B Rao wrote:
> Hi Balbir,
> 
> On 18-Mar-25 10:58 AM, Balbir Singh wrote:
>> On 3/6/25 16:45, Bharata B Rao wrote:
>>> Hi,
>>>
>>> This is an attempt towards having a single subsystem that accumulates
>>> hot page information from lower memory tiers and does hot page
>>> promotion.
>>>
>>> At the heart of this subsystem is a kernel daemon named kpromoted that
>>> does the following:
>>>
>>> 1. Exposes an API that other subsystems which detect/generate memory
>>>     access information can use to inform the daemon about memory
>>>     accesses from lower memory tiers.
>>> 2. Maintains the list of hot pages and attempts to promote them to
>>>     toptiers.
>>>
>>> Currently I have added AMD IBS driver as one source that provides
>>> page access information as an example. This driver feeds info to
>>> kpromoted in this RFC patchset. More sources were discussed in a
>>> similar context here at [1].
>>>
>>
>> Is hot page promotion mandated or good to have?
> 
> If you look at the current hot page promotion (NUMAB=2) logic, IIUC an accessed lower tier page is directly promoted to toptier if enough space exists in the toptier node. In such cases, it doesn't even bother about the hot threshold (measure of how recently it was accessed) or migration rate limiting. This tells me that it in a tiered memory setup, having an accessed page in toptier is preferrable.
> 

I'll review the patches, I don't agree with toptier, I think DRAM is the
right tier

>> Memory tiers today
>> are a function of latency and bandwidth, specifically in
>> mt_aperf_to_distance()
>>
>> adist ~ k * R(B)/R(L) where R(x) is relatively performance of the
>> memory w.r.t DRAM. Do we want hot pages in the top tier all the time?
>> Are we optimizing for bandwidth or latency?
> 
> When memory tiering code converts BW and latency numbers into an opaque metric adistance based on which the node gets placed at an appropriate position in the tiering hierarchy, I wonder if it is still possible to say if we are optimizing for bandwidth or latency separately?

I think we need a notion of that, just higher tiers may not be right.
IOW, I think we need to promote to at-most the DRAM tier, not above it.


Balbir Singh


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 1/4] mm: migrate: Allow misplaced migration without VMA too
  2025-03-06  5:45 ` [RFC PATCH 1/4] mm: migrate: Allow misplaced migration without VMA too Bharata B Rao
  2025-03-06 12:13   ` David Hildenbrand
  2025-03-06 17:24   ` Gregory Price
@ 2025-03-24  2:55   ` Balbir Singh
  2025-03-24 14:51     ` Bharata B Rao
  2 siblings, 1 reply; 38+ messages in thread
From: Balbir Singh @ 2025-03-24  2:55 UTC (permalink / raw)
  To: Bharata B Rao, linux-kernel, linux-mm
  Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Jonathan.Cameron,
	Michael.Day, akpm, dave.hansen, david, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, k.shutemov, kbusch, kmanaouil.dev,
	leesuyeon0506, leillc, liam.howlett, mgorman, mingo, nadav.amit,
	nphamcs, peterz, raghavendra.kt, riel, rientjes, rppt, shivankg,
	shy828301, sj, vbabka, weixugc, willy, ying.huang, ziy, dave,
	yuanchu, hyeonggon.yoo

On 3/6/25 16:45, Bharata B Rao wrote:
> migrate_misplaced_folio_prepare() can be called from a
> context where VMA isn't available. Allow the migration
> to work from such contexts too.
> 
> Signed-off-by: Bharata B Rao <bharata@amd.com>
> ---
>  mm/migrate.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index fb19a18892c8..5b21856a0dd0 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2639,7 +2639,8 @@ static struct folio *alloc_misplaced_dst_folio(struct folio *src,
>  
>  /*
>   * Prepare for calling migrate_misplaced_folio() by isolating the folio if
> - * permitted. Must be called with the PTL still held.
> + * permitted. Must be called with the PTL still held if called with a non-NULL
> + * vma.
>   */
>  int migrate_misplaced_folio_prepare(struct folio *folio,
>  		struct vm_area_struct *vma, int node)
> @@ -2656,7 +2657,7 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
>  		 * See folio_likely_mapped_shared() on possible imprecision
>  		 * when we cannot easily detect if a folio is shared.
>  		 */
> -		if ((vma->vm_flags & VM_EXEC) &&
> +		if (vma && (vma->vm_flags & VM_EXEC) &&
>  		    folio_likely_mapped_shared(folio))
>  			return -EACCES;
>  

In the worst case, the absence of the vma would mean that we try to isolate
and migrate a shared folio with executable pages. Are those a key target for the
hot page migration?

Balbir


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 2/4] mm: kpromoted: Hot page info collection and promotion daemon
  2025-03-06  5:45 ` [RFC PATCH 2/4] mm: kpromoted: Hot page info collection and promotion daemon Bharata B Rao
                     ` (3 preceding siblings ...)
  2025-03-14 15:28   ` Jonathan Cameron
@ 2025-03-24  3:35   ` Balbir Singh
  2025-03-28  4:55     ` Bharata B Rao
  2025-03-24 13:43   ` Gregory Price
  5 siblings, 1 reply; 38+ messages in thread
From: Balbir Singh @ 2025-03-24  3:35 UTC (permalink / raw)
  To: Bharata B Rao, linux-kernel, linux-mm
  Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Jonathan.Cameron,
	Michael.Day, akpm, dave.hansen, david, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, k.shutemov, kbusch, kmanaouil.dev,
	leesuyeon0506, leillc, liam.howlett, mgorman, mingo, nadav.amit,
	nphamcs, peterz, raghavendra.kt, riel, rientjes, rppt, shivankg,
	shy828301, sj, vbabka, weixugc, willy, ying.huang, ziy, dave,
	yuanchu, hyeonggon.yoo

On 3/6/25 16:45, Bharata B Rao wrote:
> kpromoted is a kernel daemon that accumulates hot page info
> from different sources and tries to promote pages from slow
> tiers to top tiers. One instance of this thread runs on each
> node that has CPUs.
> 

Could you please elaborate on what is slow vs top tier? A top tier uses
adist (which is a combination of bandwidth and latency), so I am
not sure the terminology here holds.

> Subsystems that generate hot page access info can report that
> to kpromoted via this API:
> 
> int kpromoted_record_access(u64 pfn, int nid, int src,
> 			    unsigned long time)
> 
> @pfn: The PFN of the memory accessed
> @nid: The accessing NUMA node ID
> @src: The temperature source (subsystem) that generated the
>       access info
> @time: The access time in jiffies
> 
> Some temperature sources may not provide the nid from which

What is a temperature source?

> the page was accessed. This is true for sources that use
> page table scanning for PTE Accessed bit. Currently the toptier
> node to which such pages should be promoted to is hard coded.
> 

What would it take to make this flexible?

> Also, the access time provided some sources may at best be
> considered approximate. This is especially true for hot pages
> detected by PTE A bit scanning.
> 
> kpromoted currently maintains the hot PFN records in hash lists
> hashed by PFN value. Each record stores the following info:
> 
> struct page_hotness_info {
> 	unsigned long pfn;
> 
> 	/* Time when this record was updated last */
> 	unsigned long last_update;
> 
> 	/*
> 	 * Number of times this page was accessed in the
> 	 * current window
> 	 */
> 	int frequency;
> 
> 	/* Most recent access time */
> 	unsigned long recency;
> 
> 	/* Most recent access from this node */
> 	int hot_node;
> 
> 	struct hlist_node hnode;
> };
> 
> The way in which a page is categorized as hot enough to be
> promoted is pretty primitive now.
> 
> Signed-off-by: Bharata B Rao <bharata@amd.com>
> ---
>  include/linux/kpromoted.h     |  54 ++++++
>  include/linux/mmzone.h        |   4 +
>  include/linux/vm_event_item.h |  13 ++
>  mm/Kconfig                    |   7 +
>  mm/Makefile                   |   1 +
>  mm/kpromoted.c                | 305 ++++++++++++++++++++++++++++++++++
>  mm/mm_init.c                  |  10 ++
>  mm/vmstat.c                   |  13 ++
>  8 files changed, 407 insertions(+)
>  create mode 100644 include/linux/kpromoted.h
>  create mode 100644 mm/kpromoted.c
> 
> diff --git a/include/linux/kpromoted.h b/include/linux/kpromoted.h
> new file mode 100644
> index 000000000000..2bef3d74f03a
> --- /dev/null
> +++ b/include/linux/kpromoted.h
> @@ -0,0 +1,54 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_KPROMOTED_H
> +#define _LINUX_KPROMOTED_H
> +
> +#include <linux/types.h>
> +#include <linux/init.h>
> +#include <linux/workqueue_types.h>
> +
> +/* Page hotness temperature sources */
> +enum kpromoted_src {
> +	KPROMOTED_HW_HINTS,
> +	KPROMOTED_PGTABLE_SCAN,
> +};
> +
> +#ifdef CONFIG_KPROMOTED
> +
> +#define KPROMOTED_FREQ_WINDOW	(5 * MSEC_PER_SEC)
> +
> +/* 2 accesses within a window will make the page a promotion candidate */
> +#define KPRMOTED_FREQ_THRESHOLD	2
> +

Were these value derived empirically?


> +#define KPROMOTED_HASH_ORDER	16
> +
> +struct page_hotness_info {
> +	unsigned long pfn;
> +
> +	/* Time when this record was updated last */
> +	unsigned long last_update;
> +
> +	/*
> +	 * Number of times this page was accessed in the
> +	 * current window
> +	 */
> +	int frequency;
> +
> +	/* Most recent access time */
> +	unsigned long recency;
> +
> +	/* Most recent access from this node */
> +	int hot_node;
> +	struct hlist_node hnode;
> +};
> +
> +#define KPROMOTE_DELAY	MSEC_PER_SEC
> +
> +int kpromoted_record_access(u64 pfn, int nid, int src, unsigned long now);
> +#else
> +static inline int kpromoted_record_access(u64 pfn, int nid, int src,
> +					  unsigned long now)
> +{
> +	return 0;
> +}
> +#endif /* CONFIG_KPROMOTED */
> +#endif /* _LINUX_KPROMOTED_H */
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 9540b41894da..a5c4e789aa55 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1459,6 +1459,10 @@ typedef struct pglist_data {
>  #ifdef CONFIG_MEMORY_FAILURE
>  	struct memory_failure_stats mf_stats;
>  #endif
> +#ifdef CONFIG_KPROMOTED
> +	struct task_struct *kpromoted;
> +	wait_queue_head_t kpromoted_wait;
> +#endif
>  } pg_data_t;
>  
>  #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index f70d0958095c..b5823b037883 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -182,6 +182,19 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		KSTACK_REST,
>  #endif
>  #endif /* CONFIG_DEBUG_STACK_USAGE */
> +		KPROMOTED_RECORDED_ACCESSES,
> +		KPROMOTED_RECORD_HWHINTS,
> +		KPROMOTED_RECORD_PGTSCANS,
> +		KPROMOTED_RECORD_TOPTIER,
> +		KPROMOTED_RECORD_ADDED,
> +		KPROMOTED_RECORD_EXISTS,
> +		KPROMOTED_MIG_RIGHT_NODE,
> +		KPROMOTED_MIG_NON_LRU,
> +		KPROMOTED_MIG_COLD_OLD,
> +		KPROMOTED_MIG_COLD_NOT_ACCESSED,
> +		KPROMOTED_MIG_CANDIDATE,
> +		KPROMOTED_MIG_PROMOTED,
> +		KPROMOTED_MIG_DROPPED,
>  		NR_VM_EVENT_ITEMS
>  };
>  
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 1b501db06417..ceaa462a0ce6 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1358,6 +1358,13 @@ config PT_RECLAIM
>  
>  	  Note: now only empty user PTE page table pages will be reclaimed.
>  
> +config KPROMOTED
> +	bool "Kernel hot page promotion daemon"
> +	def_bool y
> +	depends on NUMA && MIGRATION && MMU
> +	help
> +	  Promote hot pages from lower tier to top tier by using the
> +	  memory access information provided by various sources.
>  
>  source "mm/damon/Kconfig"
>  
> diff --git a/mm/Makefile b/mm/Makefile
> index 850386a67b3e..bf4f5f18f1f9 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -147,3 +147,4 @@ obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
>  obj-$(CONFIG_EXECMEM) += execmem.o
>  obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
>  obj-$(CONFIG_PT_RECLAIM) += pt_reclaim.o
> +obj-$(CONFIG_KPROMOTED) += kpromoted.o
> diff --git a/mm/kpromoted.c b/mm/kpromoted.c
> new file mode 100644
> index 000000000000..2a8b8495b6b3
> --- /dev/null
> +++ b/mm/kpromoted.c
> @@ -0,0 +1,305 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * kpromoted is a kernel thread that runs on each node that has CPU i,e.,
> + * on regular nodes.
> + *
> + * Maintains list of hot pages from lower tiers and promotes them.
> + */
> +#include <linux/kpromoted.h>
> +#include <linux/kthread.h>
> +#include <linux/mutex.h>
> +#include <linux/mmzone.h>
> +#include <linux/migrate.h>
> +#include <linux/memory-tiers.h>
> +#include <linux/slab.h>
> +#include <linux/sched.h>
> +#include <linux/cpuhotplug.h>
> +#include <linux/hashtable.h>
> +
> +static DEFINE_HASHTABLE(page_hotness_hash, KPROMOTED_HASH_ORDER);
> +static struct mutex page_hotness_lock[1UL << KPROMOTED_HASH_ORDER];
> +
> +static int kpromote_page(struct page_hotness_info *phi)
> +{

Why not just call it kpromote_folio?

> +	struct page *page = pfn_to_page(phi->pfn);
> +	struct folio *folio;
> +	int ret;
> +
> +	if (!page)
> +		return 1;

Do we need to check for is_zone_device_page() here?

> +
> +	folio = page_folio(page);
> +	ret = migrate_misplaced_folio_prepare(folio, NULL, phi->hot_node);
> +	if (ret)
> +		return 1;
> +
> +	return migrate_misplaced_folio(folio, phi->hot_node);
> +}


Could you please document the assumptions for kpromote_page(), what locks
should be held? Does the ref count need to be incremented?

> +
> +static int page_should_be_promoted(struct page_hotness_info *phi)
> +{
> +	struct page *page = pfn_to_online_page(phi->pfn);
> +	unsigned long now = jiffies;
> +	struct folio *folio;
> +
> +	if (!page || is_zone_device_page(page))
> +		return false;
> +
> +	folio = page_folio(page);
> +	if (!folio_test_lru(folio)) {
> +		count_vm_event(KPROMOTED_MIG_NON_LRU);
> +		return false;
> +	}
> +	if (folio_nid(folio) == phi->hot_node) {
> +		count_vm_event(KPROMOTED_MIG_RIGHT_NODE);
> +		return false;
> +	}
> +
> +	/* If the page was hot a while ago, don't promote */
> +	if ((now - phi->last_update) > 2 * msecs_to_jiffies(KPROMOTED_FREQ_WINDOW)) {
> +		count_vm_event(KPROMOTED_MIG_COLD_OLD);

Shouldn't we update phi->last_update here?

> +		return false;
> +	}
> +
> +	/* If the page hasn't been accessed enough number of times, don't promote */
> +	if (phi->frequency < KPRMOTED_FREQ_THRESHOLD) {
> +		count_vm_event(KPROMOTED_MIG_COLD_NOT_ACCESSED);
> +		return false;
> +	}
> +	return true;
> +}
> +
> +/*
> + * Go thro' page hotness information and migrate pages if required.
> + *
> + * Promoted pages are not longer tracked in the hot list.
> + * Cold pages are pruned from the list as well.
> + *
> + * TODO: Batching could be done
> + */
> +static void kpromoted_migrate(pg_data_t *pgdat)
> +{
> +	int nid = pgdat->node_id;
> +	struct page_hotness_info *phi;
> +	struct hlist_node *tmp;
> +	int nr_bkts = HASH_SIZE(page_hotness_hash);
> +	int bkt;
> +
> +	for (bkt = 0; bkt < nr_bkts; bkt++) {
> +		mutex_lock(&page_hotness_lock[bkt]);
> +		hlist_for_each_entry_safe(phi, tmp, &page_hotness_hash[bkt], hnode) {
> +			if (phi->hot_node != nid)
> +				continue;
> +
> +			if (page_should_be_promoted(phi)) {
> +				count_vm_event(KPROMOTED_MIG_CANDIDATE);
> +				if (!kpromote_page(phi)) {
> +					count_vm_event(KPROMOTED_MIG_PROMOTED);
> +					hlist_del_init(&phi->hnode);
> +					kfree(phi);
> +				}
> +			} else {
> +				/*
> +				 * Not a suitable page or cold page, stop tracking it.
> +				 * TODO: Identify cold pages and drive demotion?
> +				 */
> +				count_vm_event(KPROMOTED_MIG_DROPPED);
> +				hlist_del_init(&phi->hnode);
> +				kfree(phi);

Won't existing demotion already handle this?

> +			}
> +		}
> +		mutex_unlock(&page_hotness_lock[bkt]);
> +	}
> +}
> +

It sounds like NUMA balancing, promotion and demotion can all act on parallel on
these folios, if not could you clarify their relationship and dependency?


> +static struct page_hotness_info *__kpromoted_lookup(unsigned long pfn, int bkt)
> +{
> +	struct page_hotness_info *phi;
> +
> +	hlist_for_each_entry(phi, &page_hotness_hash[bkt], hnode) {
> +		if (phi->pfn == pfn)
> +			return phi;
> +	}
> +	return NULL;
> +}
> +
> +static struct page_hotness_info *kpromoted_lookup(unsigned long pfn, int bkt, unsigned long now)
> +{
> +	struct page_hotness_info *phi;
> +
> +	phi = __kpromoted_lookup(pfn, bkt);
> +	if (!phi) {
> +		phi = kzalloc(sizeof(struct page_hotness_info), GFP_KERNEL);
> +		if (!phi)
> +			return ERR_PTR(-ENOMEM);
> +
> +		phi->pfn = pfn;
> +		phi->frequency = 1;
> +		phi->last_update = now;
> +		phi->recency = now;
> +		hlist_add_head(&phi->hnode, &page_hotness_hash[bkt]);
> +		count_vm_event(KPROMOTED_RECORD_ADDED);
> +	} else {
> +		count_vm_event(KPROMOTED_RECORD_EXISTS);
> +	}
> +	return phi;
> +}
> +
> +/*
> + * Called by subsystems that generate page hotness/access information.
> + *
> + * Records the memory access info for futher action by kpromoted.
> + */
> +int kpromoted_record_access(u64 pfn, int nid, int src, unsigned long now)
> +{
> +	struct page_hotness_info *phi;
> +	struct page *page;
> +	struct folio *folio;
> +	int ret, bkt;
> +
> +	count_vm_event(KPROMOTED_RECORDED_ACCESSES);
> +
> +	switch (src) {
> +	case KPROMOTED_HW_HINTS:
> +		count_vm_event(KPROMOTED_RECORD_HWHINTS);
> +		break;
> +	case KPROMOTED_PGTABLE_SCAN:
> +		count_vm_event(KPROMOTED_RECORD_PGTSCANS);
> +		break;
> +	default:
> +		break;
> +	}
> +
> +	/*
> +	 * Record only accesses from lower tiers.
> +	 * Assuming node having CPUs as toptier for now.
> +	 */
> +	if (node_is_toptier(pfn_to_nid(pfn))) {
> +		count_vm_event(KPROMOTED_RECORD_TOPTIER);
> +		return 0;
> +	}
> +
> +	page = pfn_to_online_page(pfn);
> +	if (!page || is_zone_device_page(page))
> +		return 0;
> +
> +	folio = page_folio(page);
> +	if (!folio_test_lru(folio))
> +		return 0;
> +
> +	bkt = hash_min(pfn, KPROMOTED_HASH_ORDER);
> +	mutex_lock(&page_hotness_lock[bkt]);
> +	phi = kpromoted_lookup(pfn, bkt, now);
> +	if (!phi) {
> +		ret = PTR_ERR(phi);
> +		goto out;
> +	}
> +
> +	if ((phi->last_update - now) > msecs_to_jiffies(KPROMOTED_FREQ_WINDOW)) {
> +		/* New window */
> +		phi->frequency = 1; /* TODO: Factor in the history */
> +		phi->last_update = now;
> +	} else {
> +		phi->frequency++;
> +	}
> +	phi->recency = now;
> +
> +	/*
> +	 * TODOs:
> +	 * 1. Source nid is hard-coded for some temperature sources
> +	 * 2. Take action if hot_node changes - may be a shared page?
> +	 * 3. Maintain node info for every access within the window?
> +	 */
> +	phi->hot_node = (nid == NUMA_NO_NODE) ? 1 : nid;

I don't understand why nid needs to be 1 if nid is NUMA_NODE_ID? Does
it mean that it's being promoted to the top tier, the mix of hot_node,
tier and nid is not very clear here.

> +	mutex_unlock(&page_hotness_lock[bkt]);
> +out:
> +	return 0;
> +}
> +
> +/*
> + * Go through the accumulated mem_access_info and migrate
> + * pages if required.
> + */
> +static void kpromoted_do_work(pg_data_t *pgdat)
> +{
> +	kpromoted_migrate(pgdat);
> +}
> +
> +static inline bool kpromoted_work_requested(pg_data_t *pgdat)
> +{
> +	return false;
> +}
> +
> +static int kpromoted(void *p)
> +{
> +	pg_data_t *pgdat = (pg_data_t *)p;
> +	struct task_struct *tsk = current;
> +	long timeout = msecs_to_jiffies(KPROMOTE_DELAY);
> +
> +	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> +
> +	if (!cpumask_empty(cpumask))
> +		set_cpus_allowed_ptr(tsk, cpumask);
> +
> +	while (!kthread_should_stop()) {
> +		wait_event_timeout(pgdat->kpromoted_wait,
> +				   kpromoted_work_requested(pgdat), timeout);
> +		kpromoted_do_work(pgdat);
> +	}
> +	return 0;
> +}
> +
> +static void kpromoted_run(int nid)
> +{
> +	pg_data_t *pgdat = NODE_DATA(nid);
> +
> +	if (pgdat->kpromoted)
> +		return;
> +
> +	pgdat->kpromoted = kthread_run(kpromoted, pgdat, "kpromoted%d", nid);
> +	if (IS_ERR(pgdat->kpromoted)) {
> +		pr_err("Failed to start kpromoted on node %d\n", nid);
> +		pgdat->kpromoted = NULL;
> +	}
> +}
> +
> +static int kpromoted_cpu_online(unsigned int cpu)
> +{
> +	int nid;
> +
> +	for_each_node_state(nid, N_CPU) {
> +		pg_data_t *pgdat = NODE_DATA(nid);
> +		const struct cpumask *mask;
> +
> +		mask = cpumask_of_node(pgdat->node_id);
> +
> +		if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
> +			/* One of our CPUs online: restore mask */
> +			if (pgdat->kpromoted)
> +				set_cpus_allowed_ptr(pgdat->kpromoted, mask);
> +	}
> +	return 0;
> +}
> +
> +static int __init kpromoted_init(void)
> +{
> +	int nid, ret, i;
> +
> +	ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
> +					"mm/promotion:online",
> +					kpromoted_cpu_online, NULL);
> +	if (ret < 0) {
> +		pr_err("kpromoted: failed to register hotplug callbacks.\n");
> +		return ret;
> +	}
> +
> +	for (i = 0; i < (1UL << KPROMOTED_HASH_ORDER); i++)
> +		mutex_init(&page_hotness_lock[i]);
> +
> +	for_each_node_state(nid, N_CPU)
> +		kpromoted_run(nid);
> +

I think we need a dynamic way to disabling promotion at run time
as well, right?


> +	return 0;
> +}
> +
> +subsys_initcall(kpromoted_init)
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 2630cc30147e..d212df24f89b 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -1362,6 +1362,15 @@ static void pgdat_init_kcompactd(struct pglist_data *pgdat)
>  static void pgdat_init_kcompactd(struct pglist_data *pgdat) {}
>  #endif
>  
> +#ifdef CONFIG_KPROMOTED
> +static void pgdat_init_kpromoted(struct pglist_data *pgdat)
> +{
> +	init_waitqueue_head(&pgdat->kpromoted_wait);
> +}
> +#else
> +static void pgdat_init_kpromoted(struct pglist_data *pgdat) {}
> +#endif
> +
>  static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
>  {
>  	int i;
> @@ -1371,6 +1380,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
>  
>  	pgdat_init_split_queue(pgdat);
>  	pgdat_init_kcompactd(pgdat);
> +	pgdat_init_kpromoted(pgdat);
>  
>  	init_waitqueue_head(&pgdat->kswapd_wait);
>  	init_waitqueue_head(&pgdat->pfmemalloc_wait);
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 16bfe1c694dd..618f44bae5c8 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1466,6 +1466,19 @@ const char * const vmstat_text[] = {
>  	"kstack_rest",
>  #endif
>  #endif
> +	"kpromoted_recorded_accesses",
> +	"kpromoted_recorded_hwhints",
> +	"kpromoted_recorded_pgtscans",
> +	"kpromoted_record_toptier",
> +	"kpromoted_record_added",
> +	"kpromoted_record_exists",
> +	"kpromoted_mig_right_node",
> +	"kpromoted_mig_non_lru",
> +	"kpromoted_mig_cold_old",
> +	"kpromoted_mig_cold_not_accessed",
> +	"kpromoted_mig_candidate",
> +	"kpromoted_mig_promoted",
> +	"kpromoted_mig_dropped",
>  #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
>  };
>  #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 2/4] mm: kpromoted: Hot page info collection and promotion daemon
  2025-03-06  5:45 ` [RFC PATCH 2/4] mm: kpromoted: Hot page info collection and promotion daemon Bharata B Rao
                     ` (4 preceding siblings ...)
  2025-03-24  3:35   ` Balbir Singh
@ 2025-03-24 13:43   ` Gregory Price
  2025-03-24 14:34     ` Bharata B Rao
  5 siblings, 1 reply; 38+ messages in thread
From: Gregory Price @ 2025-03-24 13:43 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: linux-kernel, linux-mm, AneeshKumar.KizhakeVeetil, Hasan.Maruf,
	Jonathan.Cameron, Michael.Day, akpm, dave.hansen, david,
	feng.tang, hannes, honggyu.kim, hughd, jhubbard, k.shutemov,
	kbusch, kmanaouil.dev, leesuyeon0506, leillc, liam.howlett,
	mgorman, mingo, nadav.amit, nphamcs, peterz, raghavendra.kt,
	riel, rientjes, rppt, shivankg, shy828301, sj, vbabka, weixugc,
	willy, ying.huang, ziy, dave, yuanchu, hyeonggon.yoo

On Thu, Mar 06, 2025 at 11:15:30AM +0530, Bharata B Rao wrote:
> kpromoted is a kernel daemon that accumulates hot page info
> from different sources and tries to promote pages from slow
> tiers to top tiers. One instance of this thread runs on each
> node that has CPUs.
>

Hot take: This sounds more like ktieringd not kpromoted

Is it reasonable to split the tracking a promotion logic into separate
interfaces?  This would let us manage, for example, rate-limiting in the
movement interface cleanly without having to care about the tiering
system(s) associated with it.

    my_tiering_magic():
        ... identify hot things ...
        promote(batch_folios, optional_data);
            -> kick daemon thread to wake up and do the promotion
	... continue async things ...

Optional data could be anything from target nodes or accessor info, but
not hotness information.

Then users at least get a clean interface for things like rate-limiting,
and everyone proposing their own take on tiering can consume it.  This
may also be useful for existing users (TPP, reclaim?, etc).

~Gregory


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 2/4] mm: kpromoted: Hot page info collection and promotion daemon
  2025-03-24 13:43   ` Gregory Price
@ 2025-03-24 14:34     ` Bharata B Rao
  0 siblings, 0 replies; 38+ messages in thread
From: Bharata B Rao @ 2025-03-24 14:34 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-kernel, linux-mm, AneeshKumar.KizhakeVeetil, Hasan.Maruf,
	Jonathan.Cameron, Michael.Day, akpm, dave.hansen, david,
	feng.tang, hannes, honggyu.kim, hughd, jhubbard, k.shutemov,
	kbusch, kmanaouil.dev, leesuyeon0506, leillc, liam.howlett,
	mgorman, mingo, nadav.amit, nphamcs, peterz, raghavendra.kt,
	riel, rientjes, rppt, shivankg, shy828301, sj, vbabka, weixugc,
	willy, ying.huang, ziy, dave, yuanchu, hyeonggon.yoo

On 24-Mar-25 7:13 PM, Gregory Price wrote:
> On Thu, Mar 06, 2025 at 11:15:30AM +0530, Bharata B Rao wrote:
>> kpromoted is a kernel daemon that accumulates hot page info
>> from different sources and tries to promote pages from slow
>> tiers to top tiers. One instance of this thread runs on each
>> node that has CPUs.
>>
> 
> Hot take: This sounds more like ktieringd not kpromoted

:-)

> 
> Is it reasonable to split the tracking a promotion logic into separate
> interfaces?  This would let us manage, for example, rate-limiting in the
> movement interface cleanly without having to care about the tiering
> system(s) associated with it.
> 
>      my_tiering_magic():
>          ... identify hot things ...
>          promote(batch_folios, optional_data);
>              -> kick daemon thread to wake up and do the promotion
> 	... continue async things ...
> 
> Optional data could be anything from target nodes or accessor info, but
> not hotness information.
> 
> Then users at least get a clean interface for things like rate-limiting,
> and everyone proposing their own take on tiering can consume it.  This
> may also be useful for existing users (TPP, reclaim?, etc).

Yes, Makes sense to split tracking and promotion logic into separate 
parts. There is no need for the promotion part to work with the hot page 
list that belongs to the tracking part as I have done in this RFC.

Raghu and I already saw that migration part is kind of duplicated in our 
patchsets(kmmscand and this) and were thinking of unifying them. Having 
clean separation as you suggest will be good.

Regards,
Bharata.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 1/4] mm: migrate: Allow misplaced migration without VMA too
  2025-03-24  2:55   ` Balbir Singh
@ 2025-03-24 14:51     ` Bharata B Rao
  0 siblings, 0 replies; 38+ messages in thread
From: Bharata B Rao @ 2025-03-24 14:51 UTC (permalink / raw)
  To: Balbir Singh, linux-kernel, linux-mm
  Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Jonathan.Cameron,
	Michael.Day, akpm, dave.hansen, david, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, k.shutemov, kbusch, kmanaouil.dev,
	leesuyeon0506, leillc, liam.howlett, mgorman, mingo, nadav.amit,
	nphamcs, peterz, raghavendra.kt, riel, rientjes, rppt, shivankg,
	shy828301, sj, vbabka, weixugc, willy, ying.huang, ziy, dave,
	yuanchu

On 24-Mar-25 8:25 AM, Balbir Singh wrote:
> On 3/6/25 16:45, Bharata B Rao wrote:
>> migrate_misplaced_folio_prepare() can be called from a
>> context where VMA isn't available. Allow the migration
>> to work from such contexts too.
>>
>> Signed-off-by: Bharata B Rao <bharata@amd.com>
>> ---
>>   mm/migrate.c | 5 +++--
>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index fb19a18892c8..5b21856a0dd0 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -2639,7 +2639,8 @@ static struct folio *alloc_misplaced_dst_folio(struct folio *src,
>>   
>>   /*
>>    * Prepare for calling migrate_misplaced_folio() by isolating the folio if
>> - * permitted. Must be called with the PTL still held.
>> + * permitted. Must be called with the PTL still held if called with a non-NULL
>> + * vma.
>>    */
>>   int migrate_misplaced_folio_prepare(struct folio *folio,
>>   		struct vm_area_struct *vma, int node)
>> @@ -2656,7 +2657,7 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
>>   		 * See folio_likely_mapped_shared() on possible imprecision
>>   		 * when we cannot easily detect if a folio is shared.
>>   		 */
>> -		if ((vma->vm_flags & VM_EXEC) &&
>> +		if (vma && (vma->vm_flags & VM_EXEC) &&
>>   		    folio_likely_mapped_shared(folio))
>>   			return -EACCES;
>>   
> 
> In the worst case, the absence of the vma would mean that we try to isolate
> and migrate a shared folio with executable pages. Are those a key target for the
> hot page migration?

I don't think they are a key target for hot page migration, but if 
shared executable pages (like shared library pages) have ended up in 
lower tier, it doesn't hurt to get them promoted to top tier, I would think.

Regards,
Bharata.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 0/4] Kernel daemon for detecting and promoting hot pages
  2025-03-06  5:45 [RFC PATCH 0/4] Kernel daemon for detecting and promoting hot pages Bharata B Rao
                   ` (5 preceding siblings ...)
  2025-03-18  5:28 ` Balbir Singh
@ 2025-03-25  8:18 ` Bharata B Rao
  6 siblings, 0 replies; 38+ messages in thread
From: Bharata B Rao @ 2025-03-25  8:18 UTC (permalink / raw)
  To: bharata
  Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Jonathan.Cameron,
	Michael.Day, akpm, dave.hansen, dave, david, feng.tang, gourry,
	hannes, honggyu.kim, hughd, hyeonggon.yoo, jhubbard, k.shutemov,
	kbusch, kmanaouil.dev, leesuyeon0506, leillc, liam.howlett,
	linux-kernel, linux-mm, mgorman, mingo, nadav.amit, nphamcs,
	peterz, raghavendra.kt, riel, rientjes, rppt, shivankg,
	shy828301, sj, vbabka, weixugc, willy, ying.huang, yuanchu, ziy

> Hi,
> 
> This is an attempt towards having a single subsystem that accumulates
> hot page information from lower memory tiers and does hot page
> promotion.
> 
> At the heart of this subsystem is a kernel daemon named kpromoted that
> does the following:
> 
> 1. Exposes an API that other subsystems which detect/generate memory
>    access information can use to inform the daemon about memory
>    accesses from lower memory tiers.
> 2. Maintains the list of hot pages and attempts to promote them to
>    toptiers.
> 
> Currently I have added AMD IBS driver as one source that provides
> page access information as an example. This driver feeds info to
> krpromoted in this RFC patchset.

FWIW, here are some numbers from krpomoted driven hotpage promotion with
IBS as the hotness source:

Test 1
======
Memory allocated on DRAM and CXL nodes explicitly and no demotion activity
is seen.

Benchmark details
-----------------
* Memory is allocated initially on DRAM and CXL nodes separately.
* Two threads: One accessing DRAM-allocated and other CXL-allocated memory.
* Divides memory area into regions and accesses pages within the region randomly
  and repetitively. In the test config shown below, the allocated memory is
  divided into regions of 1GB size and each such region is repetitively (512
  times) accessed with 21474836480 random accesses in each repetition).
* Benchmark score is time taken for accesses to complete, lower is better
* Data accesses from CXL node are expected to trigger promotion
* Test system has 2 DRAM nodes (128G each) and a CXL node (128G)

kernel.numa_balancing		2 for base, 0 for kpromoted
demotion			true
Threads run on			Node 1
Memory allocated on		Node 1(DRAM) and Node 2(CXL)
Initial allocation ratio	75% on DRAM
Allocated memory size		160G (mmap, MAP_POPULATE)
Initial memory on DRAM node	120G
Initial memory on CXL node	40G
Hot region size			1G
Acccess pattern			random
Access granularity		4K
Load/store ratio		50% loads + 50% stores
Number of accesses		21474836480
Nr access repetitions		512

Benchmark completion time
-------------------------
Base, NUMAB=2		261s
kpromoted-ibs, NUMAB=0	281s

Stats comparision
-----------------
				Base,NUMAB=2	kpromoted-IBS,NUMAB=0
pgdemote_kswapd			0		0
pgdemote_direct			0		0
numa_pte_updates		10485760	0
numa_hint_faults		4427809		0
numa_pages_migrated		388229		374765
kpromoted_recorded_accesses			1651130	/* nr accesses reported to kpromoted */
kpromoted_recorded_hwhints			1651130	/* nr accesses coming from IBS */
kpromoted_record_toptier			1269697	/* nr accesses from toptier/DRAM */
kpromoted_record_added				378090	/* nr accesses considered for promotion */
kpromoted_mig_promoted				374765	/* nr pages promoted */
hwhint_nr_events				1674227	/* nr events reported by IBS */
hwhint_dram_accesses				1269626	/* nr DRAM accesses reported by IBS */
hwhint_cxl_accesses				381435	/* nr Extmem (CXL) accesses reported by IBS */
hwhint_useful_samples				1651110	/* nr actionable samples as per IBS driver */


Test 2
======
Memory is allocated with DRAM and CXL nodes in the affinity mask with
MPOL_BIND + MPOL_F_NUMA_BALANCING.

Benchmark details
-----------------
* Initially, memory allocated spreads over from DRAM to CXL, involves demotion
* Single thread accesses the memory
* Divides memory area into regions and accesses pages within the region randomly
  and repetitively. In the test config shown below, the allocated memory is
  divided into regions of 1GB size and each such region is repetitively (512
  times) accessed with 21474836480 random accesses in each repetition).
* Benchmark score is time taken for accesses to complete, lower is better
* Data accesses from CXL node are expected to trigger promotion
* Test system has 2 DRAM nodes (128G each) and a CXL node (128G)

kernel.numa_balancing		2 for base, 0 for kpromoted
demotion			true
Threads run on			Node 1
Memory allocated on		Node 1(DRAM) and Node 2(CXL)
Allocated memory size		192G (mmap, MAP_POPULATE)
Hot region size			1G
Acccess pattern			random
Access granularity		4K
Load/store ratio		50% loads + 50% stores
Number of accesses		21474836480
Nr access repetitions		512

Benchmark completion time
-------------------------
Base, NUMAB=2		628s
kpromoted-ibs, NUMAB=0	626s

Stats comparision
-----------------
				Base,NUMAB=2	kpromoted-IBS,NUMAB=0
pgdemote_kswapd			73187		2196028
pgdemote_direct			0		0
numa_pte_updates		27511631	0
numa_hint_faults		10010852	0
numa_pages_migrated		14		611177	/* such low number of promotions is unexecpted in Base, Need to recheck */
kpromoted_recorded_accesses			1883570
kpromoted_recorded_hwhints			1883570
kpromoted_record_toptier			1262088
kpromoted_record_added				616273
kpromoted_mig_promoted				611077
hwhint_nr_events				1904619
hwhint_dram_accesses				1261758
hwhint_cxl_accesses				621428
hwhint_useful_samples				1883543


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 2/4] mm: kpromoted: Hot page info collection and promotion daemon
  2025-03-24  3:35   ` Balbir Singh
@ 2025-03-28  4:55     ` Bharata B Rao
  0 siblings, 0 replies; 38+ messages in thread
From: Bharata B Rao @ 2025-03-28  4:55 UTC (permalink / raw)
  To: Balbir Singh, linux-kernel, linux-mm
  Cc: AneeshKumar.KizhakeVeetil, Hasan.Maruf, Jonathan.Cameron,
	Michael.Day, akpm, dave.hansen, david, feng.tang, gourry, hannes,
	honggyu.kim, hughd, jhubbard, k.shutemov, kbusch, kmanaouil.dev,
	leesuyeon0506, leillc, liam.howlett, mgorman, mingo, nadav.amit,
	nphamcs, peterz, raghavendra.kt, riel, rientjes, rppt, shivankg,
	shy828301, sj, vbabka, weixugc, willy, ying.huang, ziy, dave,
	yuanchu, hyeonggon.yoo

Hi Balbir,

Sorry for the delay in response and thanks for the review...

On 24-Mar-25 9:05 AM, Balbir Singh wrote:
> On 3/6/25 16:45, Bharata B Rao wrote:
>> kpromoted is a kernel daemon that accumulates hot page info
>> from different sources and tries to promote pages from slow
>> tiers to top tiers. One instance of this thread runs on each
>> node that has CPUs.
>>
> 
> Could you please elaborate on what is slow vs top tier? A top tier uses
> adist (which is a combination of bandwidth and latency), so I am
> not sure the terminology here holds.

Slow is used to mean bottom tiers here as determined by the memory 
tiering hierarchy.

> 
>> Subsystems that generate hot page access info can report that
>> to kpromoted via this API:
>>
>> int kpromoted_record_access(u64 pfn, int nid, int src,
>> 			    unsigned long time)
>>
>> @pfn: The PFN of the memory accessed
>> @nid: The accessing NUMA node ID
>> @src: The temperature source (subsystem) that generated the
>>        access info
>> @time: The access time in jiffies
>>
>> Some temperature sources may not provide the nid from which
> 
> What is a temperature source?

Temperature source is a term used to refer to the subsystem that 
generates memory access information. For e.g. LRU subsystem that scans 
page tables for Accessed bit becomes a source.

> 
>> the page was accessed. This is true for sources that use
>> page table scanning for PTE Accessed bit. Currently the toptier
>> node to which such pages should be promoted to is hard coded.
>>
> 
> What would it take to make this flexible?

The context here is that sources that provide access information by 
scanning the PTE A bit wouldn't know from which node the access was 
done. Same is the case for kmmscand approach though Raghu has some 
heuristics to deduce the best possible toptier node to which a given 
page should be promoted. More details at 
https://lore.kernel.org/linux-mm/20250319193028.29514-1-raghavendra.kt@amd.com/

What kpromoted did for such cases is to just promote the pages to a node 
whose nid has been hard-coded for now (like 0 or 1 etc)

> 
>> Also, the access time provided some sources may at best be
>> considered approximate. This is especially true for hot pages
>> detected by PTE A bit scanning.
>>
>> kpromoted currently maintains the hot PFN records in hash lists
>> hashed by PFN value. Each record stores the following info:
>>
>> struct page_hotness_info {
>> 	unsigned long pfn;
>>
>> 	/* Time when this record was updated last */
>> 	unsigned long last_update;
>>
>> 	/*
>> 	 * Number of times this page was accessed in the
>> 	 * current window
>> 	 */
>> 	int frequency;
>>
>> 	/* Most recent access time */
>> 	unsigned long recency;
>>
>> 	/* Most recent access from this node */
>> 	int hot_node;
>>
>> 	struct hlist_node hnode;
>> };
>>
>> The way in which a page is categorized as hot enough to be
>> promoted is pretty primitive now.
>>
>> Signed-off-by: Bharata B Rao <bharata@amd.com>
>> ---
>>   include/linux/kpromoted.h     |  54 ++++++
>>   include/linux/mmzone.h        |   4 +
>>   include/linux/vm_event_item.h |  13 ++
>>   mm/Kconfig                    |   7 +
>>   mm/Makefile                   |   1 +
>>   mm/kpromoted.c                | 305 ++++++++++++++++++++++++++++++++++
>>   mm/mm_init.c                  |  10 ++
>>   mm/vmstat.c                   |  13 ++
>>   8 files changed, 407 insertions(+)
>>   create mode 100644 include/linux/kpromoted.h
>>   create mode 100644 mm/kpromoted.c
>>
>> diff --git a/include/linux/kpromoted.h b/include/linux/kpromoted.h
>> new file mode 100644
>> index 000000000000..2bef3d74f03a
>> --- /dev/null
>> +++ b/include/linux/kpromoted.h
>> @@ -0,0 +1,54 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _LINUX_KPROMOTED_H
>> +#define _LINUX_KPROMOTED_H
>> +
>> +#include <linux/types.h>
>> +#include <linux/init.h>
>> +#include <linux/workqueue_types.h>
>> +
>> +/* Page hotness temperature sources */
>> +enum kpromoted_src {
>> +	KPROMOTED_HW_HINTS,
>> +	KPROMOTED_PGTABLE_SCAN,
>> +};
>> +
>> +#ifdef CONFIG_KPROMOTED
>> +
>> +#define KPROMOTED_FREQ_WINDOW	(5 * MSEC_PER_SEC)
>> +
>> +/* 2 accesses within a window will make the page a promotion candidate */
>> +#define KPRMOTED_FREQ_THRESHOLD	2
>> +
> 
> Were these value derived empirically?

It is something I started with capture the notion of "repeated access".

> 
> 
>> +#define KPROMOTED_HASH_ORDER	16
>> +
>> +struct page_hotness_info {
>> +	unsigned long pfn;
>> +
>> +	/* Time when this record was updated last */
>> +	unsigned long last_update;
>> +
>> +	/*
>> +	 * Number of times this page was accessed in the
>> +	 * current window
>> +	 */
>> +	int frequency;
>> +
>> +	/* Most recent access time */
>> +	unsigned long recency;
>> +
>> +	/* Most recent access from this node */
>> +	int hot_node;
>> +	struct hlist_node hnode;
>> +};
>> +
>> +#define KPROMOTE_DELAY	MSEC_PER_SEC
>> +
>> +int kpromoted_record_access(u64 pfn, int nid, int src, unsigned long now);
>> +#else
>> +static inline int kpromoted_record_access(u64 pfn, int nid, int src,
>> +					  unsigned long now)
>> +{
>> +	return 0;
>> +}
>> +#endif /* CONFIG_KPROMOTED */
>> +#endif /* _LINUX_KPROMOTED_H */
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 9540b41894da..a5c4e789aa55 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -1459,6 +1459,10 @@ typedef struct pglist_data {
>>   #ifdef CONFIG_MEMORY_FAILURE
>>   	struct memory_failure_stats mf_stats;
>>   #endif
>> +#ifdef CONFIG_KPROMOTED
>> +	struct task_struct *kpromoted;
>> +	wait_queue_head_t kpromoted_wait;
>> +#endif
>>   } pg_data_t;
>>   
>>   #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
>> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
>> index f70d0958095c..b5823b037883 100644
>> --- a/include/linux/vm_event_item.h
>> +++ b/include/linux/vm_event_item.h
>> @@ -182,6 +182,19 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>>   		KSTACK_REST,
>>   #endif
>>   #endif /* CONFIG_DEBUG_STACK_USAGE */
>> +		KPROMOTED_RECORDED_ACCESSES,
>> +		KPROMOTED_RECORD_HWHINTS,
>> +		KPROMOTED_RECORD_PGTSCANS,
>> +		KPROMOTED_RECORD_TOPTIER,
>> +		KPROMOTED_RECORD_ADDED,
>> +		KPROMOTED_RECORD_EXISTS,
>> +		KPROMOTED_MIG_RIGHT_NODE,
>> +		KPROMOTED_MIG_NON_LRU,
>> +		KPROMOTED_MIG_COLD_OLD,
>> +		KPROMOTED_MIG_COLD_NOT_ACCESSED,
>> +		KPROMOTED_MIG_CANDIDATE,
>> +		KPROMOTED_MIG_PROMOTED,
>> +		KPROMOTED_MIG_DROPPED,
>>   		NR_VM_EVENT_ITEMS
>>   };
>>   
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 1b501db06417..ceaa462a0ce6 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -1358,6 +1358,13 @@ config PT_RECLAIM
>>   
>>   	  Note: now only empty user PTE page table pages will be reclaimed.
>>   
>> +config KPROMOTED
>> +	bool "Kernel hot page promotion daemon"
>> +	def_bool y
>> +	depends on NUMA && MIGRATION && MMU
>> +	help
>> +	  Promote hot pages from lower tier to top tier by using the
>> +	  memory access information provided by various sources.
>>   
>>   source "mm/damon/Kconfig"
>>   
>> diff --git a/mm/Makefile b/mm/Makefile
>> index 850386a67b3e..bf4f5f18f1f9 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -147,3 +147,4 @@ obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
>>   obj-$(CONFIG_EXECMEM) += execmem.o
>>   obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
>>   obj-$(CONFIG_PT_RECLAIM) += pt_reclaim.o
>> +obj-$(CONFIG_KPROMOTED) += kpromoted.o
>> diff --git a/mm/kpromoted.c b/mm/kpromoted.c
>> new file mode 100644
>> index 000000000000..2a8b8495b6b3
>> --- /dev/null
>> +++ b/mm/kpromoted.c
>> @@ -0,0 +1,305 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * kpromoted is a kernel thread that runs on each node that has CPU i,e.,
>> + * on regular nodes.
>> + *
>> + * Maintains list of hot pages from lower tiers and promotes them.
>> + */
>> +#include <linux/kpromoted.h>
>> +#include <linux/kthread.h>
>> +#include <linux/mutex.h>
>> +#include <linux/mmzone.h>
>> +#include <linux/migrate.h>
>> +#include <linux/memory-tiers.h>
>> +#include <linux/slab.h>
>> +#include <linux/sched.h>
>> +#include <linux/cpuhotplug.h>
>> +#include <linux/hashtable.h>
>> +
>> +static DEFINE_HASHTABLE(page_hotness_hash, KPROMOTED_HASH_ORDER);
>> +static struct mutex page_hotness_lock[1UL << KPROMOTED_HASH_ORDER];
>> +
>> +static int kpromote_page(struct page_hotness_info *phi)
>> +{
> 
> Why not just call it kpromote_folio?

Yes, can be called so.

> 
>> +	struct page *page = pfn_to_page(phi->pfn);
>> +	struct folio *folio;
>> +	int ret;
>> +
>> +	if (!page)
>> +		return 1;
> 
> Do we need to check for is_zone_device_page() here?

That and other checks are part of page_should_be_promoted() call just 
prior to attempting to promote.

> 
>> +
>> +	folio = page_folio(page);
>> +	ret = migrate_misplaced_folio_prepare(folio, NULL, phi->hot_node);
>> +	if (ret)
>> +		return 1;
>> +
>> +	return migrate_misplaced_folio(folio, phi->hot_node);
>> +}
> 
> 
> Could you please document the assumptions for kpromote_page(), what locks
> should be held? Does the ref count need to be incremented?

Yes, will document. However it doesn't expect folio refcount to be 
incremented as I am tracking hotpages via PFNs and not by using struct 
folios.

> 
>> +
>> +static int page_should_be_promoted(struct page_hotness_info *phi)
>> +{
>> +	struct page *page = pfn_to_online_page(phi->pfn);
>> +	unsigned long now = jiffies;
>> +	struct folio *folio;
>> +
>> +	if (!page || is_zone_device_page(page))
>> +		return false;
>> +
>> +	folio = page_folio(page);
>> +	if (!folio_test_lru(folio)) {
>> +		count_vm_event(KPROMOTED_MIG_NON_LRU);
>> +		return false;
>> +	}
>> +	if (folio_nid(folio) == phi->hot_node) {
>> +		count_vm_event(KPROMOTED_MIG_RIGHT_NODE);
>> +		return false;
>> +	}
>> +
>> +	/* If the page was hot a while ago, don't promote */
>> +	if ((now - phi->last_update) > 2 * msecs_to_jiffies(KPROMOTED_FREQ_WINDOW)) {
>> +		count_vm_event(KPROMOTED_MIG_COLD_OLD);
> 
> Shouldn't we update phi->last_update here?

Hmm I am not sure about updating from here where we are checking for 
migration feasibility. last_update records the time when the page was 
last accesed.

> 
>> +		return false;
>> +	}
>> +
>> +	/* If the page hasn't been accessed enough number of times, don't promote */
>> +	if (phi->frequency < KPRMOTED_FREQ_THRESHOLD) {
>> +		count_vm_event(KPROMOTED_MIG_COLD_NOT_ACCESSED);
>> +		return false;
>> +	}
>> +	return true;
>> +}
>> +
>> +/*
>> + * Go thro' page hotness information and migrate pages if required.
>> + *
>> + * Promoted pages are not longer tracked in the hot list.
>> + * Cold pages are pruned from the list as well.
>> + *
>> + * TODO: Batching could be done
>> + */
>> +static void kpromoted_migrate(pg_data_t *pgdat)
>> +{
>> +	int nid = pgdat->node_id;
>> +	struct page_hotness_info *phi;
>> +	struct hlist_node *tmp;
>> +	int nr_bkts = HASH_SIZE(page_hotness_hash);
>> +	int bkt;
>> +
>> +	for (bkt = 0; bkt < nr_bkts; bkt++) {
>> +		mutex_lock(&page_hotness_lock[bkt]);
>> +		hlist_for_each_entry_safe(phi, tmp, &page_hotness_hash[bkt], hnode) {
>> +			if (phi->hot_node != nid)
>> +				continue;
>> +
>> +			if (page_should_be_promoted(phi)) {
>> +				count_vm_event(KPROMOTED_MIG_CANDIDATE);
>> +				if (!kpromote_page(phi)) {
>> +					count_vm_event(KPROMOTED_MIG_PROMOTED);
>> +					hlist_del_init(&phi->hnode);
>> +					kfree(phi);
>> +				}
>> +			} else {
>> +				/*
>> +				 * Not a suitable page or cold page, stop tracking it.
>> +				 * TODO: Identify cold pages and drive demotion?
>> +				 */
>> +				count_vm_event(KPROMOTED_MIG_DROPPED);
>> +				hlist_del_init(&phi->hnode);
>> +				kfree(phi);
> 
> Won't existing demotion already handle this?

Yes it does. I had a note here to check if it makes sense to drive 
demotion of pages that are being dropped off from kpromoted tracking 
presumably becasue they aren't hot any longer.

> 
>> +			}
>> +		}
>> +		mutex_unlock(&page_hotness_lock[bkt]);
>> +	}
>> +}
>> +
> 
> It sounds like NUMA balancing, promotion and demotion can all act on parallel on
> these folios, if not could you clarify their relationship and dependency?

kpromoted tracks the hotness of PFNs. It goes through same steps that 
others use to isolate the pages prior to migration. So it is not 
possible to find a page that kpromoted wants to migrate being parallely 
considered by NUMAB for migration or vmscan for demotion. I don't see 
any obvious dependency here, but I can check in detail.

> 
> 
>> +static struct page_hotness_info *__kpromoted_lookup(unsigned long pfn, int bkt)
>> +{
>> +	struct page_hotness_info *phi;
>> +
>> +	hlist_for_each_entry(phi, &page_hotness_hash[bkt], hnode) {
>> +		if (phi->pfn == pfn)
>> +			return phi;
>> +	}
>> +	return NULL;
>> +}
>> +
>> +static struct page_hotness_info *kpromoted_lookup(unsigned long pfn, int bkt, unsigned long now)
>> +{
>> +	struct page_hotness_info *phi;
>> +
>> +	phi = __kpromoted_lookup(pfn, bkt);
>> +	if (!phi) {
>> +		phi = kzalloc(sizeof(struct page_hotness_info), GFP_KERNEL);
>> +		if (!phi)
>> +			return ERR_PTR(-ENOMEM);
>> +
>> +		phi->pfn = pfn;
>> +		phi->frequency = 1;
>> +		phi->last_update = now;
>> +		phi->recency = now;
>> +		hlist_add_head(&phi->hnode, &page_hotness_hash[bkt]);
>> +		count_vm_event(KPROMOTED_RECORD_ADDED);
>> +	} else {
>> +		count_vm_event(KPROMOTED_RECORD_EXISTS);
>> +	}
>> +	return phi;
>> +}
>> +
>> +/*
>> + * Called by subsystems that generate page hotness/access information.
>> + *
>> + * Records the memory access info for futher action by kpromoted.
>> + */
>> +int kpromoted_record_access(u64 pfn, int nid, int src, unsigned long now)
>> +{
>> +	struct page_hotness_info *phi;
>> +	struct page *page;
>> +	struct folio *folio;
>> +	int ret, bkt;
>> +
>> +	count_vm_event(KPROMOTED_RECORDED_ACCESSES);
>> +
>> +	switch (src) {
>> +	case KPROMOTED_HW_HINTS:
>> +		count_vm_event(KPROMOTED_RECORD_HWHINTS);
>> +		break;
>> +	case KPROMOTED_PGTABLE_SCAN:
>> +		count_vm_event(KPROMOTED_RECORD_PGTSCANS);
>> +		break;
>> +	default:
>> +		break;
>> +	}
>> +
>> +	/*
>> +	 * Record only accesses from lower tiers.
>> +	 * Assuming node having CPUs as toptier for now.
>> +	 */
>> +	if (node_is_toptier(pfn_to_nid(pfn))) {
>> +		count_vm_event(KPROMOTED_RECORD_TOPTIER);
>> +		return 0;
>> +	}
>> +
>> +	page = pfn_to_online_page(pfn);
>> +	if (!page || is_zone_device_page(page))
>> +		return 0;
>> +
>> +	folio = page_folio(page);
>> +	if (!folio_test_lru(folio))
>> +		return 0;
>> +
>> +	bkt = hash_min(pfn, KPROMOTED_HASH_ORDER);
>> +	mutex_lock(&page_hotness_lock[bkt]);
>> +	phi = kpromoted_lookup(pfn, bkt, now);
>> +	if (!phi) {
>> +		ret = PTR_ERR(phi);
>> +		goto out;
>> +	}
>> +
>> +	if ((phi->last_update - now) > msecs_to_jiffies(KPROMOTED_FREQ_WINDOW)) {
>> +		/* New window */
>> +		phi->frequency = 1; /* TODO: Factor in the history */
>> +		phi->last_update = now;
>> +	} else {
>> +		phi->frequency++;
>> +	}
>> +	phi->recency = now;
>> +
>> +	/*
>> +	 * TODOs:
>> +	 * 1. Source nid is hard-coded for some temperature sources
>> +	 * 2. Take action if hot_node changes - may be a shared page?
>> +	 * 3. Maintain node info for every access within the window?
>> +	 */
>> +	phi->hot_node = (nid == NUMA_NO_NODE) ? 1 : nid;
> 
> I don't understand why nid needs to be 1 if nid is NUMA_NODE_ID? Does
> it mean that it's being promoted to the top tier, the mix of hot_node,
> tier and nid is not very clear here.

As I mentioned earlier, if the access information wasn't accompanied by 
nid (which is specified by NUMA_NO_NODE), it will be promoted to a 
hard-code (currently) toptier node.

> 
>> +	mutex_unlock(&page_hotness_lock[bkt]);
>> +out:
>> +	return 0;
>> +}
>> +
>> +/*
>> + * Go through the accumulated mem_access_info and migrate
>> + * pages if required.
>> + */
>> +static void kpromoted_do_work(pg_data_t *pgdat)
>> +{
>> +	kpromoted_migrate(pgdat);
>> +}
>> +
>> +static inline bool kpromoted_work_requested(pg_data_t *pgdat)
>> +{
>> +	return false;
>> +}
>> +
>> +static int kpromoted(void *p)
>> +{
>> +	pg_data_t *pgdat = (pg_data_t *)p;
>> +	struct task_struct *tsk = current;
>> +	long timeout = msecs_to_jiffies(KPROMOTE_DELAY);
>> +
>> +	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
>> +
>> +	if (!cpumask_empty(cpumask))
>> +		set_cpus_allowed_ptr(tsk, cpumask);
>> +
>> +	while (!kthread_should_stop()) {
>> +		wait_event_timeout(pgdat->kpromoted_wait,
>> +				   kpromoted_work_requested(pgdat), timeout);
>> +		kpromoted_do_work(pgdat);
>> +	}
>> +	return 0;
>> +}
>> +
>> +static void kpromoted_run(int nid)
>> +{
>> +	pg_data_t *pgdat = NODE_DATA(nid);
>> +
>> +	if (pgdat->kpromoted)
>> +		return;
>> +
>> +	pgdat->kpromoted = kthread_run(kpromoted, pgdat, "kpromoted%d", nid);
>> +	if (IS_ERR(pgdat->kpromoted)) {
>> +		pr_err("Failed to start kpromoted on node %d\n", nid);
>> +		pgdat->kpromoted = NULL;
>> +	}
>> +}
>> +
>> +static int kpromoted_cpu_online(unsigned int cpu)
>> +{
>> +	int nid;
>> +
>> +	for_each_node_state(nid, N_CPU) {
>> +		pg_data_t *pgdat = NODE_DATA(nid);
>> +		const struct cpumask *mask;
>> +
>> +		mask = cpumask_of_node(pgdat->node_id);
>> +
>> +		if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
>> +			/* One of our CPUs online: restore mask */
>> +			if (pgdat->kpromoted)
>> +				set_cpus_allowed_ptr(pgdat->kpromoted, mask);
>> +	}
>> +	return 0;
>> +}
>> +
>> +static int __init kpromoted_init(void)
>> +{
>> +	int nid, ret, i;
>> +
>> +	ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
>> +					"mm/promotion:online",
>> +					kpromoted_cpu_online, NULL);
>> +	if (ret < 0) {
>> +		pr_err("kpromoted: failed to register hotplug callbacks.\n");
>> +		return ret;
>> +	}
>> +
>> +	for (i = 0; i < (1UL << KPROMOTED_HASH_ORDER); i++)
>> +		mutex_init(&page_hotness_lock[i]);
>> +
>> +	for_each_node_state(nid, N_CPU)
>> +		kpromoted_run(nid);
>> +
> 
> I think we need a dynamic way to disabling promotion at run time
> as well, right?

Myabe, but I understand that promotion is an activity that should be 
benefitial in general. What specific scenarios do you think would need 
explicit disabling of promotion?

Regards,
Bharata.


^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2025-03-28  4:55 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-03-06  5:45 [RFC PATCH 0/4] Kernel daemon for detecting and promoting hot pages Bharata B Rao
2025-03-06  5:45 ` [RFC PATCH 1/4] mm: migrate: Allow misplaced migration without VMA too Bharata B Rao
2025-03-06 12:13   ` David Hildenbrand
2025-03-07  3:00     ` Bharata B Rao
2025-03-06 17:24   ` Gregory Price
2025-03-06 17:45     ` Matthew Wilcox
2025-03-06 18:19       ` Gregory Price
2025-03-06 18:42         ` Matthew Wilcox
2025-03-06 20:03           ` Gregory Price
2025-03-24  2:55   ` Balbir Singh
2025-03-24 14:51     ` Bharata B Rao
2025-03-06  5:45 ` [RFC PATCH 2/4] mm: kpromoted: Hot page info collection and promotion daemon Bharata B Rao
2025-03-06 17:22   ` Mike Day
2025-03-07  3:27     ` Bharata B Rao
2025-03-13 16:44   ` Davidlohr Bueso
2025-03-17  3:39     ` Bharata B Rao
2025-03-17 15:05       ` Gregory Price
2025-03-17 16:22         ` Bharata B Rao
2025-03-17 18:24           ` Gregory Price
2025-03-13 20:36   ` Davidlohr Bueso
2025-03-17  3:49     ` Bharata B Rao
2025-03-14 15:28   ` Jonathan Cameron
2025-03-18  4:09     ` Bharata B Rao
2025-03-18 14:17       ` Jonathan Cameron
2025-03-24  3:35   ` Balbir Singh
2025-03-28  4:55     ` Bharata B Rao
2025-03-24 13:43   ` Gregory Price
2025-03-24 14:34     ` Bharata B Rao
2025-03-06  5:45 ` [RFC PATCH 3/4] x86: ibs: In-kernel IBS driver for memory access profiling Bharata B Rao
2025-03-14 15:38   ` Jonathan Cameron
2025-03-06  5:45 ` [RFC PATCH 4/4] x86: ibs: Enable IBS profiling for memory accesses Bharata B Rao
2025-03-16 22:00 ` [RFC PATCH 0/4] Kernel daemon for detecting and promoting hot pages SeongJae Park
2025-03-18  6:33   ` Raghavendra K T
2025-03-18 10:45   ` Bharata B Rao
2025-03-18  5:28 ` Balbir Singh
2025-03-20  9:07   ` Bharata B Rao
2025-03-21  6:19     ` Balbir Singh
2025-03-25  8:18 ` Bharata B Rao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox