linux-mm.kvack.org archive mirror
* [RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit
@ 2024-12-01 15:38 Raghavendra K T
  2024-12-01 15:38 ` [RFC PATCH V0 01/10] mm: Add kmmscand kernel daemon Raghavendra K T
                   ` (11 more replies)
  0 siblings, 12 replies; 18+ messages in thread
From: Raghavendra K T @ 2024-12-01 15:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel, gourry, nehagholkar, abhishekd, david,
	ying.huang, nphamcs, akpm, hannes, feng.tang, kbusch, bharata,
	Hasan.Maruf, sj
  Cc: willy, kirill.shutemov, mgorman, vbabka, hughd, rientjes,
	shy828301, Liam.Howlett, peterz, mingo, Raghavendra K T

Introduction:
=============
This patchset is an outcome of an ongoing collaboration between AMD and Meta.
Meta wanted to explore an alternative page promotion technique as they
observe high latency spikes in their workloads that access CXL memory.

In the current hot page promotion scheme, all the activities, including
process address space scanning, NUMA hint fault handling and page
migration, are performed in process context, i.e., the scanning overhead
is borne by the applications.

This is an early RFC patch series for (slow tier) CXL page promotion.
The approach in this patchset addresses the issue by adding PTE
Accessed (A) bit scanning.

Scanning is done by a global kernel thread which routinely scans all
the processes' address spaces and checks for accesses by reading the
PTE A bit. It then migrates/promotes the pages to the toptier node
(node 0 in the current approach).

Thus, the approach moves the overhead of scanning, NUMA hint faults and
migrations out of the process context.

Initial results show promising numbers on a microbenchmark.

Experiment:
============
Abench microbenchmark,
- Allocates 8GB/32GB of memory on CXL node
- 64 threads created, and each thread randomly accesses pages in 4K
  granularity.
- 512 iterations with a delay of 1 us between two successive iterations.

SUT: 512 CPU, 2 node 256GB, AMD EPYC.

3 runs, command:  abench -m 2 -d 1 -i 512 -s <size>

The benchmark measures how much time is taken to complete the task;
lower is better. The expectation is that CXL node memory is migrated
as fast as possible.

Base case:    6.11-rc6 w/ numab mode = 2 (hot page promotion is enabled).
Patched case: 6.11-rc6 w/ numab mode = 0 (NUMA balancing is disabled);
we expect the daemon to do the page promotion.
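
(Here "numab mode" is the NUMA balancing mode; assuming the standard sysctl
interface is used to select it, the two configurations correspond roughly to
"sysctl -w kernel.numa_balancing=2" for base and
"sysctl -w kernel.numa_balancing=0" for patched.)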

Result [*]:
========
         base                    patched
         time in sec  (%stdev)   time in sec  (%stdev)     %gain
 8GB     133.66       ( 0.38 )        113.77  ( 1.83 )     14.88
32GB     584.77       ( 0.19 )        542.79  ( 0.11 )      7.17
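
(%gain here is (base - patched) / base * 100; e.g., for the 8GB case,
(133.66 - 113.77) / 133.66 * 100 ~= 14.88.)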

[*] Please note that the current patchset applies to 6.13-rc, but these
results are older because the latest kernel has issues in populating CXL
node memory. Will email findings/a fix on that soon.

Overhead:
The times below are calculated using patch 10. The actual overhead for the
patched case may be even lower.

               (scan + migration)  time in sec
Total memory   base kernel    patched kernel       %gain
8GB             65.743          13.93              78.8114324
32GB           153.95          132.12              14.17992855

Breakup for 8GB         base    patched
numa_task_work_oh       0.883   0
numa_hf_migration_oh   64.86    0
kmmscand_scan_oh        0       2.74
kmmscand_migration_oh   0      11.19

Breakup for 32GB        base    patched
numa_task_work_oh       4.79     0
numa_hf_migration_oh   149.16    0
kmmscand_scan_oh         0      23.4
kmmscand_migration_oh    0     108.72
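
(In each breakup, the two non-zero rows add up to the totals above; e.g.,
2.74 + 11.19 = 13.93 sec for the patched 8GB case.)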

Limitations:
===========
The PTE A bit scanning approach lacks information about the exact
destination node to migrate to.

Notes/Observations on design/implementation/alternatives/TODOs
===============================================================
1. Fine-tuning scan throttling

2. Use migrate_balanced_pgdat() to balance the toptier node before migration,
   OR use migrate_misplaced_folio_prepare() directly.
   But it may need some optimization (e.g., invoke it only occasionally so
   that the overhead is not incurred for every migration).

3. Explore whether a separate PAGE_EXT flag is needed instead of reusing the
   PAGE_IDLE flag (cons: complicates PTE A bit handling in the system);
   practically, it does not look like a good idea.

4. Use timestamp-information-based migration (similar to numab mode=2)
   instead of migrating immediately when the PTE A bit is found set.
   (cons:
    - It will not be as accurate, since it is done outside of the process
      context.
    - The performance benefit may be lost.)

5. Explore whether we need to use PFN information + a hash list instead of a
   simple migration list. Here scanning is done directly on PFNs belonging
   to the CXL node.

6. Hold the PTE lock before migration.

7. Solve how to find the target toptier node for migration.

8. Use DAMON APIs, OR reuse the part of DAMON that already tracks the ranges
   of physical addresses accessed.

9. Gregory has nicely described some details/ideas on different approaches in
   [1] (development notes), in the context of promoting unmapped page cache
   folios.

10. SJ had pointed out concerns about kernel-thread-based approaches such as
    kstaled [2]. The current patchset tries to address that with simple
    algorithms to reduce CPU overhead. Migration throttling, running the
    daemon at NICE priority, and parallelizing migration with scanning could
    help further.

11. The toptier pages scanned can be used to assist the current NUMAB by
    providing information on hot VMAs.

Credits
=======
Thanks to Bharata, Johannes, Gregory, SJ, and Chris for their valuable
comments and support.

The kernel thread skeleton and some parts of the code are heavily inspired by
the khugepaged implementation and parts of the IBS patches from Bharata [3].

Looking forward to your comments on whether the current approach in this
*early* RFC looks promising, or whether there are any alternative ideas, etc.

Links:
[1] https://lore.kernel.org/lkml/20241127082201.1276-1-gourry@gourry.net/
[2] kstaled: https://lore.kernel.org/lkml/1317170947-17074-3-git-send-email-walken@google.com/#r
[3] https://lore.kernel.org/lkml/Y+Pj+9bbBbHpf6xM@hirez.programming.kicks-ass.net/

I might have unintentionally CCed more or fewer people than needed.

Raghavendra K T (10):
  mm: Add kmmscand kernel daemon
  mm: Maintain mm_struct list in the system
  mm: Scan the mm and create a migration list
  mm/migration: Migrate accessed folios to toptier node
  mm: Add throttling of mm scanning using scan_period
  mm: Add throttling of mm scanning using scan_size
  sysfs: Add sysfs support to tune scanning
  vmstat: Add vmstat counters
  trace/kmmscand: Add tracing of scanning and migration
  kmmscand: Add scanning

 fs/exec.c                     |    4 +
 include/linux/kmmscand.h      |   30 +
 include/linux/mm.h            |   14 +
 include/linux/mm_types.h      |    4 +
 include/linux/vm_event_item.h |   14 +
 include/trace/events/kmem.h   |   99 +++
 kernel/fork.c                 |    4 +
 kernel/sched/fair.c           |   13 +-
 mm/Kconfig                    |    7 +
 mm/Makefile                   |    1 +
 mm/huge_memory.c              |    1 +
 mm/kmmscand.c                 | 1144 +++++++++++++++++++++++++++++++++
 mm/memory.c                   |   12 +-
 mm/vmstat.c                   |   14 +
 14 files changed, 1352 insertions(+), 9 deletions(-)
 create mode 100644 include/linux/kmmscand.h
 create mode 100644 mm/kmmscand.c


base-commit: bcc8eda6d34934d80b96adb8dc4ff5dfc632a53a
-- 
2.39.3




* [RFC PATCH V0 01/10] mm: Add kmmscand kernel daemon
  2024-12-01 15:38 [RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit Raghavendra K T
@ 2024-12-01 15:38 ` Raghavendra K T
  2024-12-01 15:38 ` [RFC PATCH V0 02/10] mm: Maintain mm_struct list in the system Raghavendra K T
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Raghavendra K T @ 2024-12-01 15:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel, gourry, nehagholkar, abhishekd, david,
	ying.huang, nphamcs, akpm, hannes, feng.tang, kbusch, bharata,
	Hasan.Maruf, sj
  Cc: willy, kirill.shutemov, mgorman, vbabka, hughd, rientjes,
	shy828301, Liam.Howlett, peterz, mingo, Raghavendra K T

Add a skeleton to support scanning and migration.
Also add a config option for the same.

High level design:

While (1):
  scan the slowtier pages belonging to the VMAs of a task
  add them to a migration list
  migrate the scanned pages to node 0 (default)

The overall code is heavily influenced by khugepaged design.

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 mm/Kconfig    |   7 ++
 mm/Makefile   |   1 +
 mm/kmmscand.c | 182 ++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 190 insertions(+)
 create mode 100644 mm/kmmscand.c

diff --git a/mm/Kconfig b/mm/Kconfig
index 84000b016808..a0b5ab6a9b67 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -740,6 +740,13 @@ config KSM
 	  until a program has madvised that an area is MADV_MERGEABLE, and
 	  root has set /sys/kernel/mm/ksm/run to 1 (if CONFIG_SYSFS is set).
 
+config KMMSCAND
+	bool "Enable PTE A bit scanning and Migration"
+	depends on NUMA_BALANCING
+	help
+	  Enable PTE A bit scanning of pages. Accessed CXL pages are migrated to
+	  a regular NUMA node (node 0 by default).
+
 config DEFAULT_MMAP_MIN_ADDR
 	int "Low address space to protect from user allocation"
 	depends on MMU
diff --git a/mm/Makefile b/mm/Makefile
index dba52bb0da8a..1b6b00e39d12 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -94,6 +94,7 @@ obj-$(CONFIG_FAIL_PAGE_ALLOC) += fail_page_alloc.o
 obj-$(CONFIG_MEMTEST)		+= memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_NUMA) += memory-tiers.o
+obj-$(CONFIG_KMMSCAND) += kmmscand.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
diff --git a/mm/kmmscand.c b/mm/kmmscand.c
new file mode 100644
index 000000000000..23cf5638fe10
--- /dev/null
+++ b/mm/kmmscand.c
@@ -0,0 +1,182 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/mm.h>
+#include <linux/mm_types.h>
+#include <linux/sched.h>
+#include <linux/sched/mm.h>
+#include <linux/mmu_notifier.h>
+#include <linux/swap.h>
+#include <linux/mm_inline.h>
+#include <linux/kthread.h>
+#include <linux/string.h>
+#include <linux/delay.h>
+#include <linux/cleanup.h>
+
+#include <asm/pgalloc.h>
+#include "internal.h"
+
+
+static struct task_struct *kmmscand_thread __read_mostly;
+static DEFINE_MUTEX(kmmscand_mutex);
+
+/* How long to pause between two scan and migration cycle */
+static unsigned int kmmscand_scan_sleep_ms __read_mostly = 16;
+
+/* Max number of mms to scan in one scan and migration cycle */
+#define KMMSCAND_MMS_TO_SCAN	(4 * 1024UL)
+static unsigned long kmmscand_mms_to_scan __read_mostly = KMMSCAND_MMS_TO_SCAN;
+
+volatile bool kmmscand_scan_enabled = true;
+static bool need_wakeup;
+
+static unsigned long kmmscand_sleep_expire;
+
+static DECLARE_WAIT_QUEUE_HEAD(kmmscand_wait);
+
+struct kmmscand_scan {
+	struct list_head mm_head;
+};
+
+struct kmmscand_scan kmmscand_scan = {
+	.mm_head = LIST_HEAD_INIT(kmmscand_scan.mm_head),
+};
+
+static int kmmscand_has_work(void)
+{
+	return !list_empty(&kmmscand_scan.mm_head);
+}
+
+static bool kmmscand_should_wakeup(void)
+{
+	bool wakeup =  kthread_should_stop() || need_wakeup ||
+	       time_after_eq(jiffies, kmmscand_sleep_expire);
+	if (need_wakeup)
+		need_wakeup = false;
+
+	return wakeup;
+}
+
+static void kmmscand_wait_work(void)
+{
+	if (kmmscand_has_work()) {
+		const unsigned long scan_sleep_jiffies =
+			msecs_to_jiffies(kmmscand_scan_sleep_ms);
+
+		if (!scan_sleep_jiffies)
+			return;
+
+		kmmscand_sleep_expire = jiffies + scan_sleep_jiffies;
+		wait_event_timeout(kmmscand_wait,
+					     kmmscand_should_wakeup(),
+					     scan_sleep_jiffies);
+		return;
+	}
+}
+
+static void kmmscand_migrate_folio(void)
+{
+}
+
+static unsigned long kmmscand_scan_mm_slot(void)
+{
+	/* placeholder for scanning */
+	msleep(100);
+	return 0;
+}
+
+static void kmmscand_do_scan(void)
+{
+	unsigned long iter = 0, mms_to_scan;
+
+	mms_to_scan = READ_ONCE(kmmscand_mms_to_scan);
+
+	while (true) {
+		cond_resched();
+
+		if (unlikely(kthread_should_stop()) || !READ_ONCE(kmmscand_scan_enabled))
+			break;
+
+		if (kmmscand_has_work())
+			kmmscand_scan_mm_slot();
+
+		kmmscand_migrate_folio();
+		iter++;
+		if (iter >= mms_to_scan)
+			break;
+	}
+}
+
+static int kmmscand(void *none)
+{
+	for (;;) {
+		if (unlikely(kthread_should_stop()))
+			break;
+
+		kmmscand_do_scan();
+
+		while (!READ_ONCE(kmmscand_scan_enabled)) {
+			cpu_relax();
+			kmmscand_wait_work();
+		}
+
+		kmmscand_wait_work();
+	}
+	return 0;
+}
+
+static int start_kmmscand(void)
+{
+	int err = 0;
+
+	guard(mutex)(&kmmscand_mutex);
+
+	/* Some one already succeeded in starting daemon */
+	if (kmmscand_thread)
+		goto end;
+
+	kmmscand_thread = kthread_run(kmmscand, NULL, "kmmscand");
+	if (IS_ERR(kmmscand_thread)) {
+		pr_err("kmmscand: kthread_run(kmmscand) failed\n");
+		err = PTR_ERR(kmmscand_thread);
+		kmmscand_thread = NULL;
+		goto end;
+	} else {
+		pr_info("kmmscand: Successfully started kmmscand");
+	}
+
+	if (!list_empty(&kmmscand_scan.mm_head))
+		wake_up_interruptible(&kmmscand_wait);
+
+end:
+	return err;
+}
+
+static int stop_kmmscand(void)
+{
+	int err = 0;
+
+	guard(mutex)(&kmmscand_mutex);
+
+	if (kmmscand_thread) {
+		kthread_stop(kmmscand_thread);
+		kmmscand_thread = NULL;
+	}
+
+	return err;
+}
+
+static int __init kmmscand_init(void)
+{
+	int err;
+
+	err = start_kmmscand();
+	if (err)
+		goto err_kmmscand;
+
+	return 0;
+
+err_kmmscand:
+	stop_kmmscand();
+
+	return err;
+}
+subsys_initcall(kmmscand_init);
-- 
2.39.3




* [RFC PATCH V0 02/10] mm: Maintain mm_struct list in the system
  2024-12-01 15:38 [RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit Raghavendra K T
  2024-12-01 15:38 ` [RFC PATCH V0 01/10] mm: Add kmmscand kernel daemon Raghavendra K T
@ 2024-12-01 15:38 ` Raghavendra K T
  2024-12-01 15:38 ` [RFC PATCH V0 03/10] mm: Scan the mm and create a migration list Raghavendra K T
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Raghavendra K T @ 2024-12-01 15:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel, gourry, nehagholkar, abhishekd, david,
	ying.huang, nphamcs, akpm, hannes, feng.tang, kbusch, bharata,
	Hasan.Maruf, sj
  Cc: willy, kirill.shutemov, mgorman, vbabka, hughd, rientjes,
	shy828301, Liam.Howlett, peterz, mingo, Raghavendra K T,
	linux-fsdevel

Add a hook in the fork and exec path to link mm_struct.
Reuse the mm_slot infrastructure to aid insert and lookup of mm_struct.

CC: linux-fsdevel@vger.kernel.org
Suggested-by: Bharata B Rao <bharata@amd.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 fs/exec.c                |  4 ++
 include/linux/kmmscand.h | 30 ++++++++++++++
 kernel/fork.c            |  4 ++
 mm/kmmscand.c            | 86 +++++++++++++++++++++++++++++++++++++++-
 4 files changed, 123 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/kmmscand.h

diff --git a/fs/exec.c b/fs/exec.c
index 98cb7ba9983c..bd72107b2ab1 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -68,6 +68,7 @@
 #include <linux/user_events.h>
 #include <linux/rseq.h>
 #include <linux/ksm.h>
+#include <linux/kmmscand.h>
 
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -274,6 +275,8 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
 	if (err)
 		goto err_ksm;
 
+	kmmscand_execve(mm);
+
 	/*
 	 * Place the stack at the largest stack address the architecture
 	 * supports. Later, we'll move this to an appropriate place. We don't
@@ -296,6 +299,7 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
 	return 0;
 err:
 	ksm_exit(mm);
+	kmmscand_exit(mm);
 err_ksm:
 	mmap_write_unlock(mm);
 err_free:
diff --git a/include/linux/kmmscand.h b/include/linux/kmmscand.h
new file mode 100644
index 000000000000..b120c65ee8c6
--- /dev/null
+++ b/include/linux/kmmscand.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_KMMSCAND_H_
+#define _LINUX_KMMSCAND_H_
+
+#ifdef CONFIG_KMMSCAND
+extern void __kmmscand_enter(struct mm_struct *mm);
+extern void __kmmscand_exit(struct mm_struct *mm);
+
+static inline void kmmscand_execve(struct mm_struct *mm)
+{
+	__kmmscand_enter(mm);
+}
+
+static inline void kmmscand_fork(struct mm_struct *mm, struct mm_struct *oldmm)
+{
+	__kmmscand_enter(mm);
+}
+
+static inline void kmmscand_exit(struct mm_struct *mm)
+{
+	__kmmscand_exit(mm);
+}
+#else /* !CONFIG_KMMSCAND */
+static inline void __kmmscand_enter(struct mm_struct *mm) {}
+static inline void __kmmscand_exit(struct mm_struct *mm) {}
+static inline void kmmscand_execve(struct mm_struct *mm) {}
+static inline void kmmscand_fork(struct mm_struct *mm, struct mm_struct *oldmm) {}
+static inline void kmmscand_exit(struct mm_struct *mm) {}
+#endif
+#endif /* _LINUX_KMMSCAND_H_ */
diff --git a/kernel/fork.c b/kernel/fork.c
index 1450b461d196..812b0032592e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -85,6 +85,7 @@
 #include <linux/user-return-notifier.h>
 #include <linux/oom.h>
 #include <linux/khugepaged.h>
+#include <linux/kmmscand.h>
 #include <linux/signalfd.h>
 #include <linux/uprobes.h>
 #include <linux/aio.h>
@@ -659,6 +660,8 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	mm->exec_vm = oldmm->exec_vm;
 	mm->stack_vm = oldmm->stack_vm;
 
+	kmmscand_fork(mm, oldmm);
+
 	/* Use __mt_dup() to efficiently build an identical maple tree. */
 	retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL);
 	if (unlikely(retval))
@@ -1350,6 +1353,7 @@ static inline void __mmput(struct mm_struct *mm)
 	exit_aio(mm);
 	ksm_exit(mm);
 	khugepaged_exit(mm); /* must run before exit_mmap */
+	kmmscand_exit(mm);
 	exit_mmap(mm);
 	mm_put_huge_zero_folio(mm);
 	set_mm_exe_file(mm, NULL);
diff --git a/mm/kmmscand.c b/mm/kmmscand.c
index 23cf5638fe10..957128d4e425 100644
--- a/mm/kmmscand.c
+++ b/mm/kmmscand.c
@@ -7,13 +7,14 @@
 #include <linux/swap.h>
 #include <linux/mm_inline.h>
 #include <linux/kthread.h>
+#include <linux/kmmscand.h>
 #include <linux/string.h>
 #include <linux/delay.h>
 #include <linux/cleanup.h>
 
 #include <asm/pgalloc.h>
 #include "internal.h"
-
+#include "mm_slot.h"
 
 static struct task_struct *kmmscand_thread __read_mostly;
 static DEFINE_MUTEX(kmmscand_mutex);
@@ -30,10 +31,21 @@ static bool need_wakeup;
 
 static unsigned long kmmscand_sleep_expire;
 
+static DEFINE_SPINLOCK(kmmscand_mm_lock);
 static DECLARE_WAIT_QUEUE_HEAD(kmmscand_wait);
 
+#define KMMSCAND_SLOT_HASH_BITS 10
+static DEFINE_READ_MOSTLY_HASHTABLE(kmmscand_slots_hash, KMMSCAND_SLOT_HASH_BITS);
+
+static struct kmem_cache *kmmscand_slot_cache __read_mostly;
+
+struct kmmscand_mm_slot {
+	struct mm_slot slot;
+};
+
 struct kmmscand_scan {
 	struct list_head mm_head;
+	struct kmmscand_mm_slot *mm_slot;
 };
 
 struct kmmscand_scan kmmscand_scan = {
@@ -76,6 +88,11 @@ static void kmmscand_migrate_folio(void)
 {
 }
 
+static inline int kmmscand_test_exit(struct mm_struct *mm)
+{
+	return atomic_read(&mm->mm_users) == 0;
+}
+
 static unsigned long kmmscand_scan_mm_slot(void)
 {
 	/* placeholder for scanning */
@@ -123,6 +140,65 @@ static int kmmscand(void *none)
 	return 0;
 }
 
+static inline void kmmscand_destroy(void)
+{
+	kmem_cache_destroy(kmmscand_slot_cache);
+}
+
+void __kmmscand_enter(struct mm_struct *mm)
+{
+	struct kmmscand_mm_slot *kmmscand_slot;
+	struct mm_slot *slot;
+	int wakeup;
+
+	/* __kmmscand_exit() must not run from under us */
+	VM_BUG_ON_MM(kmmscand_test_exit(mm), mm);
+
+	kmmscand_slot = mm_slot_alloc(kmmscand_slot_cache);
+
+	if (!kmmscand_slot)
+		return;
+
+	slot = &kmmscand_slot->slot;
+
+	spin_lock(&kmmscand_mm_lock);
+	mm_slot_insert(kmmscand_slots_hash, mm, slot);
+
+	wakeup = list_empty(&kmmscand_scan.mm_head);
+	list_add_tail(&slot->mm_node, &kmmscand_scan.mm_head);
+	spin_unlock(&kmmscand_mm_lock);
+
+	mmgrab(mm);
+	if (wakeup)
+		wake_up_interruptible(&kmmscand_wait);
+}
+
+void __kmmscand_exit(struct mm_struct *mm)
+{
+	struct kmmscand_mm_slot *mm_slot;
+	struct mm_slot *slot;
+	int free = 0;
+
+	spin_lock(&kmmscand_mm_lock);
+	slot = mm_slot_lookup(kmmscand_slots_hash, mm);
+	mm_slot = mm_slot_entry(slot, struct kmmscand_mm_slot, slot);
+	if (mm_slot && kmmscand_scan.mm_slot != mm_slot) {
+		hash_del(&slot->hash);
+		list_del(&slot->mm_node);
+		free = 1;
+	}
+
+	spin_unlock(&kmmscand_mm_lock);
+
+	if (free) {
+		mm_slot_free(kmmscand_slot_cache, mm_slot);
+		mmdrop(mm);
+	} else if (mm_slot) {
+		mmap_write_lock(mm);
+		mmap_write_unlock(mm);
+	}
+}
+
 static int start_kmmscand(void)
 {
 	int err = 0;
@@ -168,6 +244,13 @@ static int __init kmmscand_init(void)
 {
 	int err;
 
+	kmmscand_slot_cache = KMEM_CACHE(kmmscand_mm_slot, 0);
+
+	if (!kmmscand_slot_cache) {
+		pr_err("kmmscand: kmem_cache error");
+		return -ENOMEM;
+	}
+
 	err = start_kmmscand();
 	if (err)
 		goto err_kmmscand;
@@ -176,6 +259,7 @@ static int __init kmmscand_init(void)
 
 err_kmmscand:
 	stop_kmmscand();
+	kmmscand_destroy();
 
 	return err;
 }
-- 
2.39.3




* [RFC PATCH V0 03/10] mm: Scan the mm and create a migration list
  2024-12-01 15:38 [RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit Raghavendra K T
  2024-12-01 15:38 ` [RFC PATCH V0 01/10] mm: Add kmmscand kernel daemon Raghavendra K T
  2024-12-01 15:38 ` [RFC PATCH V0 02/10] mm: Maintain mm_struct list in the system Raghavendra K T
@ 2024-12-01 15:38 ` Raghavendra K T
  2024-12-01 15:38 ` [RFC PATCH V0 04/10] mm/migration: Migrate accessed folios to toptier node Raghavendra K T
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Raghavendra K T @ 2024-12-01 15:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel, gourry, nehagholkar, abhishekd, david,
	ying.huang, nphamcs, akpm, hannes, feng.tang, kbusch, bharata,
	Hasan.Maruf, sj
  Cc: willy, kirill.shutemov, mgorman, vbabka, hughd, rientjes,
	shy828301, Liam.Howlett, peterz, mingo, Raghavendra K T

Since we already have the list of mm_structs in the system, add the scanning
logic: walk the VMAs of each mm_struct and scan all the pages associated
with them.

In the scan path: check for recently accessed pages (folios) belonging to
slowtier nodes, and add all those folios to a migration list.

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 mm/kmmscand.c | 268 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 264 insertions(+), 4 deletions(-)

diff --git a/mm/kmmscand.c b/mm/kmmscand.c
index 957128d4e425..0496359d07f5 100644
--- a/mm/kmmscand.c
+++ b/mm/kmmscand.c
@@ -4,12 +4,19 @@
 #include <linux/sched.h>
 #include <linux/sched/mm.h>
 #include <linux/mmu_notifier.h>
+#include <linux/rmap.h>
+#include <linux/pagewalk.h>
+#include <linux/page_ext.h>
+#include <linux/page_idle.h>
+#include <linux/page_table_check.h>
+#include <linux/pagemap.h>
 #include <linux/swap.h>
 #include <linux/mm_inline.h>
 #include <linux/kthread.h>
 #include <linux/kmmscand.h>
+#include <linux/memory-tiers.h>
+#include <linux/mempolicy.h>
 #include <linux/string.h>
-#include <linux/delay.h>
 #include <linux/cleanup.h>
 
 #include <asm/pgalloc.h>
@@ -32,6 +39,7 @@ static bool need_wakeup;
 static unsigned long kmmscand_sleep_expire;
 
 static DEFINE_SPINLOCK(kmmscand_mm_lock);
+static DEFINE_SPINLOCK(kmmscand_migrate_lock);
 static DECLARE_WAIT_QUEUE_HEAD(kmmscand_wait);
 
 #define KMMSCAND_SLOT_HASH_BITS 10
@@ -41,6 +49,7 @@ static struct kmem_cache *kmmscand_slot_cache __read_mostly;
 
 struct kmmscand_mm_slot {
 	struct mm_slot slot;
+	long address;
 };
 
 struct kmmscand_scan {
@@ -52,6 +61,21 @@ struct kmmscand_scan kmmscand_scan = {
 	.mm_head = LIST_HEAD_INIT(kmmscand_scan.mm_head),
 };
 
+struct kmmscand_migrate_list {
+	struct list_head migrate_head;
+};
+
+struct kmmscand_migrate_list kmmscand_migrate_list = {
+	.migrate_head = LIST_HEAD_INIT(kmmscand_migrate_list.migrate_head),
+};
+
+struct kmmscand_migrate_info {
+	struct list_head migrate_node;
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	struct folio *folio;
+	unsigned long address;
+};
 static int kmmscand_has_work(void)
 {
 	return !list_empty(&kmmscand_scan.mm_head);
@@ -84,8 +108,140 @@ static void kmmscand_wait_work(void)
 	}
 }
 
-static void kmmscand_migrate_folio(void)
+static bool kmmscand_eligible_srcnid(int nid)
 {
+	if (!node_is_toptier(nid))
+		return true;
+	return false;
+}
+
+static bool folio_idle_clear_pte_refs_one(struct folio *folio,
+					 struct vm_area_struct *vma,
+					 unsigned long addr,
+					 pte_t *ptep)
+{
+	bool referenced = false;
+	struct mm_struct *mm = vma->vm_mm;
+	pmd_t *pmd = pmd_off(mm, addr);
+
+	if (ptep) {
+		if (ptep_clear_young_notify(vma, addr, ptep))
+			referenced = true;
+	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+		if (!pmd_present(*pmd))
+			WARN_ON_ONCE(1);
+		if (pmdp_clear_young_notify(vma, addr, pmd))
+			referenced = true;
+	} else {
+		WARN_ON_ONCE(1);
+	}
+
+	if (referenced) {
+		folio_clear_idle(folio);
+		folio_set_young(folio);
+	}
+	return true;
+}
+
+static void page_idle_clear_pte_refs(struct page *page, pte_t *pte, struct mm_walk *walk)
+{
+	bool need_lock;
+	struct folio *folio =  page_folio(page);
+	unsigned long address;
+
+	if (!folio_mapped(folio) || !folio_raw_mapping(folio))
+		return;
+
+	need_lock = !folio_test_anon(folio) || folio_test_ksm(folio);
+	if (need_lock && !folio_trylock(folio))
+		return;
+	address = vma_address(walk->vma, page_pgoff(folio, page), compound_nr(page));
+	VM_BUG_ON_VMA(address == -EFAULT, vma);
+	folio_idle_clear_pte_refs_one(folio, walk->vma, address, pte);
+
+	if (need_lock)
+		folio_unlock(folio);
+}
+
+static int hot_vma_idle_pte_entry(pte_t *pte,
+				 unsigned long addr,
+				 unsigned long next,
+				 struct mm_walk *walk)
+{
+	struct page *page;
+	struct folio *folio;
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	struct kmmscand_migrate_info *info;
+	struct kmmscand_migrate_list *migrate_list = walk->private;
+	int srcnid;
+
+	pte_t pteval = ptep_get(pte);
+
+	if (pte_none(pteval))
+		return 1;
+	vma = walk->vma;
+	mm = vma->vm_mm;
+	page = pte_page(*pte);
+
+	page_idle_clear_pte_refs(page, pte, walk);
+
+	folio = page_folio(page);
+	folio_get(folio);
+
+	if (!folio || folio_is_zone_device(folio)) {
+		folio_put(folio);
+		return 1;
+	}
+
+	srcnid = folio_nid(folio);
+
+	if (!folio_test_idle(folio) || folio_test_young(folio) ||
+			mmu_notifier_test_young(mm, addr) ||
+			folio_test_referenced(folio) || pte_young(pteval)) {
+
+		/* Do not try to promote pages from regular nodes */
+		if (!kmmscand_eligible_srcnid(srcnid))
+			goto end;
+
+		info = kzalloc(sizeof(struct kmmscand_migrate_info), GFP_KERNEL);
+		if (info && migrate_list) {
+
+			info->mm = mm;
+			info->vma = vma;
+			info->folio = folio;
+
+			spin_lock(&kmmscand_migrate_lock);
+			list_add_tail(&info->migrate_node, &migrate_list->migrate_head);
+			spin_unlock(&kmmscand_migrate_lock);
+		}
+	}
+end:
+	folio_set_idle(folio);
+	folio_put(folio);
+	return 0;
+}
+
+static const struct mm_walk_ops hot_vma_set_idle_ops = {
+	.pte_entry = hot_vma_idle_pte_entry,
+	.walk_lock = PGWALK_RDLOCK,
+};
+
+static void kmmscand_walk_page_vma(struct vm_area_struct *vma)
+{
+	if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
+	    is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) {
+		return;
+	}
+
+	if (!vma->vm_mm ||
+	    (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ)))
+		return;
+
+	if (!vma_is_accessible(vma))
+		return;
+
+	walk_page_vma(vma, &hot_vma_set_idle_ops, &kmmscand_migrate_list);
 }
 
 static inline int kmmscand_test_exit(struct mm_struct *mm)
@@ -93,10 +249,113 @@ static inline int kmmscand_test_exit(struct mm_struct *mm)
 	return atomic_read(&mm->mm_users) == 0;
 }
 
+static void kmmscand_collect_mm_slot(struct kmmscand_mm_slot *mm_slot)
+{
+	struct mm_slot *slot = &mm_slot->slot;
+	struct mm_struct *mm = slot->mm;
+
+	lockdep_assert_held(&kmmscand_mm_lock);
+
+	if (kmmscand_test_exit(mm)) {
+		/* free mm_slot */
+		hash_del(&slot->hash);
+		list_del(&slot->mm_node);
+
+		mm_slot_free(kmmscand_slot_cache, mm_slot);
+		mmdrop(mm);
+	}
+}
+
+static void kmmscand_migrate_folio(void)
+{
+}
+
 static unsigned long kmmscand_scan_mm_slot(void)
 {
-	/* placeholder for scanning */
-	msleep(100);
+	bool update_mmslot_info = false;
+
+	unsigned long address;
+
+	struct mm_slot *slot;
+	struct mm_struct *mm;
+	struct vma_iterator vmi;
+	struct vm_area_struct *vma = NULL;
+	struct kmmscand_mm_slot *mm_slot;
+
+	/* Retrieve mm */
+	spin_lock(&kmmscand_mm_lock);
+
+	if (kmmscand_scan.mm_slot) {
+		mm_slot = kmmscand_scan.mm_slot;
+		slot = &mm_slot->slot;
+		address = mm_slot->address;
+	} else {
+		slot = list_entry(kmmscand_scan.mm_head.next,
+				     struct mm_slot, mm_node);
+		mm_slot = mm_slot_entry(slot, struct kmmscand_mm_slot, slot);
+		address = mm_slot->address;
+		kmmscand_scan.mm_slot = mm_slot;
+	}
+
+	mm = slot->mm;
+
+	spin_unlock(&kmmscand_mm_lock);
+
+	if (unlikely(!mmap_read_trylock(mm)))
+		goto outerloop_mmap_lock;
+
+	if (unlikely(kmmscand_test_exit(mm)))
+		goto outerloop;
+
+
+	vma_iter_init(&vmi, mm, address);
+
+	for_each_vma(vmi, vma) {
+		/* Count the scanned pages here to decide exit */
+		kmmscand_walk_page_vma(vma);
+
+		address = vma->vm_end;
+	}
+
+	if (!vma)
+		address = 0;
+
+	update_mmslot_info = true;
+
+outerloop:
+	/* exit_mmap will destroy ptes after this */
+	mmap_read_unlock(mm);
+
+outerloop_mmap_lock:
+	spin_lock(&kmmscand_mm_lock);
+	VM_BUG_ON(kmmscand_scan.mm_slot != mm_slot);
+
+	if (update_mmslot_info)
+		mm_slot->address = address;
+	/*
+	 * Release the current mm_slot if this mm is about to die, or
+	 * if we scanned all vmas of this mm.
+	 */
+	if (unlikely(kmmscand_test_exit(mm)) || !vma) {
+		/*
+		 * Make sure that if mm_users is reaching zero while
+		 * kmmscand runs here, kmmscand_exit will find
+		 * mm_slot not pointing to the exiting mm.
+		 */
+		if (slot->mm_node.next != &kmmscand_scan.mm_head) {
+			slot = list_entry(slot->mm_node.next,
+					struct mm_slot, mm_node);
+			kmmscand_scan.mm_slot =
+				mm_slot_entry(slot, struct kmmscand_mm_slot, slot);
+
+		} else
+			kmmscand_scan.mm_slot = NULL;
+
+		if (kmmscand_test_exit(mm))
+			kmmscand_collect_mm_slot(mm_slot);
+	}
+
+	spin_unlock(&kmmscand_mm_lock);
 	return 0;
 }
 
@@ -159,6 +418,7 @@ void __kmmscand_enter(struct mm_struct *mm)
 	if (!kmmscand_slot)
 		return;
 
+	kmmscand_slot->address = 0;
 	slot = &kmmscand_slot->slot;
 
 	spin_lock(&kmmscand_mm_lock);
-- 
2.39.3




* [RFC PATCH V0 04/10] mm/migration: Migrate accessed folios to toptier node
  2024-12-01 15:38 [RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit Raghavendra K T
                   ` (2 preceding siblings ...)
  2024-12-01 15:38 ` [RFC PATCH V0 03/10] mm: Scan the mm and create a migration list Raghavendra K T
@ 2024-12-01 15:38 ` Raghavendra K T
  2024-12-01 15:38 ` [RFC PATCH V0 05/10] mm: Add throttling of mm scanning using scan_period Raghavendra K T
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Raghavendra K T @ 2024-12-01 15:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel, gourry, nehagholkar, abhishekd, david,
	ying.huang, nphamcs, akpm, hannes, feng.tang, kbusch, bharata,
	Hasan.Maruf, sj
  Cc: willy, kirill.shutemov, mgorman, vbabka, hughd, rientjes,
	shy828301, Liam.Howlett, peterz, mingo, Raghavendra K T

For each recently accessed slowtier folio in the migration list:
 - Isolate the folio from the LRU.
 - Migrate it to a regular node.

The rationale behind the whole migration is to speed up access to
recently accessed pages.

Limitation:
 The PTE A bit scanning approach lacks information about the exact
destination node to migrate to.

Reason:
 PROT_NONE hint-fault-based scanning is done in process context; there,
when the fault occurs, the source CPU of the faulting task is known, and
the time of the page access is also accurate.
Lacking the above information, migration here is done to node 0 by default.

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 mm/kmmscand.c | 178 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 178 insertions(+)

TBD: Before calling migrate_misplaced_folio(), we need to hold the PTL.
But since we are not coming from the fault path, this is tricky. We need
to fix this before the final patch.

diff --git a/mm/kmmscand.c b/mm/kmmscand.c
index 0496359d07f5..3b4453b053f4 100644
--- a/mm/kmmscand.c
+++ b/mm/kmmscand.c
@@ -4,6 +4,7 @@
 #include <linux/sched.h>
 #include <linux/sched/mm.h>
 #include <linux/mmu_notifier.h>
+#include <linux/migrate.h>
 #include <linux/rmap.h>
 #include <linux/pagewalk.h>
 #include <linux/page_ext.h>
@@ -36,7 +37,15 @@ static unsigned long kmmscand_mms_to_scan __read_mostly = KMMSCAND_MMS_TO_SCAN;
 volatile bool kmmscand_scan_enabled = true;
 static bool need_wakeup;
 
+/* mm of the migrating folio entry */
+static struct mm_struct *kmmscand_cur_migrate_mm;
+
+/* Migration list is manipulated underneath because of mm_exit */
+static bool  kmmscand_migration_list_dirty;
+
 static unsigned long kmmscand_sleep_expire;
+#define KMMSCAND_DEFAULT_TARGET_NODE	(0)
+static int kmmscand_target_node = KMMSCAND_DEFAULT_TARGET_NODE;
 
 static DEFINE_SPINLOCK(kmmscand_mm_lock);
 static DEFINE_SPINLOCK(kmmscand_migrate_lock);
@@ -115,6 +124,107 @@ static bool kmmscand_eligible_srcnid(int nid)
 	return false;
 }
 
+/*
+ * Do not know yet what info to pass in the future to make a
+ * decision on the target node. Keep it void * for now.
+ */
+static int kmmscand_get_target_node(void *data)
+{
+	return kmmscand_target_node;
+}
+
+static int kmmscand_migrate_misplaced_folio_prepare(struct folio *folio,
+		struct vm_area_struct *vma, int node)
+{
+	if (folio_is_file_lru(folio)) {
+		/*
+		 * Do not migrate file folios that are mapped in multiple
+		 * processes with execute permissions as they are probably
+		 * shared libraries.
+		 *
+		 * See folio_likely_mapped_shared() on possible imprecision
+		 * when we cannot easily detect if a folio is shared.
+		 */
+		if (vma && (vma->vm_flags & VM_EXEC) &&
+		    folio_likely_mapped_shared(folio))
+			return -EACCES;
+		/*
+		 * Do not migrate dirty folios as not all filesystems can move
+		 * dirty folios in MIGRATE_ASYNC mode which is a waste of
+		 * cycles.
+		 */
+		if (folio_test_dirty(folio))
+			return -EAGAIN;
+	}
+
+	if (!folio_isolate_lru(folio))
+		return -EAGAIN;
+
+	return 0;
+}
+
+enum kmmscand_migration_err {
+	KMMSCAND_NULL_MM = 1,
+	KMMSCAND_INVALID_FOLIO,
+	KMMSCAND_INVALID_VMA,
+	KMMSCAND_INELIGIBLE_SRC_NODE,
+	KMMSCAND_SAME_SRC_DEST_NODE,
+	KMMSCAND_LRU_ISOLATION_ERR,
+};
+
+static int kmmscand_promote_folio(struct kmmscand_migrate_info *info)
+{
+	unsigned long pfn;
+	struct page *page;
+	struct folio *folio;
+	struct vm_area_struct *vma;
+	int ret;
+
+	int srcnid, destnid;
+
+	if (info->mm == NULL)
+		return KMMSCAND_NULL_MM;
+
+	folio = info->folio;
+
+	/* Check again if the folio is really valid now */
+	if (folio) {
+		pfn = folio_pfn(folio);
+		page = pfn_to_online_page(pfn);
+	}
+
+	if (!page || !folio || !folio_test_lru(folio) ||
+		folio_is_zone_device(folio) || !folio_mapped(folio))
+		return KMMSCAND_INVALID_FOLIO;
+
+	vma = info->vma;
+
+	/* XXX: Need to validate vma here?. vma_lookup() results in 2x regression */
+	if (!vma)
+		return KMMSCAND_INVALID_VMA;
+
+	srcnid = folio_nid(folio);
+
+	/* Do not try to promote pages from regular nodes */
+	if (!kmmscand_eligible_srcnid(srcnid))
+		return KMMSCAND_INELIGIBLE_SRC_NODE;
+
+	destnid = kmmscand_get_target_node(NULL);
+
+	if (srcnid == destnid)
+		return KMMSCAND_SAME_SRC_DEST_NODE;
+
+	folio_get(folio);
+	ret = kmmscand_migrate_misplaced_folio_prepare(folio, vma, destnid);
+	if (ret) {
+		folio_put(folio);
+		return KMMSCAND_LRU_ISOLATION_ERR;
+	}
+	folio_put(folio);
+
+	return  migrate_misplaced_folio(folio, vma, destnid);
+}
+
 static bool folio_idle_clear_pte_refs_one(struct folio *folio,
 					 struct vm_area_struct *vma,
 					 unsigned long addr,
@@ -266,8 +376,74 @@ static void kmmscand_collect_mm_slot(struct kmmscand_mm_slot *mm_slot)
 	}
 }
 
+static void kmmscand_cleanup_migration_list(struct mm_struct *mm)
+{
+	struct kmmscand_migrate_info *info, *tmp;
+
+start_again:
+	spin_lock(&kmmscand_migrate_lock);
+	if (!list_empty(&kmmscand_migrate_list.migrate_head)) {
+
+		if (mm == READ_ONCE(kmmscand_cur_migrate_mm)) {
+			/* A folio in this mm is being migrated. wait */
+			WRITE_ONCE(kmmscand_migration_list_dirty, true);
+			spin_unlock(&kmmscand_migrate_lock);
+			goto start_again;
+		}
+
+		list_for_each_entry_safe(info, tmp, &kmmscand_migrate_list.migrate_head,
+			migrate_node) {
+			if (info && (info->mm == mm)) {
+				info->mm = NULL;
+				WRITE_ONCE(kmmscand_migration_list_dirty, true);
+			}
+		}
+	}
+	spin_unlock(&kmmscand_migrate_lock);
+}
+
 static void kmmscand_migrate_folio(void)
 {
+	int ret = 0;
+	struct kmmscand_migrate_info *info, *tmp;
+
+	spin_lock(&kmmscand_migrate_lock);
+
+	if (!list_empty(&kmmscand_migrate_list.migrate_head)) {
+		list_for_each_entry_safe(info, tmp, &kmmscand_migrate_list.migrate_head,
+			migrate_node) {
+			if (READ_ONCE(kmmscand_migration_list_dirty)) {
+				kmmscand_migration_list_dirty = false;
+				list_del(&info->migrate_node);
+				/*
+				 * Do not try to migrate this entry because mm might have
+				 * vanished underneath.
+				 */
+				kfree(info);
+				spin_unlock(&kmmscand_migrate_lock);
+				goto dirty_list_handled;
+			}
+
+			list_del(&info->migrate_node);
+			/* Note down the mm of folio entry we are migrating */
+			WRITE_ONCE(kmmscand_cur_migrate_mm, info->mm);
+			spin_unlock(&kmmscand_migrate_lock);
+
+			if (info->mm)
+				ret = kmmscand_promote_folio(info);
+
+			kfree(info);
+
+			spin_lock(&kmmscand_migrate_lock);
+			/* Reset  mm  of folio entry we are migrating */
+			WRITE_ONCE(kmmscand_cur_migrate_mm, NULL);
+			spin_unlock(&kmmscand_migrate_lock);
+dirty_list_handled:
+			//cond_resched();
+			spin_lock(&kmmscand_migrate_lock);
+		}
+	}
+	spin_unlock(&kmmscand_migrate_lock);
 }
 
 static unsigned long kmmscand_scan_mm_slot(void)
@@ -450,6 +626,8 @@ void __kmmscand_exit(struct mm_struct *mm)
 
 	spin_unlock(&kmmscand_mm_lock);
 
+	kmmscand_cleanup_migration_list(mm);
+
 	if (free) {
 		mm_slot_free(kmmscand_slot_cache, mm_slot);
 		mmdrop(mm);
-- 
2.39.3




* [RFC PATCH V0 05/10] mm: Add throttling of mm scanning using scan_period
  2024-12-01 15:38 [RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit Raghavendra K T
                   ` (3 preceding siblings ...)
  2024-12-01 15:38 ` [RFC PATCH V0 04/10] mm/migration: Migrate accessed folios to toptier node Raghavendra K T
@ 2024-12-01 15:38 ` Raghavendra K T
  2024-12-01 15:38 ` [RFC PATCH V0 06/10] mm: Add throttling of mm scanning using scan_size Raghavendra K T
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Raghavendra K T @ 2024-12-01 15:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel, gourry, nehagholkar, abhishekd, david,
	ying.huang, nphamcs, akpm, hannes, feng.tang, kbusch, bharata,
	Hasan.Maruf, sj
  Cc: willy, kirill.shutemov, mgorman, vbabka, hughd, rientjes,
	shy828301, Liam.Howlett, peterz, mingo, Raghavendra K T

Before this patch, scanning of the tasks' mms is done continuously, and
always at the same rate.

Improve that by adding throttling logic:
1) If useful pages were found during both the last scan and the current
scan, decrease the scan_period (to increase the scan rate) by
SCAN_PERIOD_TUNE_PERCENT (15%).

2) If no useful pages were found in the last scan, but there are candidate
migration pages in the current scan, decrease the scan_period aggressively
by a factor of 2^SCAN_PERIOD_CHANGE_SCALE (2^3 = 8 currently).

Vice versa is done for the reverse cases.
The scan period is clamped between MIN (400ms) and MAX (5sec); a worked
example follows below.
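
For example, with the defaults above (illustrative numbers, not from a
run): starting at scan_period = 2000ms, two consecutive scans that find
useful pages give 2000 * (100 - 15) / 100 = 1700ms; useful pages found
right after an empty scan give 2000 >> 3 = 250ms, clamped up to the 400ms
minimum; an empty scan right after a useful one gives 2000 << 3 = 16000ms,
clamped down to the 5000ms maximum.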

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/linux/mm_types.h |   4 ++
 mm/kmmscand.c            | 123 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 125 insertions(+), 2 deletions(-)

Future improvements:
1. Consider the slope of useful pages found in last
scan and current scan for finer tuning.
2. Use migration failure information.

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7361a8f3ab68..620b360b06fe 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -978,6 +978,10 @@ struct mm_struct {
 
 		/* numa_scan_seq prevents two threads remapping PTEs. */
 		int numa_scan_seq;
+#endif
+#ifdef CONFIG_KMMSCAND
+		/* Tracks number of pages with PTE A  bit set after scanning. */
+		atomic_long_t nr_accessed;
 #endif
 		/*
 		 * An operation with batched TLB flushing is going on. Anything
diff --git a/mm/kmmscand.c b/mm/kmmscand.c
index 3b4453b053f4..589aed604cd6 100644
--- a/mm/kmmscand.c
+++ b/mm/kmmscand.c
@@ -19,6 +19,7 @@
 #include <linux/mempolicy.h>
 #include <linux/string.h>
 #include <linux/cleanup.h>
+#include <linux/minmax.h>
 
 #include <asm/pgalloc.h>
 #include "internal.h"
@@ -27,6 +28,16 @@
 static struct task_struct *kmmscand_thread __read_mostly;
 static DEFINE_MUTEX(kmmscand_mutex);
 
+/*
+ * Scan period for each mm.
+ * Min: 400ms default: 2sec Max: 5sec
+ */
+#define KMMSCAND_SCAN_PERIOD_MAX	5000U
+#define KMMSCAND_SCAN_PERIOD_MIN	400U
+#define KMMSCAND_SCAN_PERIOD		2000U
+
+static unsigned int kmmscand_mm_scan_period_ms __read_mostly = KMMSCAND_SCAN_PERIOD;
+
 /* How long to pause between two scan and migration cycle */
 static unsigned int kmmscand_scan_sleep_ms __read_mostly = 16;
 
@@ -58,6 +69,11 @@ static struct kmem_cache *kmmscand_slot_cache __read_mostly;
 
 struct kmmscand_mm_slot {
 	struct mm_slot slot;
+	/* Unit: ms. Determines how often the mm scan should happen. */
+	unsigned int scan_period;
+	unsigned long next_scan;
+	/* Tracks how many useful pages obtained for migration in the last scan */
+	unsigned long scan_delta;
 	long address;
 };
 
@@ -85,6 +101,7 @@ struct kmmscand_migrate_info {
 	struct folio *folio;
 	unsigned long address;
 };
+
 static int kmmscand_has_work(void)
 {
 	return !list_empty(&kmmscand_scan.mm_head);
@@ -324,6 +341,12 @@ static int hot_vma_idle_pte_entry(pte_t *pte,
 			spin_lock(&kmmscand_migrate_lock);
 			list_add_tail(&info->migrate_node, &migrate_list->migrate_head);
 			spin_unlock(&kmmscand_migrate_lock);
+
+			/*
+			 * XXX: Should nr_accessed be per vma for finer control?
+			 * XXX: We are increamenting atomic var under mmap_readlock
+			 */
+			atomic_long_inc(&mm->nr_accessed);
 		}
 	}
 end:
@@ -446,11 +469,85 @@ static void kmmscand_migrate_folio(void)
 	spin_unlock(&kmmscand_migrate_lock);
 }
 
+/*
+ * This is the normal change percentage when old and new delta remain same.
+ * i.e., either both positive or both zero.
+ */
+#define SCAN_PERIOD_TUNE_PERCENT	15
+
+/* This is to change the scan_period aggressively when deltas are different */
+#define SCAN_PERIOD_CHANGE_SCALE	3
+/*
+ * XXX: Hack to prevent unmigrated pages coming again and again while scanning.
+ * Actual fix needs to identify the type of unmigrated pages OR consider migration
+ * failures in next scan.
+ */
+#define KMMSCAND_IGNORE_SCAN_THR	100
+
+/*
+ * X : Number of useful pages in the last scan.
+ * Y : Number of useful pages found in current scan.
+ * Tuning scan_period:
+ *	Initial scan_period is 2s.
+ *	case 1: (X = 0, Y = 0)
+ *		Increase scan_period by SCAN_PERIOD_TUNE_PERCENT.
+ *	case 2: (X = 0, Y > 0)
+ *		Decrease scan_period by (2 << SCAN_PERIOD_CHANGE_SCALE).
+ *	case 3: (X > 0, Y = 0 )
+ *		Increase scan_period by (2 << SCAN_PERIOD_CHANGE_SCALE).
+ *	case 4: (X > 0, Y > 0)
+ *		Decrease scan_period by SCAN_PERIOD_TUNE_PERCENT.
+ */
+static inline void kmmscand_update_mmslot_info(struct kmmscand_mm_slot *mm_slot, unsigned long total)
+{
+	unsigned int scan_period;
+	unsigned long now;
+	unsigned long old_scan_delta;
+
+	/* XXX: Hack to get rid of continuously failing/unmigrateable pages */
+	if (total < KMMSCAND_IGNORE_SCAN_THR)
+		total = 0;
+
+	scan_period = mm_slot->scan_period;
+
+	old_scan_delta = mm_slot->scan_delta;
+
+	/*
+	 * case 1: old_scan_delta and new delta are similar, (slow) TUNE_PERCENT used.
+	 * case 2: old_scan_delta and new delta are different. (fast) CHANGE_SCALE used.
+	 * TBD:
+	 * 1. Further tune scan_period based on delta between last and current scan delta.
+	 * 2. Optimize calculation
+	 */
+	if (!old_scan_delta && !total) {
+		scan_period = (100 + SCAN_PERIOD_TUNE_PERCENT) * scan_period;
+		scan_period /= 100;
+	} else if (old_scan_delta && total) {
+		scan_period = (100 - SCAN_PERIOD_TUNE_PERCENT) * scan_period;
+		scan_period /= 100;
+	} else if (old_scan_delta && !total) {
+		scan_period = scan_period << SCAN_PERIOD_CHANGE_SCALE;
+	} else {
+		scan_period = scan_period >> SCAN_PERIOD_CHANGE_SCALE;
+	}
+
+	scan_period = clamp(scan_period, KMMSCAND_SCAN_PERIOD_MIN, KMMSCAND_SCAN_PERIOD_MAX);
+
+	now = jiffies;
+	mm_slot->next_scan = now + msecs_to_jiffies(scan_period);
+	mm_slot->scan_period = scan_period;
+	mm_slot->scan_delta = total;
+}
+
 static unsigned long kmmscand_scan_mm_slot(void)
 {
 	bool update_mmslot_info = false;
 
+	unsigned int mm_slot_scan_period;
+	unsigned long now;
+	unsigned long mm_slot_next_scan;
 	unsigned long address;
+	unsigned long folio_nr_access_s, folio_nr_access_e, total = 0;
 
 	struct mm_slot *slot;
 	struct mm_struct *mm;
@@ -473,6 +570,8 @@ static unsigned long kmmscand_scan_mm_slot(void)
 		kmmscand_scan.mm_slot = mm_slot;
 	}
 
+	mm_slot_next_scan = mm_slot->next_scan;
+	mm_slot_scan_period = mm_slot->scan_period;
 	mm = slot->mm;
 
 	spin_unlock(&kmmscand_mm_lock);
@@ -483,6 +582,16 @@ static unsigned long kmmscand_scan_mm_slot(void)
 	if (unlikely(kmmscand_test_exit(mm)))
 		goto outerloop;
 
+	now = jiffies;
+	/*
+	 * Dont scan if :
+	 * This is not a first scan AND
+	 * Reaching here before designated next_scan time.
+	 */
+	if (mm_slot_next_scan && time_before(now, mm_slot_next_scan))
+		goto outerloop;
+
+	folio_nr_access_s = atomic_long_read(&mm->nr_accessed);
 
 	vma_iter_init(&vmi, mm, address);
 
@@ -492,6 +601,8 @@ static unsigned long kmmscand_scan_mm_slot(void)
 
 		address = vma->vm_end;
 	}
+	folio_nr_access_e = atomic_long_read(&mm->nr_accessed);
+	total = folio_nr_access_e - folio_nr_access_s;
 
 	if (!vma)
 		address = 0;
@@ -506,8 +617,12 @@ static unsigned long kmmscand_scan_mm_slot(void)
 	spin_lock(&kmmscand_mm_lock);
 	VM_BUG_ON(kmmscand_scan.mm_slot != mm_slot);
 
-	if (update_mmslot_info)
+
+	if (update_mmslot_info) {
 		mm_slot->address = address;
+		kmmscand_update_mmslot_info(mm_slot, total);
+	}
+
 	/*
 	 * Release the current mm_slot if this mm is about to die, or
 	 * if we scanned all vmas of this mm.
@@ -532,7 +647,7 @@ static unsigned long kmmscand_scan_mm_slot(void)
 	}
 
 	spin_unlock(&kmmscand_mm_lock);
-	return 0;
+	return total;
 }
 
 static void kmmscand_do_scan(void)
@@ -595,6 +710,10 @@ void __kmmscand_enter(struct mm_struct *mm)
 		return;
 
 	kmmscand_slot->address = 0;
+	kmmscand_slot->scan_period = kmmscand_mm_scan_period_ms;
+	kmmscand_slot->next_scan = 0;
+	kmmscand_slot->scan_delta = 0;
+
 	slot = &kmmscand_slot->slot;
 
 	spin_lock(&kmmscand_mm_lock);
-- 
2.39.3




* [RFC PATCH V0 06/10] mm: Add throttling of mm scanning using scan_size
  2024-12-01 15:38 [RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit Raghavendra K T
                   ` (4 preceding siblings ...)
  2024-12-01 15:38 ` [RFC PATCH V0 05/10] mm: Add throttling of mm scanning using scan_period Raghavendra K T
@ 2024-12-01 15:38 ` Raghavendra K T
  2024-12-01 15:38 ` [RFC PATCH V0 07/10] sysfs: Add sysfs support to tune scanning Raghavendra K T
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Raghavendra K T @ 2024-12-01 15:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel, gourry, nehagholkar, abhishekd, david,
	ying.huang, nphamcs, akpm, hannes, feng.tang, kbusch, bharata,
	Hasan.Maruf, sj
  Cc: willy, kirill.shutemov, mgorman, vbabka, hughd, rientjes,
	shy828301, Liam.Howlett, peterz, mingo, Raghavendra K T

Before this patch, scanning is done on the entire virtual address space
of all the tasks. Now the scan size is shrunk or expanded based on the
number of useful pages found in the last scan.

This helps to quickly get out of unnecessary scanning, thus burning
less CPU.

Drawback: If a useful chunk is at the other end of the VMA space, its
scanning and migration will be delayed.

Shrink/expand algorithm for scan_size:
X : Number of useful pages found in the last scan.
Y : Number of useful pages found in the current scan.
Initial scan_size is 4GB.
 case 1: (X = 0, Y = 0)
   Decrease scan_size by a factor of 2.
 case 2: (X = 0, Y > 0)
   Aggressively change to MAX (16GB).
 case 3: (X > 0, Y = 0)
   No change.
 case 4: (X > 0, Y > 0)
   Increase scan_size by a factor of 2.

The scan size is clamped between MIN (512MB) and MAX (16GB); see the
example below.
TBD: Tuning this based on a real workload.
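
For example, with the defaults above (illustrative numbers, not from a
run): starting at scan_size = 4GB, repeated empty scans halve it,
4GB -> 2GB -> 1GB -> 512MB (clamped at the minimum); a scan that finds
useful pages right after an empty one jumps straight to the 16GB maximum;
consecutive useful scans double it, e.g. 4GB -> 8GB.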

Experiment:
============
Abench microbenchmark,
- Allocates 8GB/32GB of memory on CXL node
- 64 threads created, and each thread randomly accesses pages in 4K
  granularity.
- 512 iterations.

SUT: 512 CPU, 2 node 256GB, AMD EPYC.

3 runs, command:  abench -m 2 -d 1 -i 512 -s <size>

The benchmark measures how much time is taken to complete the task;
lower is better. The expectation is that CXL node memory is migrated
as fast as possible.

Base case:    6.11-rc6 w/ numab mode = 2 (hot page promotion is enabled).
Patched case: 6.11-rc6 w/ numab mode = 0 (NUMA balancing is disabled);
we expect the daemon to do the page promotion.

Result:
========
         base                    patched
         time in sec  (%stdev)   time in sec  (%stdev)     %gain
 8GB     133.66       ( 0.38 )        113.77  ( 1.83 )     14.88
32GB     584.77       ( 0.19 )        542.79  ( 0.11 )      7.17

Overhead:
The times below are calculated using patch 10. The actual overhead for the
patched case may be even lower.

               (scan + migration)  time in sec
Total memory   base kernel    patched kernel       %gain
8GB             65.743          13.93              78.8114324
32GB           153.95          132.12              14.17992855

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 mm/kmmscand.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 52 insertions(+), 3 deletions(-)

diff --git a/mm/kmmscand.c b/mm/kmmscand.c
index 589aed604cd6..2efef53f9402 100644
--- a/mm/kmmscand.c
+++ b/mm/kmmscand.c
@@ -28,6 +28,16 @@
 static struct task_struct *kmmscand_thread __read_mostly;
 static DEFINE_MUTEX(kmmscand_mutex);
 
+/*
+ * Total VMA size to cover during scan.
+ * Min: 512MB default: 4GB max: 16GB
+ */
+#define KMMSCAND_SCAN_SIZE_MIN	(512 * 1024 * 1024UL)
+#define KMMSCAND_SCAN_SIZE_MAX	(16 * 1024 * 1024 * 1024UL)
+#define KMMSCAND_SCAN_SIZE	(4 * 1024 * 1024 * 1024UL)
+
+static unsigned long kmmscand_scan_size __read_mostly = KMMSCAND_SCAN_SIZE;
+
 /*
  * Scan period for each mm.
  * Min: 400ms default: 2sec Max: 5sec
@@ -74,6 +84,8 @@ struct kmmscand_mm_slot {
 	unsigned long next_scan;
 	/* Tracks how many useful pages obtained for migration in the last scan */
 	unsigned long scan_delta;
+	/* Determines how much VMA address space to be covered in the scanning */
+	unsigned long scan_size;
 	long address;
 };
 
@@ -484,6 +496,7 @@ static void kmmscand_migrate_folio(void)
  */
 #define KMMSCAND_IGNORE_SCAN_THR	100
 
+#define SCAN_SIZE_CHANGE_SCALE	1
 /*
  * X : Number of useful pages in the last scan.
  * Y : Number of useful pages found in current scan.
@@ -497,11 +510,22 @@ static void kmmscand_migrate_folio(void)
  *		Increase scan_period by (2 << SCAN_PERIOD_CHANGE_SCALE).
  *	case 4: (X > 0, Y > 0)
  *		Decrease scan_period by SCAN_PERIOD_TUNE_PERCENT.
+ * Tuning scan_size:
+ * Initial scan_size is 4GB
+ *	case 1: (X = 0, Y = 0)
+ *		Decrease scan_size by (1 << SCAN_SIZE_CHANGE_SCALE).
+ *	case 2: (X = 0, Y > 0)
+ *		scan_size = KMMSCAND_SCAN_SIZE_MAX
+ *  case 3: (X > 0, Y = 0 )
+ *		No change
+ *  case 4: (X > 0, Y > 0)
+ *		Increase scan_size by (1 << SCAN_SIZE_CHANGE_SCALE).
  */
 static inline void kmmscand_update_mmslot_info(struct kmmscand_mm_slot *mm_slot, unsigned long total)
 {
 	unsigned int scan_period;
 	unsigned long now;
+	unsigned long scan_size;
 	unsigned long old_scan_delta;
 
 	/* XXX: Hack to get rid of continuously failing/unmigrateable pages */
@@ -509,6 +533,7 @@ static inline void kmmscand_update_mmslot_info(struct kmmscand_mm_slot *mm_slot,
 		total = 0;
 
 	scan_period = mm_slot->scan_period;
+	scan_size = mm_slot->scan_size;
 
 	old_scan_delta = mm_slot->scan_delta;
 
@@ -522,30 +547,38 @@ static inline void kmmscand_update_mmslot_info(struct kmmscand_mm_slot *mm_slot,
 	if (!old_scan_delta && !total) {
 		scan_period = (100 + SCAN_PERIOD_TUNE_PERCENT) * scan_period;
 		scan_period /= 100;
+		scan_size = scan_size >> SCAN_SIZE_CHANGE_SCALE;
 	} else if (old_scan_delta && total) {
 		scan_period = (100 - SCAN_PERIOD_TUNE_PERCENT) * scan_period;
 		scan_period /= 100;
+		scan_size = scan_size << SCAN_SIZE_CHANGE_SCALE;
 	} else if (old_scan_delta && !total) {
 		scan_period = scan_period << SCAN_PERIOD_CHANGE_SCALE;
 	} else {
 		scan_period = scan_period >> SCAN_PERIOD_CHANGE_SCALE;
+		scan_size = KMMSCAND_SCAN_SIZE_MAX;
 	}
 
 	scan_period = clamp(scan_period, KMMSCAND_SCAN_PERIOD_MIN, KMMSCAND_SCAN_PERIOD_MAX);
+	scan_size = clamp(scan_size, KMMSCAND_SCAN_SIZE_MIN, KMMSCAND_SCAN_SIZE_MAX);
 
 	now = jiffies;
 	mm_slot->next_scan = now + msecs_to_jiffies(scan_period);
 	mm_slot->scan_period = scan_period;
+	mm_slot->scan_size = scan_size;
 	mm_slot->scan_delta = total;
 }
 
 static unsigned long kmmscand_scan_mm_slot(void)
 {
+	bool next_mm = false;
 	bool update_mmslot_info = false;
 
 	unsigned int mm_slot_scan_period;
 	unsigned long now;
 	unsigned long mm_slot_next_scan;
+	unsigned long mm_slot_scan_size;
+	unsigned long scanned_size = 0;
 	unsigned long address;
 	unsigned long folio_nr_access_s, folio_nr_access_e, total = 0;
 
@@ -572,6 +605,7 @@ static unsigned long kmmscand_scan_mm_slot(void)
 
 	mm_slot_next_scan = mm_slot->next_scan;
 	mm_slot_scan_period = mm_slot->scan_period;
+	mm_slot_scan_size = mm_slot->scan_size;
 	mm = slot->mm;
 
 	spin_unlock(&kmmscand_mm_lock);
@@ -579,8 +613,10 @@ static unsigned long kmmscand_scan_mm_slot(void)
 	if (unlikely(!mmap_read_trylock(mm)))
 		goto outerloop_mmap_lock;
 
-	if (unlikely(kmmscand_test_exit(mm)))
+	if (unlikely(kmmscand_test_exit(mm))) {
+		next_mm = true;
 		goto outerloop;
+	}
 
 	now = jiffies;
 	/*
@@ -598,8 +634,20 @@ static unsigned long kmmscand_scan_mm_slot(void)
 	for_each_vma(vmi, vma) {
 		/* Count the scanned pages here to decide exit */
 		kmmscand_walk_page_vma(vma);
-
+		scanned_size += vma->vm_end - vma->vm_start;
 		address = vma->vm_end;
+
+		if (scanned_size >= mm_slot_scan_size) {
+			folio_nr_access_e = atomic_long_read(&mm->nr_accessed);
+			total = folio_nr_access_e - folio_nr_access_s;
+			/* If we had got accessed pages, ignore the current scan_size threshold */
+			if (total > KMMSCAND_IGNORE_SCAN_THR) {
+				mm_slot_scan_size = KMMSCAND_SCAN_SIZE_MAX;
+				continue;
+			}
+			next_mm = true;
+			break;
+		}
 	}
 	folio_nr_access_e = atomic_long_read(&mm->nr_accessed);
 	total = folio_nr_access_e - folio_nr_access_s;
@@ -627,7 +675,7 @@ static unsigned long kmmscand_scan_mm_slot(void)
 	 * Release the current mm_slot if this mm is about to die, or
 	 * if we scanned all vmas of this mm.
 	 */
-	if (unlikely(kmmscand_test_exit(mm)) || !vma) {
+	if (unlikely(kmmscand_test_exit(mm)) || !vma || next_mm) {
 		/*
 		 * Make sure that if mm_users is reaching zero while
 		 * kmmscand runs here, kmmscand_exit will find
@@ -711,6 +759,7 @@ void __kmmscand_enter(struct mm_struct *mm)
 
 	kmmscand_slot->address = 0;
 	kmmscand_slot->scan_period = kmmscand_mm_scan_period_ms;
+	kmmscand_slot->scan_size = kmmscand_scan_size;
 	kmmscand_slot->next_scan = 0;
 	kmmscand_slot->scan_delta = 0;
 
-- 
2.39.3




* [RFC PATCH V0 07/10] sysfs: Add sysfs support to tune scanning
  2024-12-01 15:38 [RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit Raghavendra K T
                   ` (5 preceding siblings ...)
  2024-12-01 15:38 ` [RFC PATCH V0 06/10] mm: Add throttling of mm scanning using scan_size Raghavendra K T
@ 2024-12-01 15:38 ` Raghavendra K T
  2024-12-01 15:38 ` [RFC PATCH V0 08/10] vmstat: Add vmstat counters Raghavendra K T
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Raghavendra K T @ 2024-12-01 15:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel, gourry, nehagholkar, abhishekd, david,
	ying.huang, nphamcs, akpm, hannes, feng.tang, kbusch, bharata,
	Hasan.Maruf, sj
  Cc: willy, kirill.shutemov, mgorman, vbabka, hughd, rientjes,
	shy828301, Liam.Howlett, peterz, mingo, Raghavendra K T

Support the following tunables under /sys/kernel/mm/kmmscand/:
scan_enabled: turn mm_struct scanning on or off
mm_scan_period_ms: initial per-mm scan period (default: 2 sec)
scan_sleep_ms: sleep time between two successive rounds of scanning and
migration
mms_to_scan: total number of mm_structs to scan before taking a pause
target_node: default regular (toptier) node to which accessed pages are
migrated

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 mm/kmmscand.c | 205 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 205 insertions(+)

diff --git a/mm/kmmscand.c b/mm/kmmscand.c
index 2efef53f9402..344a45bd2d3e 100644
--- a/mm/kmmscand.c
+++ b/mm/kmmscand.c
@@ -20,6 +20,7 @@
 #include <linux/string.h>
 #include <linux/cleanup.h>
 #include <linux/minmax.h>
+#include <trace/events/kmem.h>
 
 #include <asm/pgalloc.h>
 #include "internal.h"
@@ -114,6 +115,170 @@ struct kmmscand_migrate_info {
 	unsigned long address;
 };
 
+#ifdef CONFIG_SYSFS
+static ssize_t scan_sleep_ms_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *buf)
+{
+	return sysfs_emit(buf, "%u\n", kmmscand_scan_sleep_ms);
+}
+
+static ssize_t scan_sleep_ms_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	unsigned int msecs;
+	int err;
+
+	err = kstrtouint(buf, 10, &msecs);
+	if (err)
+		return -EINVAL;
+
+	kmmscand_scan_sleep_ms = msecs;
+	kmmscand_sleep_expire = 0;
+	wake_up_interruptible(&kmmscand_wait);
+
+	return count;
+}
+static struct kobj_attribute scan_sleep_ms_attr =
+	__ATTR_RW(scan_sleep_ms);
+
+static ssize_t mm_scan_period_ms_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *buf)
+{
+	return sysfs_emit(buf, "%u\n", kmmscand_mm_scan_period_ms);
+}
+
+/* If a value less than MIN or greater than MAX asked for store value is clamped */
+static ssize_t mm_scan_period_ms_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	unsigned int msecs, stored_msecs;
+	int err;
+
+	err = kstrtouint(buf, 10, &msecs);
+	if (err)
+		return -EINVAL;
+
+	stored_msecs = clamp(msecs, KMMSCAND_SCAN_PERIOD_MIN, KMMSCAND_SCAN_PERIOD_MAX);
+
+	kmmscand_mm_scan_period_ms = stored_msecs;
+	kmmscand_sleep_expire = 0;
+	wake_up_interruptible(&kmmscand_wait);
+
+	return count;
+}
+
+static struct kobj_attribute mm_scan_period_ms_attr =
+	__ATTR_RW(mm_scan_period_ms);
+
+static ssize_t mms_to_scan_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *buf)
+{
+	return sysfs_emit(buf, "%lu\n", kmmscand_mms_to_scan);
+}
+
+static ssize_t mms_to_scan_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	unsigned long val;
+	int err;
+
+	err = kstrtoul(buf, 10, &val);
+	if (err)
+		return -EINVAL;
+
+	kmmscand_mms_to_scan = val;
+	kmmscand_sleep_expire = 0;
+	wake_up_interruptible(&kmmscand_wait);
+
+	return count;
+}
+
+static struct kobj_attribute mms_to_scan_attr =
+	__ATTR_RW(mms_to_scan);
+
+static ssize_t scan_enabled_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *buf)
+{
+	return sysfs_emit(buf, "%u\n", kmmscand_scan_enabled ? 1 : 0);
+}
+
+static ssize_t scan_enabled_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	unsigned int val;
+	int err;
+
+	err = kstrtouint(buf, 10, &val);
+	if (err || val > 1)
+		return -EINVAL;
+
+	if (val) {
+		kmmscand_scan_enabled = true;
+		need_wakeup = true;
+	} else
+		kmmscand_scan_enabled = false;
+
+	kmmscand_sleep_expire = 0;
+	wake_up_interruptible(&kmmscand_wait);
+
+	return count;
+}
+
+static struct kobj_attribute scan_enabled_attr =
+	__ATTR_RW(scan_enabled);
+
+static ssize_t target_node_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *buf)
+{
+	return sysfs_emit(buf, "%u\n", kmmscand_target_node);
+}
+
+static ssize_t target_node_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	int err, node;
+
+	err = kstrtoint(buf, 10, &node);
+	if (err)
+		return -EINVAL;
+
+	kmmscand_sleep_expire = 0;
+	if (!node_is_toptier(node))
+		return -EINVAL;
+
+	kmmscand_target_node = node;
+	wake_up_interruptible(&kmmscand_wait);
+
+	return count;
+}
+static struct kobj_attribute target_node_attr =
+	__ATTR_RW(target_node);
+
+static struct attribute *kmmscand_attr[] = {
+	&scan_sleep_ms_attr.attr,
+	&mm_scan_period_ms_attr.attr,
+	&mms_to_scan_attr.attr,
+	&scan_enabled_attr.attr,
+	&target_node_attr.attr,
+	NULL,
+};
+
+struct attribute_group kmmscand_attr_group = {
+	.attrs = kmmscand_attr,
+	.name = "kmmscand",
+};
+#endif
+
 static int kmmscand_has_work(void)
 {
 	return !list_empty(&kmmscand_scan.mm_head);
@@ -738,9 +903,43 @@ static int kmmscand(void *none)
 	return 0;
 }
 
+#ifdef CONFIG_SYSFS
+extern struct kobject *mm_kobj;
+static int __init kmmscand_init_sysfs(struct kobject **kobj)
+{
+	int err;
+
+	err = sysfs_create_group(*kobj, &kmmscand_attr_group);
+	if (err) {
+		pr_err("failed to register kmmscand group\n");
+		goto err_kmmscand_attr;
+	}
+
+	return 0;
+
+err_kmmscand_attr:
+	sysfs_remove_group(*kobj, &kmmscand_attr_group);
+	return err;
+}
+
+static void __init kmmscand_exit_sysfs(struct kobject *kobj)
+{
+		sysfs_remove_group(kobj, &kmmscand_attr_group);
+}
+#else
+static inline int __init kmmscand_init_sysfs(struct kobject **kobj)
+{
+	return 0;
+}
+static inline void __init kmmscand_exit_sysfs(struct kobject *kobj)
+{
+}
+#endif
+
 static inline void kmmscand_destroy(void)
 {
 	kmem_cache_destroy(kmmscand_slot_cache);
+	kmmscand_exit_sysfs(mm_kobj);
 }
 
 void __kmmscand_enter(struct mm_struct *mm)
@@ -857,6 +1056,11 @@ static int __init kmmscand_init(void)
 		return -ENOMEM;
 	}
 
+	err = kmmscand_init_sysfs(&mm_kobj);
+
+	if (err)
+		goto err_init_sysfs;
+
 	err = start_kmmscand();
 	if (err)
 		goto err_kmmscand;
@@ -865,6 +1069,7 @@ static int __init kmmscand_init(void)
 
 err_kmmscand:
 	stop_kmmscand();
+err_init_sysfs:
 	kmmscand_destroy();
 
 	return err;
-- 
2.39.3



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [RFC PATCH V0 08/10] vmstat: Add vmstat counters
  2024-12-01 15:38 [RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit Raghavendra K T
                   ` (6 preceding siblings ...)
  2024-12-01 15:38 ` [RFC PATCH V0 07/10] sysfs: Add sysfs support to tune scanning Raghavendra K T
@ 2024-12-01 15:38 ` Raghavendra K T
  2024-12-01 15:38 ` [RFC PATCH V0 09/10] trace/kmmscand: Add tracing of scanning and migration Raghavendra K T
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Raghavendra K T @ 2024-12-01 15:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel, gourry, nehagholkar, abhishekd, david,
	ying.huang, nphamcs, akpm, hannes, feng.tang, kbusch, bharata,
	Hasan.Maruf, sj
  Cc: willy, kirill.shutemov, mgorman, vbabka, hughd, rientjes,
	shy828301, Liam.Howlett, peterz, mingo, Raghavendra K T

Add vmstat counters to track scanning, migration and the
type of pages encountered (toptier/slowtier/idle).

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/linux/mm.h            | 11 ++++++++
 include/linux/vm_event_item.h | 10 +++++++
 mm/kmmscand.c                 | 50 ++++++++++++++++++++++++++++++++++-
 mm/vmstat.c                   | 10 +++++++
 4 files changed, 80 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c39c4945946c..306452c11d31 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -681,6 +681,17 @@ struct vm_operations_struct {
 					  unsigned long addr);
 };
 
+#ifdef CONFIG_KMMSCAND
+void count_kmmscand_mm_scans(void);
+void count_kmmscand_vma_scans(void);
+void count_kmmscand_migadded(void);
+void count_kmmscand_migrated(void);
+void count_kmmscand_migrate_failed(void);
+void count_kmmscand_slowtier(void);
+void count_kmmscand_toptier(void);
+void count_kmmscand_idlepage(void);
+#endif
+
 #ifdef CONFIG_NUMA_BALANCING
 static inline void vma_numab_state_init(struct vm_area_struct *vma)
 {
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index f70d0958095c..b2ccd4f665aa 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -65,6 +65,16 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		NUMA_HINT_FAULTS_LOCAL,
 		NUMA_PAGE_MIGRATE,
 #endif
+#ifdef CONFIG_KMMSCAND
+		KMMSCAND_MM_SCANS,
+		KMMSCAND_VMA_SCANS,
+		KMMSCAND_MIGADDED,
+		KMMSCAND_MIGRATED,
+		KMMSCAND_MIGRATE_FAILED,
+		KMMSCAND_SLOWTIER,
+		KMMSCAND_TOPTIER,
+		KMMSCAND_IDLEPAGE,
+#endif
 #ifdef CONFIG_MIGRATION
 		PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
 		THP_MIGRATION_SUCCESS,
diff --git a/mm/kmmscand.c b/mm/kmmscand.c
index 344a45bd2d3e..682c0523c0b4 100644
--- a/mm/kmmscand.c
+++ b/mm/kmmscand.c
@@ -279,6 +279,39 @@ struct attribute_group kmmscand_attr_group = {
 };
 #endif
 
+void count_kmmscand_mm_scans(void)
+{
+	count_vm_numa_event(KMMSCAND_MM_SCANS);
+}
+void count_kmmscand_vma_scans(void)
+{
+	count_vm_numa_event(KMMSCAND_VMA_SCANS);
+}
+void count_kmmscand_migadded(void)
+{
+	count_vm_numa_event(KMMSCAND_MIGADDED);
+}
+void count_kmmscand_migrated(void)
+{
+	count_vm_numa_event(KMMSCAND_MIGRATED);
+}
+void count_kmmscand_migrate_failed(void)
+{
+	count_vm_numa_event(KMMSCAND_MIGRATE_FAILED);
+}
+void count_kmmscand_slowtier(void)
+{
+	count_vm_numa_event(KMMSCAND_SLOWTIER);
+}
+void count_kmmscand_toptier(void)
+{
+	count_vm_numa_event(KMMSCAND_TOPTIER);
+}
+void count_kmmscand_idlepage(void)
+{
+	count_vm_numa_event(KMMSCAND_IDLEPAGE);
+}
+
 static int kmmscand_has_work(void)
 {
 	return !list_empty(&kmmscand_scan.mm_head);
@@ -500,6 +533,9 @@ static int hot_vma_idle_pte_entry(pte_t *pte,
 
 	srcnid = folio_nid(folio);
 
+	if (node_is_toptier(srcnid))
+		count_kmmscand_toptier();
+
 	if (!folio_test_idle(folio) || folio_test_young(folio) ||
 			mmu_notifier_test_young(mm, addr) ||
 			folio_test_referenced(folio) || pte_young(pteval)) {
@@ -511,6 +547,7 @@ static int hot_vma_idle_pte_entry(pte_t *pte,
 		info = kzalloc(sizeof(struct kmmscand_migrate_info), GFP_KERNEL);
 		if (info && migrate_list) {
 
+			count_kmmscand_slowtier();
 			info->mm = mm;
 			info->vma = vma;
 			info->folio = folio;
@@ -524,8 +561,10 @@ static int hot_vma_idle_pte_entry(pte_t *pte,
 			 * XXX: We are increamenting atomic var under mmap_readlock
 			 */
 			atomic_long_inc(&mm->nr_accessed);
+			count_kmmscand_migadded();
 		}
-	}
+	} else
+		count_kmmscand_idlepage();
 end:
 	folio_set_idle(folio);
 	folio_put(folio);
@@ -632,6 +671,12 @@ static void kmmscand_migrate_folio(void)
 			if (info->mm)
 				ret = kmmscand_promote_folio(info);
 
+			/* TBD: encode migrated count here, currently assume folio_nr_pages */
+			if (!ret)
+				count_kmmscand_migrated();
+			else
+				count_kmmscand_migrate_failed();
+
 			kfree(info);
 
 			spin_lock(&kmmscand_migrate_lock);
@@ -799,6 +844,7 @@ static unsigned long kmmscand_scan_mm_slot(void)
 	for_each_vma(vmi, vma) {
 		/* Count the scanned pages here to decide exit */
 		kmmscand_walk_page_vma(vma);
+		count_kmmscand_vma_scans();
 		scanned_size += vma->vm_end - vma->vm_start;
 		address = vma->vm_end;
 
@@ -822,6 +868,8 @@ static unsigned long kmmscand_scan_mm_slot(void)
 
 	update_mmslot_info = true;
 
+	count_kmmscand_mm_scans();
+
 outerloop:
 	/* exit_mmap will destroy ptes after this */
 	mmap_read_unlock(mm);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4d016314a56c..d758e7155042 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1340,6 +1340,16 @@ const char * const vmstat_text[] = {
 	"numa_hint_faults_local",
 	"numa_pages_migrated",
 #endif
+#ifdef CONFIG_KMMSCAND
+	"nr_kmmscand_mm_scans",
+	"nr_kmmscand_vma_scans",
+	"nr_kmmscand_migadded",
+	"nr_kmmscand_migrated",
+	"nr_kmmscand_migrate_failed",
+	"nr_kmmscand_slowtier",
+	"nr_kmmscand_toptier",
+	"nr_kmmscand_idlepage",
+#endif
 #ifdef CONFIG_MIGRATION
 	"pgmigrate_success",
 	"pgmigrate_fail",
-- 
2.39.3



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [RFC PATCH V0 09/10] trace/kmmscand: Add tracing of scanning and migration
  2024-12-01 15:38 [RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit Raghavendra K T
                   ` (7 preceding siblings ...)
  2024-12-01 15:38 ` [RFC PATCH V0 08/10] vmstat: Add vmstat counters Raghavendra K T
@ 2024-12-01 15:38 ` Raghavendra K T
  2024-12-05 17:46   ` Steven Rostedt
  2024-12-01 15:38 ` [RFC PATCH V0 DO NOT MERGE 10/10] kmmscand: Add scanning Raghavendra K T
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 18+ messages in thread
From: Raghavendra K T @ 2024-12-01 15:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel, gourry, nehagholkar, abhishekd, david,
	ying.huang, nphamcs, akpm, hannes, feng.tang, kbusch, bharata,
	Hasan.Maruf, sj
  Cc: willy, kirill.shutemov, mgorman, vbabka, hughd, rientjes,
	shy828301, Liam.Howlett, peterz, mingo, Raghavendra K T,
	Steven Rostedt, Masami Hiramatsu, linux-trace-kernel

Add tracing support to track
 - start and end of scanning.
 - migration.

CC: Steven Rostedt <rostedt@goodmis.org>
CC: Masami Hiramatsu <mhiramat@kernel.org>
CC: linux-trace-kernel@vger.kernel.org

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/trace/events/kmem.h | 99 +++++++++++++++++++++++++++++++++++++
 mm/kmmscand.c               | 12 ++++-
 2 files changed, 110 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index b37eb0a7060f..80978d85300d 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -9,6 +9,105 @@
 #include <linux/tracepoint.h>
 #include <trace/events/mmflags.h>
 
+TRACE_EVENT(kmem_mm_enter,
+
+	TP_PROTO(struct task_struct *t,
+		 struct mm_struct *mm),
+
+	TP_ARGS(t, mm),
+
+	TP_STRUCT__entry(
+		__array(	char, comm, TASK_COMM_LEN	)
+		__field(	struct mm_struct *, mm		)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
+		__entry->mm = mm;
+	),
+
+	TP_printk("kmmscand: mm_enter comm =%s mm=%p", __entry->comm, __entry->mm)
+);
+
+TRACE_EVENT(kmem_scan_mm_start,
+
+	TP_PROTO(struct task_struct *t,
+		 struct mm_struct *mm),
+
+	TP_ARGS(t, mm),
+
+	TP_STRUCT__entry(
+		__array(	char, comm, TASK_COMM_LEN	)
+		__field(	struct mm_struct *, mm		)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
+		__entry->mm = mm;
+	),
+
+	TP_printk("kmmscand: scan_mm_start comm =%s mm=%p", __entry->comm, __entry->mm)
+);
+
+TRACE_EVENT(kmem_scan_mm_end,
+
+	TP_PROTO(struct task_struct *t,
+		 struct mm_struct *mm,
+		 unsigned long start,
+		 unsigned long total,
+		 unsigned long scan_size,
+		 unsigned long scan_period),
+
+	TP_ARGS(t, mm, start, total, scan_period, scan_size),
+
+	TP_STRUCT__entry(
+		__array(	char, comm, TASK_COMM_LEN	)
+		__field(	struct mm_struct *, mm		)
+		__field(	unsigned long,   start		)
+		__field(	unsigned long,   total		)
+		__field(	unsigned long,   scan_period	)
+		__field(	unsigned long,   scan_size	)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
+		__entry->mm = mm;
+		__entry->start = start;
+		__entry->total = total;
+		__entry->scan_period = scan_period;
+		__entry->scan_size = scan_size;
+	),
+
+	TP_printk("kmmscand: scan_mm_end comm =%s mm=%p, start = %ld, total = %ld, scan_period = %ld, scan_size = %ld",
+			__entry->comm, __entry->mm, __entry->start,
+			__entry->total, __entry->scan_period, __entry->scan_size)
+);
+
+TRACE_EVENT(kmem_scan_mm_migrate,
+
+	TP_PROTO(struct task_struct *t,
+		 struct mm_struct *mm,
+		 int rc),
+
+	TP_ARGS(t, mm, rc),
+
+	TP_STRUCT__entry(
+		__array(	char, comm, TASK_COMM_LEN	)
+		__field(	struct mm_struct *, mm		)
+		__field(	int,   rc			)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
+		__entry->mm = mm;
+		__entry->rc = rc;
+	),
+
+	TP_printk("kmmscand: scan_mm_migrate comm =%s mm=%p rc=%d", __entry->comm,
+			__entry->mm, __entry->rc)
+);
+
+
 TRACE_EVENT(kmem_cache_alloc,
 
 	TP_PROTO(unsigned long call_site,
diff --git a/mm/kmmscand.c b/mm/kmmscand.c
index 682c0523c0b4..70f588a210dd 100644
--- a/mm/kmmscand.c
+++ b/mm/kmmscand.c
@@ -668,8 +668,10 @@ static void kmmscand_migrate_folio(void)
 			WRITE_ONCE(kmmscand_cur_migrate_mm, info->mm);
 			spin_unlock(&kmmscand_migrate_lock);
 
-			if (info->mm)
+			if (info->mm) {
 				ret = kmmscand_promote_folio(info);
+				trace_kmem_scan_mm_migrate(info->mm->owner, info->mm, ret);
+			}
 
 			/* TBD: encode migrated count here, currently assume folio_nr_pages */
 			if (!ret)
@@ -828,6 +830,9 @@ static unsigned long kmmscand_scan_mm_slot(void)
 		goto outerloop;
 	}
 
+	if (mm->owner)
+		trace_kmem_scan_mm_start(mm->owner, mm);
+
 	now = jiffies;
 	/*
 	 * Dont scan if :
@@ -868,6 +873,10 @@ static unsigned long kmmscand_scan_mm_slot(void)
 
 	update_mmslot_info = true;
 
+	if (mm->owner)
+		trace_kmem_scan_mm_end(mm->owner, mm, address, total,
+					mm_slot_scan_period, mm_slot_scan_size);
+
 	count_kmmscand_mm_scans();
 
 outerloop:
@@ -1020,6 +1029,7 @@ void __kmmscand_enter(struct mm_struct *mm)
 	spin_unlock(&kmmscand_mm_lock);
 
 	mmgrab(mm);
+	trace_kmem_mm_enter(mm->owner, mm);
 	if (wakeup)
 		wake_up_interruptible(&kmmscand_wait);
 }
-- 
2.39.3



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [RFC PATCH V0 DO NOT MERGE 10/10] kmmscand: Add scanning
  2024-12-01 15:38 [RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit Raghavendra K T
                   ` (8 preceding siblings ...)
  2024-12-01 15:38 ` [RFC PATCH V0 09/10] trace/kmmscand: Add tracing of scanning and migration Raghavendra K T
@ 2024-12-01 15:38 ` Raghavendra K T
  2024-12-10 18:53 ` [RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit SeongJae Park
  2025-02-12 17:02 ` Davidlohr Bueso
  11 siblings, 0 replies; 18+ messages in thread
From: Raghavendra K T @ 2024-12-01 15:38 UTC (permalink / raw)
  To: linux-mm, linux-kernel, gourry, nehagholkar, abhishekd, david,
	ying.huang, nphamcs, akpm, hannes, feng.tang, kbusch, bharata,
	Hasan.Maruf, sj
  Cc: willy, kirill.shutemov, mgorman, vbabka, hughd, rientjes,
	shy828301, Liam.Howlett, peterz, mingo, Raghavendra K T

Add overhead calculation support.

Intended to be used only for experimental purposes.
Not to be merged.

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/linux/mm.h            |  3 +++
 include/linux/vm_event_item.h |  4 ++++
 kernel/sched/fair.c           | 13 ++++++++-----
 mm/huge_memory.c              |  1 +
 mm/kmmscand.c                 |  9 +++++++++
 mm/memory.c                   | 12 ++++++++----
 mm/vmstat.c                   |  4 ++++
 7 files changed, 37 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 306452c11d31..7380aab1fa62 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -600,6 +600,7 @@ struct vm_fault {
 					 * page table to avoid allocation from
 					 * atomic context.
 					 */
+	unsigned long start_time;
 };
 
 /*
@@ -690,6 +691,8 @@ void count_kmmscand_migrate_failed(void);
 void count_kmmscand_slowtier(void);
 void count_kmmscand_toptier(void);
 void count_kmmscand_idlepage(void);
+void count_kmmscand_scan_oh(long delta);
+void count_kmmscand_migration_oh(long delta);
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index b2ccd4f665aa..4c7eaea01f13 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -64,6 +64,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		NUMA_HINT_FAULTS,
 		NUMA_HINT_FAULTS_LOCAL,
 		NUMA_PAGE_MIGRATE,
+		NUMA_TASK_WORK_OH,
+		NUMA_HF_MIGRATION_OH,
 #endif
 #ifdef CONFIG_KMMSCAND
 		KMMSCAND_MM_SCANS,
@@ -74,6 +76,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		KMMSCAND_SLOWTIER,
 		KMMSCAND_TOPTIER,
 		KMMSCAND_IDLEPAGE,
+		KMMSCAND_SCAN_OH,
+		KMMSCAND_MIGRATION_OH,
 #endif
 #ifdef CONFIG_MIGRATION
 		PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fbdca89c677f..d205be30ae6c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3299,6 +3299,7 @@ static void task_numa_work(struct callback_head *work)
 	struct vma_iterator vmi;
 	bool vma_pids_skipped;
 	bool vma_pids_forced = false;
+	unsigned long old = jiffies;
 
 	SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
 
@@ -3312,7 +3313,7 @@ static void task_numa_work(struct callback_head *work)
 	 * work.
 	 */
 	if (p->flags & PF_EXITING)
-		return;
+		goto out1;
 
 	if (!mm->numa_next_scan) {
 		mm->numa_next_scan = now +
@@ -3324,7 +3325,7 @@ static void task_numa_work(struct callback_head *work)
 	 */
 	migrate = mm->numa_next_scan;
 	if (time_before(now, migrate))
-		return;
+		goto out1;
 
 	if (p->numa_scan_period == 0) {
 		p->numa_scan_period_max = task_scan_max(p);
@@ -3333,7 +3334,7 @@ static void task_numa_work(struct callback_head *work)
 
 	next_scan = now + msecs_to_jiffies(p->numa_scan_period);
 	if (!try_cmpxchg(&mm->numa_next_scan, &migrate, next_scan))
-		return;
+		goto out1;
 
 	/*
 	 * Delay this task enough that another task of this mm will likely win
@@ -3345,11 +3346,11 @@ static void task_numa_work(struct callback_head *work)
 	pages <<= 20 - PAGE_SHIFT; /* MB in pages */
 	virtpages = pages * 8;	   /* Scan up to this much virtual space */
 	if (!pages)
-		return;
+		goto out1;
 
 
 	if (!mmap_read_trylock(mm))
-		return;
+		goto out1;
 
 	/*
 	 * VMAs are skipped if the current PID has not trapped a fault within
@@ -3526,6 +3527,8 @@ static void task_numa_work(struct callback_head *work)
 		u64 diff = p->se.sum_exec_runtime - runtime;
 		p->node_stamp += 32 * diff;
 	}
+out1:
+	__count_vm_events(NUMA_TASK_WORK_OH, jiffies_to_usecs(jiffies - old));
 }
 
 void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ee335d96fc39..d948d1fbbffd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1250,6 +1250,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 		spin_unlock(vmf->ptl);
 	}
 
+	__count_vm_events(NUMA_HF_MIGRATION_OH, jiffies_to_usecs(jiffies - vmf->start_time));
 	return 0;
 unlock_release:
 	spin_unlock(vmf->ptl);
diff --git a/mm/kmmscand.c b/mm/kmmscand.c
index 70f588a210dd..bd2c65f38da2 100644
--- a/mm/kmmscand.c
+++ b/mm/kmmscand.c
@@ -644,8 +644,10 @@ static void kmmscand_cleanup_migration_list(struct mm_struct *mm)
 static void kmmscand_migrate_folio(void)
 {
 	int ret = 0;
+	unsigned long tstart, tend;
 	struct kmmscand_migrate_info *info, *tmp;
 
+	tstart = jiffies;
 	spin_lock(&kmmscand_migrate_lock);
 
 	if (!list_empty(&kmmscand_migrate_list.migrate_head)) {
@@ -691,6 +693,8 @@ static void kmmscand_migrate_folio(void)
 		}
 	}
 	spin_unlock(&kmmscand_migrate_lock);
+	tend = jiffies;
+	__count_vm_events(KMMSCAND_MIGRATION_OH, jiffies_to_usecs(tend - tstart));
 }
 
 /*
@@ -788,6 +792,8 @@ static unsigned long kmmscand_scan_mm_slot(void)
 
 	unsigned int mm_slot_scan_period;
 	unsigned long now;
+
+	unsigned long tstart, tend;
 	unsigned long mm_slot_next_scan;
 	unsigned long mm_slot_scan_size;
 	unsigned long scanned_size = 0;
@@ -800,6 +806,7 @@ static unsigned long kmmscand_scan_mm_slot(void)
 	struct vm_area_struct *vma = NULL;
 	struct kmmscand_mm_slot *mm_slot;
 
+	tstart  = jiffies;
 	/* Retrieve mm */
 	spin_lock(&kmmscand_mm_lock);
 
@@ -917,6 +924,8 @@ static unsigned long kmmscand_scan_mm_slot(void)
 	}
 
 	spin_unlock(&kmmscand_mm_lock);
+	tend = jiffies;
+	__count_vm_events(KMMSCAND_SCAN_OH, jiffies_to_usecs(tend - tstart));
 	return total;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 75c2dfd04f72..baea436124b0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5590,7 +5590,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 
 	if (unlikely(!pte_same(old_pte, vmf->orig_pte))) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
-		return 0;
+		goto out;
 	}
 
 	pte = pte_modify(old_pte, vma->vm_page_prot);
@@ -5629,17 +5629,18 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 		nid = target_nid;
 		flags |= TNF_MIGRATED;
 		task_numa_fault(last_cpupid, nid, nr_pages, flags);
-		return 0;
+		goto out;
 	}
 
 	flags |= TNF_MIGRATE_FAIL;
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
 				       vmf->address, &vmf->ptl);
 	if (unlikely(!vmf->pte))
-		return 0;
+		goto out;
+
 	if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
-		return 0;
+		goto out;
 	}
 out_map:
 	/*
@@ -5656,6 +5657,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 
 	if (nid != NUMA_NO_NODE)
 		task_numa_fault(last_cpupid, nid, nr_pages, flags);
+out:
+	__count_vm_events(NUMA_HF_MIGRATION_OH, jiffies_to_usecs(jiffies - vmf->start_time));
 	return 0;
 }
 
@@ -5858,6 +5861,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		.flags = flags,
 		.pgoff = linear_page_index(vma, address),
 		.gfp_mask = __get_fault_gfp_mask(vma),
+		.start_time = jiffies,
 	};
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long vm_flags = vma->vm_flags;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index d758e7155042..b7fe51342970 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1339,6 +1339,8 @@ const char * const vmstat_text[] = {
 	"numa_hint_faults",
 	"numa_hint_faults_local",
 	"numa_pages_migrated",
+	"numa_task_work_oh",
+	"numa_hf_migration_oh",
 #endif
 #ifdef CONFIG_KMMSCAND
 	"nr_kmmscand_mm_scans",
@@ -1349,6 +1351,8 @@ const char * const vmstat_text[] = {
 	"nr_kmmscand_slowtier",
 	"nr_kmmscand_toptier",
 	"nr_kmmscand_idlepage",
+	"kmmscand_scan_oh",
+	"kmmscand_migration_oh",
 #endif
 #ifdef CONFIG_MIGRATION
 	"pgmigrate_success",
-- 
2.39.3



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH V0 09/10] trace/kmmscand: Add tracing of scanning and migration
  2024-12-01 15:38 ` [RFC PATCH V0 09/10] trace/kmmscand: Add tracing of scanning and migration Raghavendra K T
@ 2024-12-05 17:46   ` Steven Rostedt
  2024-12-06  6:33     ` Raghavendra K T
  0 siblings, 1 reply; 18+ messages in thread
From: Steven Rostedt @ 2024-12-05 17:46 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: linux-mm, linux-kernel, gourry, nehagholkar, abhishekd, david,
	ying.huang, nphamcs, akpm, hannes, feng.tang, kbusch, bharata,
	Hasan.Maruf, sj, willy, kirill.shutemov, mgorman, vbabka, hughd,
	rientjes, shy828301, Liam.Howlett, peterz, mingo,
	Masami Hiramatsu, linux-trace-kernel

On Sun, 1 Dec 2024 15:38:17 +0000
Raghavendra K T <raghavendra.kt@amd.com> wrote:

> Add tracing support to track
>  - start and end of scanning.
>  - migration.
> 
> CC: Steven Rostedt <rostedt@goodmis.org>
> CC: Masami Hiramatsu <mhiramat@kernel.org>
> CC: linux-trace-kernel@vger.kernel.org
> 
> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
> ---
>  include/trace/events/kmem.h | 99 +++++++++++++++++++++++++++++++++++++
>  mm/kmmscand.c               | 12 ++++-
>  2 files changed, 110 insertions(+), 1 deletion(-)
> 
> diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
> index b37eb0a7060f..80978d85300d 100644
> --- a/include/trace/events/kmem.h
> +++ b/include/trace/events/kmem.h
> @@ -9,6 +9,105 @@
>  #include <linux/tracepoint.h>
>  #include <trace/events/mmflags.h>
>  
> +TRACE_EVENT(kmem_mm_enter,
> +
> +	TP_PROTO(struct task_struct *t,
> +		 struct mm_struct *mm),
> +
> +	TP_ARGS(t, mm),
> +
> +	TP_STRUCT__entry(
> +		__array(	char, comm, TASK_COMM_LEN	)

Is there a reason to record "comm"? There's other ways to retrieve it than
to always write it to the ring buffer.

> +		__field(	struct mm_struct *, mm		)
> +	),
> +
> +	TP_fast_assign(
> +		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
> +		__entry->mm = mm;
> +	),
> +
> +	TP_printk("kmmscand: mm_enter comm =%s mm=%p", __entry->comm, __entry->mm)
> +);
> +
> +TRACE_EVENT(kmem_scan_mm_start,
> +
> +	TP_PROTO(struct task_struct *t,
> +		 struct mm_struct *mm),
> +
> +	TP_ARGS(t, mm),
> +
> +	TP_STRUCT__entry(
> +		__array(	char, comm, TASK_COMM_LEN	)
> +		__field(	struct mm_struct *, mm		)
> +	),
> +
> +	TP_fast_assign(
> +		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
> +		__entry->mm = mm;
> +	),
> +
> +	TP_printk("kmmscand: scan_mm_start comm =%s mm=%p", __entry->comm, __entry->mm)

No need to write the event name into the TP_printk(). That's redundant.

Also, the above two events are pretty much identical. Please use
DECLARE_EVENT_CLASS().
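
Something like the below should do (untested sketch; "kmem_mm_class" is
just an illustrative name, comm is dropped as per the comment above, and
t is kept in the prototype only so that the call sites do not change):

	DECLARE_EVENT_CLASS(kmem_mm_class,

		TP_PROTO(struct task_struct *t, struct mm_struct *mm),

		TP_ARGS(t, mm),

		TP_STRUCT__entry(
			__field(	struct mm_struct *, mm	)
		),

		TP_fast_assign(
			__entry->mm = mm;
		),

		TP_printk("mm=%p", __entry->mm)
	);

	DEFINE_EVENT(kmem_mm_class, kmem_mm_enter,
		TP_PROTO(struct task_struct *t, struct mm_struct *mm),
		TP_ARGS(t, mm)
	);

	DEFINE_EVENT(kmem_mm_class, kmem_scan_mm_start,
		TP_PROTO(struct task_struct *t, struct mm_struct *mm),
		TP_ARGS(t, mm)
	);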

> +);
> +
> +TRACE_EVENT(kmem_scan_mm_end,
> +
> +	TP_PROTO(struct task_struct *t,
> +		 struct mm_struct *mm,
> +		 unsigned long start,
> +		 unsigned long total,
> +		 unsigned long scan_size,
> +		 unsigned long scan_period),
> +
> +	TP_ARGS(t, mm, start, total, scan_period, scan_size),
> +
> +	TP_STRUCT__entry(
> +		__array(	char, comm, TASK_COMM_LEN	)

Again, why comm?

> +		__field(	struct mm_struct *, mm		)
> +		__field(	unsigned long,   start		)
> +		__field(	unsigned long,   total		)
> +		__field(	unsigned long,   scan_period	)
> +		__field(	unsigned long,   scan_size	)
> +	),
> +
> +	TP_fast_assign(
> +		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
> +		__entry->mm = mm;
> +		__entry->start = start;
> +		__entry->total = total;
> +		__entry->scan_period = scan_period;
> +		__entry->scan_size = scan_size;
> +	),
> +
> +	TP_printk("kmmscand: scan_mm_end comm =%s mm=%p, start = %ld, total = %ld, scan_period = %ld, scan_size = %ld",
> +			__entry->comm, __entry->mm, __entry->start,
> +			__entry->total, __entry->scan_period, __entry->scan_size)
> +);
> +
> +TRACE_EVENT(kmem_scan_mm_migrate,
> +
> +	TP_PROTO(struct task_struct *t,
> +		 struct mm_struct *mm,
> +		 int rc),
> +
> +	TP_ARGS(t, mm, rc),
> +
> +	TP_STRUCT__entry(
> +		__array(	char, comm, TASK_COMM_LEN	)
> +		__field(	struct mm_struct *, mm		)
> +		__field(	int,   rc			)
> +	),
> +
> +	TP_fast_assign(
> +		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
> +		__entry->mm = mm;
> +		__entry->rc = rc;
> +	),
> +
> +	TP_printk("kmmscand: scan_mm_migrate comm =%s mm=%p rc=%d", __entry->comm,
> +			__entry->mm, __entry->rc)
> +);
> +
> +
>  TRACE_EVENT(kmem_cache_alloc,
>  
>  	TP_PROTO(unsigned long call_site,
> diff --git a/mm/kmmscand.c b/mm/kmmscand.c
> index 682c0523c0b4..70f588a210dd 100644
> --- a/mm/kmmscand.c
> +++ b/mm/kmmscand.c
> @@ -668,8 +668,10 @@ static void kmmscand_migrate_folio(void)
>  			WRITE_ONCE(kmmscand_cur_migrate_mm, info->mm);
>  			spin_unlock(&kmmscand_migrate_lock);
>  
> -			if (info->mm)
> +			if (info->mm) {
>  				ret = kmmscand_promote_folio(info);
> +				trace_kmem_scan_mm_migrate(info->mm->owner, info->mm, ret);
> +			}
>  
>  			/* TBD: encode migrated count here, currently assume folio_nr_pages */
>  			if (!ret)
> @@ -828,6 +830,9 @@ static unsigned long kmmscand_scan_mm_slot(void)
>  		goto outerloop;
>  	}
>  
> +	if (mm->owner)
> +		trace_kmem_scan_mm_start(mm->owner, mm);
> +
>  	now = jiffies;
>  	/*
>  	 * Dont scan if :
> @@ -868,6 +873,10 @@ static unsigned long kmmscand_scan_mm_slot(void)
>  
>  	update_mmslot_info = true;
>  
> +	if (mm->owner)
> +		trace_kmem_scan_mm_end(mm->owner, mm, address, total,
> +					mm_slot_scan_period, mm_slot_scan_size);

Please do not add a condition that is used just for calling a tracepoint.
That takes away the "nop" of the function. You can either use
TRACE_EVENT_CONDITION() or DEFINE_EVENT_CONDITION(), or you can hard code
it here:

	if (trace_kmem_scan_mm_end_enabled()) {
		if (mm->owner)
			trace_kmem_scan_mm_end(mm->owner, mm, address, total,
						mm_slot_scan_period, mm_slot_scan_size);
	}

But since it is a single condition, I would prefer the *_CONDITION() macros
above.
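
That is, something along these lines (untested, with comm dropped as per
the comment above; note also that the posted event lists scan_size and
scan_period in one order in TP_PROTO() and in the reverse order in
TP_ARGS(), the sketch below keeps a single order matching the caller):

	TRACE_EVENT_CONDITION(kmem_scan_mm_end,

		TP_PROTO(struct task_struct *t,
			 struct mm_struct *mm,
			 unsigned long start,
			 unsigned long total,
			 unsigned long scan_period,
			 unsigned long scan_size),

		TP_ARGS(t, mm, start, total, scan_period, scan_size),

		TP_CONDITION(t),

		TP_STRUCT__entry(
			__field(	struct mm_struct *, mm		)
			__field(	unsigned long,	start		)
			__field(	unsigned long,	total		)
			__field(	unsigned long,	scan_period	)
			__field(	unsigned long,	scan_size	)
		),

		TP_fast_assign(
			__entry->mm = mm;
			__entry->start = start;
			__entry->total = total;
			__entry->scan_period = scan_period;
			__entry->scan_size = scan_size;
		),

		TP_printk("mm=%p start=%lu total=%lu scan_period=%lu scan_size=%lu",
			__entry->mm, __entry->start, __entry->total,
			__entry->scan_period, __entry->scan_size)
	);

and the call site then becomes just:

		trace_kmem_scan_mm_end(mm->owner, mm, address, total,
					mm_slot_scan_period, mm_slot_scan_size);

with the condition evaluated only when the event is enabled.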

-- Steve

> +
>  	count_kmmscand_mm_scans();
>  
>  outerloop:
> @@ -1020,6 +1029,7 @@ void __kmmscand_enter(struct mm_struct *mm)
>  	spin_unlock(&kmmscand_mm_lock);
>  
>  	mmgrab(mm);
> +	trace_kmem_mm_enter(mm->owner, mm);
>  	if (wakeup)
>  		wake_up_interruptible(&kmmscand_wait);
>  }



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH V0 09/10] trace/kmmscand: Add tracing of scanning and migration
  2024-12-05 17:46   ` Steven Rostedt
@ 2024-12-06  6:33     ` Raghavendra K T
  2024-12-06 14:49       ` Steven Rostedt
  0 siblings, 1 reply; 18+ messages in thread
From: Raghavendra K T @ 2024-12-06  6:33 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-mm, linux-kernel, gourry, nehagholkar, abhishekd, david,
	ying.huang, nphamcs, akpm, hannes, feng.tang, kbusch, bharata,
	Hasan.Maruf, sj, willy, kirill.shutemov, mgorman, vbabka, hughd,
	rientjes, shy828301, Liam.Howlett, peterz, mingo,
	Masami Hiramatsu, linux-trace-kernel



On 12/5/2024 11:16 PM, Steven Rostedt wrote:
> On Sun, 1 Dec 2024 15:38:17 +0000
> Raghavendra K T <raghavendra.kt@amd.com> wrote:
> 
>> Add tracing support to track
>>   - start and end of scanning.
>>   - migration.
>>
>> CC: Steven Rostedt <rostedt@goodmis.org>
>> CC: Masami Hiramatsu <mhiramat@kernel.org>
>> CC: linux-trace-kernel@vger.kernel.org
>>

[...]

>> +
>> +	TP_STRUCT__entry(
>> +		__array(	char, comm, TASK_COMM_LEN	)
> 
> Is there a reason to record "comm"? There's other ways to retrieve it than
> to always write it to the ring buffer.
> 

Thank you for the review, Steve. The motivation was to filter the benchmark
in the trace to understand the behavior.
I will explore other ways of retrieving comm
(or maybe even the PID is enough).

[...]

>> +
>> +	TP_printk("kmmscand: scan_mm_start comm =%s mm=%p", __entry->comm, __entry->mm)
> 
> No need to write the event name into the TP_printk(). That's redundant.
> 
> Also, the above two events are pretty much identical. Please use
> DECLARE_EVENT_CLASS().

Sure, will do.


>> +
>> +	TP_STRUCT__entry(
>> +		__array(	char, comm, TASK_COMM_LEN	)
> 
> Again, why comm?
> 

Will do same change here too.

[...]

>> +	if (mm->owner)
>> +		trace_kmem_scan_mm_end(mm->owner, mm, address, total,
>> +					mm_slot_scan_period, mm_slot_scan_size);
> 
> Please do not add a condition that is used just for calling a tracepoint.
> That takes away the "nop" of the function. You can either use
> TRACE_EVENT_CONDITION() or DEFINE_EVENT_CONDITION(), or you can hard code
> it here:
> 
> 	if (trace_kmem_scan_mm_end_enabled()) {
> 		if (mm->owner)
> 			trace_kmem_scan_mm_end(mm->owner, mm, address, total,
> 						mm_slot_scan_period, mm_slot_scan_size);
> 	}
> 
> But since it is a single condition, I would prefer the *_CONDITION() macros
> above.
> 

Very helpful suggestion.

Thanks again. I will keep these points in mind for the next version.

- Raghu




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH V0 09/10] trace/kmmscand: Add tracing of scanning and migration
  2024-12-06  6:33     ` Raghavendra K T
@ 2024-12-06 14:49       ` Steven Rostedt
  0 siblings, 0 replies; 18+ messages in thread
From: Steven Rostedt @ 2024-12-06 14:49 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: linux-mm, linux-kernel, gourry, nehagholkar, abhishekd, david,
	ying.huang, nphamcs, akpm, hannes, feng.tang, kbusch, bharata,
	Hasan.Maruf, sj, willy, kirill.shutemov, mgorman, vbabka, hughd,
	rientjes, shy828301, Liam.Howlett, peterz, mingo,
	Masami Hiramatsu, linux-trace-kernel

On Fri, 6 Dec 2024 12:03:29 +0530
Raghavendra K T <raghavendra.kt@amd.com> wrote:

> On 12/5/2024 11:16 PM, Steven Rostedt wrote:
> > On Sun, 1 Dec 2024 15:38:17 +0000
> > Raghavendra K T <raghavendra.kt@amd.com> wrote:
> >   
> >> Add tracing support to track
> >>   - start and end of scanning.
> >>   - migration.
> >>
> >> CC: Steven Rostedt <rostedt@goodmis.org>
> >> CC: Masami Hiramatsu <mhiramat@kernel.org>
> >> CC: linux-trace-kernel@vger.kernel.org
> >>  
> 
> [...]
> 
> >> +
> >> +	TP_STRUCT__entry(
> >> +		__array(	char, comm, TASK_COMM_LEN	)  
> > 
> > Is there a reason to record "comm"? There's other ways to retrieve it than
> > to always write it to the ring buffer.
> >   
> 
> Thank you for the review, Steve. The motivation was to filter the benchmark
> in the trace to understand the behavior.
> I will explore other ways of retrieving comm
> (or maybe even the PID is enough).

You can filter on current comm for any event with trace-cmd and even with the
"filter" file. It doesn't need to be part of the event.

For the filter file:

 # echo "COMM == rcu_preempt" > /sys/kernel/tracing/events/timer/hrtimer_cancel/filter

or with trace-cmd

 # trace-cmd start -e hrtimer_cancel -f 'COMM == "rcu_preempt"'
 # trace-cmd show
# tracer: nop
#
# entries-in-buffer/entries-written: 10/10   #P:8
#
#                                _-----=> irqs-off/BH-disabled
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
     rcu_preempt-18      [001] d..3. 54968.170887: hrtimer_cancel: hrtimer=00000000456b5702
     rcu_preempt-18      [001] d..3. 54968.177704: hrtimer_cancel: hrtimer=00000000456b5702
     rcu_preempt-18      [001] d..3. 54968.181678: hrtimer_cancel: hrtimer=00000000456b5702
     rcu_preempt-18      [001] d..3. 54968.185679: hrtimer_cancel: hrtimer=00000000456b5702
     rcu_preempt-18      [001] d..3. 54968.186092: hrtimer_cancel: hrtimer=00000000456b5702
     rcu_preempt-18      [001] d..3. 54968.193676: hrtimer_cancel: hrtimer=00000000456b5702
     rcu_preempt-18      [001] d..3. 54968.193686: hrtimer_cancel: hrtimer=00000000456b5702
     rcu_preempt-18      [001] d..3. 54972.871315: hrtimer_cancel: hrtimer=00000000456b5702
     rcu_preempt-18      [001] d..3. 54972.875176: hrtimer_cancel: hrtimer=00000000456b5702
     rcu_preempt-18      [001] d..3. 54972.881751: hrtimer_cancel: hrtimer=00000000456b5702

Or you can do it after the fact from a trace.dat file:

  # trace-cmd record -e hrtimer_cancel sleep 10
  # trace-cmd report | head
cpus=8
           sleep-1641  [006] d.h2. 55109.598846: hrtimer_cancel:       hrtimer=0xffff9800fdfa1888
           sleep-1641  [006] d..3. 55109.599089: hrtimer_cancel:       hrtimer=0xffff9800fdfb3140
          <idle>-0     [006] d..2. 55109.599111: hrtimer_cancel:       hrtimer=0xffff9800fdfa1888
          <idle>-0     [006] d.h7. 55109.603848: hrtimer_cancel:       hrtimer=0xffff9800fdfb3180
          <idle>-0     [006] dN.2. 55109.603895: hrtimer_cancel:       hrtimer=0xffff9800fdfa1888
          <idle>-0     [000] d.h3. 55109.604478: hrtimer_cancel:       hrtimer=0xffff9800fde33180
          <idle>-0     [000] dN.2. 55109.604492: hrtimer_cancel:       hrtimer=0xffff9800fde21888
     rcu_preempt-18    [000] d..3. 55109.604549: hrtimer_cancel:       hrtimer=0xffff9800fde33140
          <idle>-0     [000] d..2. 55109.604573: hrtimer_cancel:       hrtimer=0xffff9800fde21888

  # trace-cmd report -F '.*:COMM == "rcu_preempt"'
cpus=8
     rcu_preempt-18    [000] d..3. 55109.604549: hrtimer_cancel:       hrtimer=0xffff9800fde33140
     rcu_preempt-18    [000] d..3. 55109.609320: hrtimer_cancel:       hrtimer=0xffff9800fde33140
     rcu_preempt-18    [000] d..3. 55109.613350: hrtimer_cancel:       hrtimer=0xffff9800fde33140
     rcu_preempt-18    [000] d..3. 55119.609772: hrtimer_cancel:       hrtimer=0xffff9800fde33140

-- Steve


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit
  2024-12-01 15:38 [RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit Raghavendra K T
                   ` (9 preceding siblings ...)
  2024-12-01 15:38 ` [RFC PATCH V0 DO NOT MERGE 10/10] kmmscand: Add scanning Raghavendra K T
@ 2024-12-10 18:53 ` SeongJae Park
  2024-12-20  6:30   ` Raghavendra K T
  2025-02-12 17:02 ` Davidlohr Bueso
  11 siblings, 1 reply; 18+ messages in thread
From: SeongJae Park @ 2024-12-10 18:53 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: SeongJae Park, linux-mm, linux-kernel, gourry, nehagholkar,
	abhishekd, david, ying.huang, nphamcs, akpm, hannes, feng.tang,
	kbusch, bharata, Hasan.Maruf, willy, kirill.shutemov, mgorman,
	vbabka, hughd, rientjes, shy828301, Liam.Howlett, peterz, mingo

Hello Raghavendra,


Thank you for posting this nice patch series.  I gave you some feedback
offline.  Adding those here again for transparency on this grateful public
discussion.

On Sun, 1 Dec 2024 15:38:08 +0000 Raghavendra K T <raghavendra.kt@amd.com> wrote:

> Introduction:
> =============
> This patchset is an outcome of an ongoing collaboration between AMD and Meta.
> Meta wanted to explore an alternative page promotion technique as they
> observe high latency spikes in their workloads that access CXL memory.
> 
> In the current hot page promotion, all the activities including the
> process address space scanning, NUMA hint fault handling and page
> migration is performed in the process context. i.e., scanning overhead is
> borne by applications.

Yet another approach is using DAMON.  DAMON does access monitoring, and further
allows users to request access pattern-driven system operations in name of
DAMOS (Data Access Monitoring-based Operation Schemes).  Using it, users can
request DAMON to find hot pages and promote, while finding cold pages and
demote.  SK hynix has made their CXL-based memory capacity expansion solution
in the way (https://github.com/skhynix/hmsdk/wiki/Capacity-Expansion).  We
collaboratively developed new DAMON features for that, and those are all
in the mainline since Linux v6.11.

I also proposed an idea for advancing it using DAMOS auto-tuning on more
general (>2 tiers) setup
(https://lore.kernel.org/20231112195602.61525-1-sj@kernel.org).  I haven't had a
time to further implement and test the idea so far, though.

> 
> This is an early RFC patch series to do (slow tier) CXL page promotion.
> The approach in this patchset assists/addresses the issue by adding PTE
> Accessed bit scanning.
> 
> Scanning is done by a global kernel thread which routinely scans all
> the processes' address spaces and checks for accesses by reading the
> PTE A bit. It then migrates/promotes the pages to the toptier node
> (node 0 in the current approach).
> 
> Thus, the approach pushes overhead of scanning, NUMA hint faults and
> migrations off from process context.

DAMON also uses PTE A bit as major source of the access information.  And DAMON
does both access monitoring and promotion/demotion in a global kernel thread,
namely kdamond.  Hence the DAMON-based approach would also offload the
overheads from process context.  So I feel your approach has a sort of
similarity with DAMON-based one in a way, and we might have a chance to avoid
unnecessary duplicates.

[...]
> 
> Limitations:
> ===========
> PTE A bit scanning approach lacks information about exact destination
> node to migrate to.

This is same for DAMON-based approach, since DAMON also uses PTE A bit as the
major source of the information.  We aim to extend DAMON to be aware of the access
source CPU, and use it for solving this problem, though.  Utilizing page faults
or AMD IBS-like h/w features are on the table of the ideas.

> 
> Notes/Observations on design/Implementations/Alternatives/TODOs...
> ================================
> 1. Fine-tuning scan throttling

DAMON allows users to set the upper-limit of monitoring overhead, using
max_nr_regions parameter.  Then it provides its best-effort accuracy.  We also
have ongoing projects for making it more accurate and easier to tune.

> 
> 2. Use migrate_balanced_pgdat() to balance toptier node before migration
>  OR Use migrate_misplaced_folio_prepare() directly.
>  But it may need some optimizations (for e.g., invoke occasionaly so
> that overhead is not there for every migration).
> 
> 3. Explore if a separate PAGE_EXT flag is needed instead of reusing
> PAGE_IDLE flag (cons: complicates PTE A bit handling in the system),
> But practically does not look good idea.
> 
> 4. Use timestamp information-based migration (Similar to numab mode=2).
> instead of migrating immediately when PTE A bit set.
> (cons:
>  - It will not be accurate since it is done outside of process
> context.
>  - Performance benefit may be lost.)

DAMON provides a sort of time-based aggregated monitoring results.  And DAMOS
provides prioritization of pages based on the access temperature.  Hence,
DAMON-based apparoach can also be used for a similar purpose (promoting not
every accessed pages but pages that more frequently used for longer time).

> 
> 5. Explore if we need to use PFN information + hash list instead of
> simple migration list. Here scanning is directly done with PFN belonging
> to CXL node.

DAMON supports physical address space monitoring, and maintains the access
monitoring results in its own data structure called damon_region.  So I think
similar benefit can be achieved using DAMON?

[...]
> 8. Using DAMON APIs OR Reusing part of DAMON which already tracks range of
> physical addresses accessed.

My biased humble opinion is that it would be very nice to explore this
opportunity, since I show some similarities and opportunities to solve some of
challenges on your approach in an easier way.  Even if it turns out that DAMON
cannot be used for your use case, failing earlier is a good thing, I'd say :)

> 
> 9. Gregory has nicely mentioned some details/ideas on different approaches in
> [1] : development notes, in the context of promoting unmapped page cache folios.

DAMON supports monitoring accesses to unmapped page cache folios, so hopefully
DAMON-based approaches can also solve this issue.

> 
> 10. SJ had pointed about concerns about kernel-thread based approaches as in
> kstaled [2]. So current patchset has tried to address the issue with simple
> algorithms to reduce CPU overhead. Migration throttling, Running the daemon
> in NICE priority, Parallelizing migration with scanning could help further.
> 
> 11. Toptier pages scanned can be used to assist current NUMAB by providing information
> on hot VMAs.
> 
> Credits
> =======
> Thanks to Bharata, Joannes, Gregory, SJ, Chris for their valuable comments and
> support.

I also learned many things from the great discussions, thank you :)

[...]
> 
> Links:
> [1] https://lore.kernel.org/lkml/20241127082201.1276-1-gourry@gourry.net/
> [2] kstaled: https://lore.kernel.org/lkml/1317170947-17074-3-git-send-email-walken@google.com/#r
> [3] https://lore.kernel.org/lkml/Y+Pj+9bbBbHpf6xM@hirez.programming.kicks-ass.net/
> 
> I might have CCed more people or less people than needed
> unintentionally.


Thanks,
SJ

[...]


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit
  2024-12-10 18:53 ` [RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit SeongJae Park
@ 2024-12-20  6:30   ` Raghavendra K T
  0 siblings, 0 replies; 18+ messages in thread
From: Raghavendra K T @ 2024-12-20  6:30 UTC (permalink / raw)
  To: SeongJae Park
  Cc: linux-mm, linux-kernel, gourry, nehagholkar, abhishekd, david,
	ying.huang, nphamcs, akpm, hannes, feng.tang, kbusch, bharata,
	Hasan.Maruf, willy, kirill.shutemov, mgorman, vbabka, hughd,
	rientjes, shy828301, Liam.Howlett, peterz, mingo



On 12/11/2024 12:23 AM, SeongJae Park wrote:
> Hello Raghavendra,
> 
> 
> Thank you for posting this nice patch series.  I gave you some feedback
> offline.  Adding those here again for transparency on this grateful public
> discussion.
> 
> On Sun, 1 Dec 2024 15:38:08 +0000 Raghavendra K T <raghavendra.kt@amd.com> wrote:
> 
>> Introduction:
>> =============
>> This patchset is an outcome of an ongoing collaboration between AMD and Meta.
>> Meta wanted to explore an alternative page promotion technique as they
>> observe high latency spikes in their workloads that access CXL memory.
>>
>> In the current hot page promotion, all the activities including the
>> process address space scanning, NUMA hint fault handling and page
>> migration is performed in the process context. i.e., scanning overhead is
>> borne by applications.
> 
> Yet another approach is using DAMON.  DAMON does access monitoring, and further
> allows users to request access pattern-driven system operations in name of
> DAMOS (Data Access Monitoring-based Operation Schemes).  Using it, users can
> request DAMON to find hot pages and promote, while finding cold pages and
> demote.  SK hynix has made their CXL-based memory capacity expansion solution
> in the way (https://github.com/skhynix/hmsdk/wiki/Capacity-Expansion).  We
> collaboratively developed new DAMON features for that, and those are all
> in the mainline since Linux v6.11.
> 
> I also proposed an idea for advancing it using DAMOS auto-tuning on more
> general (>2 tiers) setup
> (https://lore.kernel.org/20231112195602.61525-1-sj@kernel.org).  I haven't had a
> time to further implement and test the idea so far, though.
> 
>>
>> This is an early RFC patch series to do (slow tier) CXL page promotion.
>> The approach in this patchset assists/addresses the issue by adding PTE
>> Accessed bit scanning.
>>
>> Scanning is done by a global kernel thread which routinely scans all
>> the processes' address spaces and checks for accesses by reading the
>> PTE A bit. It then migrates/promotes the pages to the toptier node
>> (node 0 in the current approach).
>>
>> Thus, the approach pushes overhead of scanning, NUMA hint faults and
>> migrations off from process context.
> 
> DAMON also uses PTE A bit as major source of the access information.  And DAMON
> does both access monitoring and promotion/demotion in a global kernel thread,
> namely kdamond.  Hence the DAMON-based approach would also offload the
> overheads from process context.  So I feel your approach has a sort of
> similarity with DAMON-based one in a way, and we might have a chance to avoid
> unnecessary duplicates.
> 
> [...]
>>
>> Limitations:
>> ===========
>> PTE A bit scanning approach lacks information about exact destination
>> node to migrate to.
> 
> This is same for DAMON-based approach, since DAMON also uses PTE A bit as the
> major source of the information.  We aim to extend DAMON to be aware of the access
> source CPU, and use it for solving this problem, though.  Utilizing page faults
> or AMD IBS-like h/w features are on the table of the ideas.
> 
>>
>> Notes/Observations on design/Implementations/Alternatives/TODOs...
>> ================================
>> 1. Fine-tuning scan throttling
> 
> DAMON allows users to set the upper-limit of monitoring overhead, using
> max_nr_regions parameter.  Then it provides its best-effort accuracy.  We also
> have ongoing projects for making it more accurate and easier to tune.
> 
>>
>> 2. Use migrate_balanced_pgdat() to balance toptier node before migration
>>   OR Use migrate_misplaced_folio_prepare() directly.
>>   But it may need some optimizations (for e.g., invoke occasionaly so
>> that overhead is not there for every migration).
>>
>> 3. Explore if a separate PAGE_EXT flag is needed instead of reusing
>> PAGE_IDLE flag (cons: complicates PTE A bit handling in the system),
>> But practically does not look good idea.
>>
>> 4. Use timestamp information-based migration (Similar to numab mode=2).
>> instead of migrating immediately when PTE A bit set.
>> (cons:
>>   - It will not be accurate since it is done outside of process
>> context.
>>   - Performance benefit may be lost.)
> 
> DAMON provides a sort of time-based aggregated monitoring results.  And DAMOS
> provides prioritization of pages based on the access temperature.  Hence,
> DAMON-based apparoach can also be used for a similar purpose (promoting not
> every accessed pages but pages that more frequently used for longer time).
> 
>>
>> 5. Explore if we need to use PFN information + hash list instead of
>> simple migration list. Here scanning is directly done with PFN belonging
>> to CXL node.
> 
> DAMON supports physical address space monitoring, and maintains the access
> monitoring results in its own data structure called damon_region.  So I think
> similar benefit can be achieved using DAMON?
> 
> [...]
>> 8. Using DAMON APIs OR Reusing part of DAMON which already tracks range of
>> physical addresses accessed.
> 
> My biased humble opinion is that it would be very nice to explore this
> opportunity, since I show some similarities and opportunities to solve some of
> challenges on your approach in an easier way.  Even if it turns out that DAMON
> cannot be used for your use case, failing earlier is a good thing, I'd say :)
> 
>>
>> 9. Gregory has nicely mentioned some details/ideas on different approaches in
>> [1] : development notes, in the context of promoting unmapped page cache folios.
> 
> DAMON supports monitoring accesses to unmapped page cache folios, so hopefully
> DAMON-based approaches can also solve this issue.
> 

Hello SJ,

Thank you for the detailed explanation again. (Sorry for the late
acknowledgement; I was looking forward to the MM alignment discussion when
this message came.)

I think once the direction is fixed, we could surely use/reuse a lot of
source code from DAMON and MGLRU. The amazing design of DAMON should surely
help. I will keep in mind all the points raised here.

Thanks and Regards
- Raghu


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit
  2024-12-01 15:38 [RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit Raghavendra K T
                   ` (10 preceding siblings ...)
  2024-12-10 18:53 ` [RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit SeongJae Park
@ 2025-02-12 17:02 ` Davidlohr Bueso
  2025-02-13  5:39   ` Raghavendra K T
  11 siblings, 1 reply; 18+ messages in thread
From: Davidlohr Bueso @ 2025-02-12 17:02 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: linux-mm, linux-kernel, gourry, nehagholkar, abhishekd, david,
	ying.huang, nphamcs, akpm, hannes, feng.tang, kbusch, bharata,
	Hasan.Maruf, sj, willy, kirill.shutemov, mgorman, vbabka, hughd,
	rientjes, shy828301, Liam.Howlett, peterz, mingo

On Sun, 01 Dec 2024, Raghavendra K T wrote:

>6. Holding PTE lock before migration.

fyi I tried testing this series with 'perf-bench numa mem' and got a soft lockup,
unable to take the PTL (and lost the machine to debug further atm), ie:

[ 3852.217675] CPU: 127 UID: 0 PID: 12537 Comm: watch-numa-sche Tainted: G      D      L     6.14.0-rc2-kmmscand-v1+ #3
[ 3852.217677] Tainted: [D]=DIE, [L]=SOFTLOCKUP
[ 3852.217678] RIP: 0010:native_queued_spin_lock_slowpath+0x64/0x290
[ 3852.217683] Code: 77 7b f0 0f ba 2b 08 0f 92 c2 8b 03 0f b6 d2 c1 e2 08 30 e4 09 d0 3d ff 00 00 00 77 57 85 c0 74 10 0f b6 03 84 c0 74 09 f3 90 <0f> b6 03 84 c0 75 f7 b8 01 00 00 00 66 89 03 5b 5d 41 5c 41 5d c3
[ 3852.217684] RSP: 0018:ff274259b3c9f988 EFLAGS: 00000202
[ 3852.217685] RAX: 0000000000000001 RBX: ffbd2efd8c08c9a8 RCX: 000ffffffffff000
[ 3852.217686] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffbd2efd8c08c9a8
[ 3852.217687] RBP: ff161328422c1328 R08: ff274259b3c9fb90 R09: ff161328422c1000
[ 3852.217688] R10: 00000000ffffffff R11: 0000000000000004 R12: 00007f52cca00000
[ 3852.217688] R13: ff274259b3c9fa00 R14: ff16132842326000 R15: ff161328422c1328
[ 3852.217689] FS:  00007f32b6f92b80(0000) GS:ff161423bfd80000(0000) knlGS:0000000000000000
[ 3852.217691] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3852.217692] CR2: 0000564ddbf68008 CR3: 00000080a81cc005 CR4: 0000000000773ef0
[ 3852.217693] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3852.217694] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 3852.217694] PKRU: 55555554
[ 3852.217695] Call Trace:
[ 3852.217696]  <IRQ>
[ 3852.217697]  ? watchdog_timer_fn+0x21b/0x2a0
[ 3852.217699]  ? __pfx_watchdog_timer_fn+0x10/0x10
[ 3852.217702]  ? __hrtimer_run_queues+0x10f/0x2a0
[ 3852.217704]  ? hrtimer_interrupt+0xfb/0x240
[ 3852.217706]  ? __sysvec_apic_timer_interrupt+0x4e/0x110
[ 3852.217709]  ? sysvec_apic_timer_interrupt+0x68/0x90
[ 3852.217712]  </IRQ>
[ 3852.217712]  <TASK>
[ 3852.217713]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 3852.217717]  ? native_queued_spin_lock_slowpath+0x64/0x290
[ 3852.217720]  _raw_spin_lock+0x25/0x30
[ 3852.217723]  __pte_offset_map_lock+0x9a/0x110
[ 3852.217726]  gather_pte_stats+0x1e3/0x2c0
[ 3852.217730]  walk_pgd_range+0x528/0xbb0
[ 3852.217733]  __walk_page_range+0x71/0x1d0
[ 3852.217736]  walk_page_vma+0x98/0xf0
[ 3852.217738]  show_numa_map+0x11a/0x3a0
[ 3852.217741]  seq_read_iter+0x2a6/0x470
[ 3852.217745]  seq_read+0x12b/0x170
[ 3852.217748]  vfs_read+0xe0/0x370
[ 3852.217751]  ? syscall_exit_to_user_mode+0x49/0x210
[ 3852.217755]  ? do_syscall_64+0x8a/0x190
[ 3852.217758]  ksys_read+0x6a/0xe0
[ 3852.217762]  do_syscall_64+0x7e/0x190
[ 3852.217765]  ? __memcg_slab_free_hook+0xd4/0x120
[ 3852.217768]  ? __x64_sys_close+0x38/0x80
[ 3852.217771]  ? kmem_cache_free+0x3bf/0x3e0
[ 3852.217774]  ? syscall_exit_to_user_mode+0x49/0x210
[ 3852.217777]  ? do_syscall_64+0x8a/0x190
[ 3852.217780]  ? do_syscall_64+0x8a/0x190
[ 3852.217783]  ? __irq_exit_rcu+0x3e/0xe0
[ 3852.217785]  entry_SYSCALL_64_after_hwframe+0x76/0x7e


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit
  2025-02-12 17:02 ` Davidlohr Bueso
@ 2025-02-13  5:39   ` Raghavendra K T
  0 siblings, 0 replies; 18+ messages in thread
From: Raghavendra K T @ 2025-02-13  5:39 UTC (permalink / raw)
  To: linux-mm, linux-kernel, gourry, nehagholkar, abhishekd, david,
	ying.huang, nphamcs, akpm, hannes, feng.tang, kbusch, bharata,
	Hasan.Maruf, sj, willy, kirill.shutemov, mgorman, vbabka, hughd,
	rientjes, shy828301, Liam.Howlett, peterz, mingo



On 2/12/2025 10:32 PM, Davidlohr Bueso wrote:
> On Sun, 01 Dec 2024, Raghavendra K T wrote:
> 
>> 6. Holding PTE lock before migration.
> 
> fyi I tried testing this series with 'perf-bench numa mem' and got a soft lockup,
> unable to take the PTL (and lost the machine to debug further atm), ie:
> 
> [ ... soft lockup splat trimmed; see the full trace in the report above ... ]


Hello David,

Thanks for reporting this and for the details. The reproducer
information helps me stabilize the code quickly. The micro-benchmark I
used did not show any issues. I will take the PTE lock (PTL) before
migration and also investigate the issue from my side.

(With multiple scanning threads this could cause even more issues
because of the higher migration pressure, so I am wondering whether I
should go with the more stable single-threaded scanning version in the
coming post.)
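
Something along the following lines is what I have in mind for the scan
side. This is a rough, untested sketch only: kmmscand_scan_pmd() and
kmmscand_record_candidate() are illustrative names, not the code in the
posted series.

#include <linux/mm.h>

/* Placeholder: queue the page behind @ptent for later promotion. */
static void kmmscand_record_candidate(struct vm_area_struct *vma,
                                      unsigned long addr, pte_t ptent);

/*
 * Scan one PTE range with the page-table lock held, so that concurrent
 * walkers such as show_numa_map() are not starved while the Accessed
 * bit is read.
 */
static void kmmscand_scan_pmd(struct mm_struct *mm,
                              struct vm_area_struct *vma, pmd_t *pmd,
                              unsigned long addr, unsigned long end)
{
        pte_t *start_pte, *pte;
        spinlock_t *ptl;

        start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        if (!pte)
                return;         /* raced with zap/collapse of this table, skip */

        for (; addr < end; pte++, addr += PAGE_SIZE) {
                pte_t ptent = ptep_get(pte);

                if (!pte_present(ptent))
                        continue;

                /* Accessed bit set: remember the page as a promotion candidate. */
                if (pte_young(ptent))
                        kmmscand_record_candidate(vma, addr, ptent);
        }

        pte_unmap_unlock(start_pte, ptl);
}

The actual migration would still happen later from the migration list,
outside the PTL, as in the posted series.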

Thanks and Regards
- Raghu


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2025-02-13  5:39 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-12-01 15:38 [RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit Raghavendra K T
2024-12-01 15:38 ` [RFC PATCH V0 01/10] mm: Add kmmscand kernel daemon Raghavendra K T
2024-12-01 15:38 ` [RFC PATCH V0 02/10] mm: Maintain mm_struct list in the system Raghavendra K T
2024-12-01 15:38 ` [RFC PATCH V0 03/10] mm: Scan the mm and create a migration list Raghavendra K T
2024-12-01 15:38 ` [RFC PATCH V0 04/10] mm/migration: Migrate accessed folios to toptier node Raghavendra K T
2024-12-01 15:38 ` [RFC PATCH V0 05/10] mm: Add throttling of mm scanning using scan_period Raghavendra K T
2024-12-01 15:38 ` [RFC PATCH V0 06/10] mm: Add throttling of mm scanning using scan_size Raghavendra K T
2024-12-01 15:38 ` [RFC PATCH V0 07/10] sysfs: Add sysfs support to tune scanning Raghavendra K T
2024-12-01 15:38 ` [RFC PATCH V0 08/10] vmstat: Add vmstat counters Raghavendra K T
2024-12-01 15:38 ` [RFC PATCH V0 09/10] trace/kmmscand: Add tracing of scanning and migration Raghavendra K T
2024-12-05 17:46   ` Steven Rostedt
2024-12-06  6:33     ` Raghavendra K T
2024-12-06 14:49       ` Steven Rostedt
2024-12-01 15:38 ` [RFC PATCH V0 DO NOT MERGE 10/10] kmmscand: Add scanning Raghavendra K T
2024-12-10 18:53 ` [RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit SeongJae Park
2024-12-20  6:30   ` Raghavendra K T
2025-02-12 17:02 ` Davidlohr Bueso
2025-02-13  5:39   ` Raghavendra K T

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox