From: Bharata B Rao <bharata@amd.com>
To: <linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>
Cc: <Jonathan.Cameron@huawei.com>, <dave.hansen@intel.com>,
<gourry@gourry.net>, <mgorman@techsingularity.net>,
<mingo@redhat.com>, <peterz@infradead.org>,
<raghavendra.kt@amd.com>, <riel@surriel.com>,
<rientjes@google.com>, <sj@kernel.org>, <weixugc@google.com>,
<willy@infradead.org>, <ying.huang@linux.alibaba.com>,
<ziy@nvidia.com>, <dave@stgolabs.net>, <nifan.cxl@gmail.com>,
<xuezhengchu@huawei.com>, <yiannis@zptcorp.com>,
<akpm@linux-foundation.org>, <david@redhat.com>,
<byungchul@sk.com>, <kinseyho@google.com>,
<joshua.hahnjy@gmail.com>, <yuanchu@google.com>,
<balbirs@nvidia.com>, <alok.rathore@samsung.com>,
<shivankg@amd.com>, Bharata B Rao <bharata@amd.com>
Subject: [RFC PATCH v4 7/9] mm: klruscand: use mglru scanning for page promotion
Date: Sat, 6 Dec 2025 15:44:21 +0530 [thread overview]
Message-ID: <20251206101423.5004-8-bharata@amd.com> (raw)
In-Reply-To: <20251206101423.5004-1-bharata@amd.com>
From: Kinsey Ho <kinseyho@google.com>
Introduce a new kernel daemon, klruscand, that periodically invokes the
MGLRU page table walk. It leverages the new callbacks to gather access
information and forwards it to pghot sub-system for promotion decisions.
This benefits from reusing the existing MGLRU page table walk
infrastructure, which is optimized with features such as hierarchical
scanning and bloom filters to reduce CPU overhead.
As an additional optimization to be added in the future, we can tune
the scan intervals for each memcg.
Signed-off-by: Kinsey Ho <kinseyho@google.com>
Signed-off-by: Yuanchu Xie <yuanchu@google.com>
[Reduced the scan interval to 500ms, KLRUSCAND to default n in config]
Signed-off-by: Bharata B Rao <bharata@amd.com>
---
mm/Kconfig | 8 ++++
mm/Makefile | 1 +
mm/klruscand.c | 110 +++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 119 insertions(+)
create mode 100644 mm/klruscand.c
diff --git a/mm/Kconfig b/mm/Kconfig
index 6057c96b9cd3..80221594413d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1403,6 +1403,14 @@ config HWMEM_PROFILER
rolled up to PGHOT sub-system for further action like hot page
promotion or NUMA Balancing
+config KLRUSCAND
+ bool "Kernel lower tier access scan daemon"
+ default n
+ depends on PGHOT && LRU_GEN_WALKS_MMU
+ help
+ Scan for accesses from lower tiers by invoking MGLRU to perform
+ page table walks.
+
source "mm/damon/Kconfig"
endmenu
diff --git a/mm/Makefile b/mm/Makefile
index a6fac171c36e..1c0c79fec106 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -147,3 +147,4 @@ obj-$(CONFIG_EXECMEM) += execmem.o
obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
obj-$(CONFIG_PT_RECLAIM) += pt_reclaim.o
obj-$(CONFIG_PGHOT) += pghot.o
+obj-$(CONFIG_KLRUSCAND) += klruscand.o
diff --git a/mm/klruscand.c b/mm/klruscand.c
new file mode 100644
index 000000000000..13a41b38d67d
--- /dev/null
+++ b/mm/klruscand.c
@@ -0,0 +1,110 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/memcontrol.h>
+#include <linux/kthread.h>
+#include <linux/module.h>
+#include <linux/vmalloc.h>
+#include <linux/memory-tiers.h>
+#include <linux/pghot.h>
+
+#include "internal.h"
+
+#define KLRUSCAND_INTERVAL 500
+#define BATCH_SIZE (2 << 16)
+
+static struct task_struct *scan_thread;
+static unsigned long pfn_batch[BATCH_SIZE];
+static int batch_index;
+
+static void flush_cb(void)
+{
+ int i;
+
+ for (i = 0; i < batch_index; i++) {
+ unsigned long pfn = pfn_batch[i];
+
+ pghot_record_access(pfn, NUMA_NO_NODE, PGHOT_PGTABLE_SCAN, jiffies);
+
+ if (i % 16 == 0)
+ cond_resched();
+ }
+ batch_index = 0;
+}
+
+static bool accessed_cb(unsigned long pfn)
+{
+ WARN_ON_ONCE(batch_index == BATCH_SIZE);
+
+ if (batch_index < BATCH_SIZE)
+ pfn_batch[batch_index++] = pfn;
+
+ return batch_index == BATCH_SIZE;
+}
+
+static int klruscand_run(void *unused)
+{
+ struct lru_gen_mm_walk *walk;
+
+ walk = kzalloc(sizeof(*walk),
+ __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
+ if (!walk)
+ return -ENOMEM;
+
+ while (!kthread_should_stop()) {
+ unsigned long next_wake_time;
+ long sleep_time;
+ struct mem_cgroup *memcg;
+ int flags;
+ int nid;
+
+ next_wake_time = jiffies + msecs_to_jiffies(KLRUSCAND_INTERVAL);
+
+ for_each_node_state(nid, N_MEMORY) {
+ pg_data_t *pgdat = NODE_DATA(nid);
+ struct reclaim_state rs = { 0 };
+
+ if (node_is_toptier(nid))
+ continue;
+
+ rs.mm_walk = walk;
+ set_task_reclaim_state(current, &rs);
+ flags = memalloc_noreclaim_save();
+
+ memcg = mem_cgroup_iter(NULL, NULL, NULL);
+ do {
+ struct lruvec *lruvec =
+ mem_cgroup_lruvec(memcg, pgdat);
+ unsigned long max_seq =
+ READ_ONCE((lruvec)->lrugen.max_seq);
+
+ lru_gen_scan_lruvec(lruvec, max_seq, accessed_cb, flush_cb);
+ cond_resched();
+ } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+ memalloc_noreclaim_restore(flags);
+ set_task_reclaim_state(current, NULL);
+ memset(walk, 0, sizeof(*walk));
+ }
+
+ sleep_time = next_wake_time - jiffies;
+ if (sleep_time > 0 && sleep_time != MAX_SCHEDULE_TIMEOUT)
+ schedule_timeout_idle(sleep_time);
+ }
+ kfree(walk);
+ return 0;
+}
+
+static int __init klruscand_init(void)
+{
+ struct task_struct *task;
+
+ task = kthread_run(klruscand_run, NULL, "klruscand");
+
+ if (IS_ERR(task)) {
+ pr_err("Failed to create klruscand kthread\n");
+ return PTR_ERR(task);
+ }
+
+ scan_thread = task;
+ return 0;
+}
+module_init(klruscand_init);
--
2.34.1
next prev parent reply other threads:[~2025-12-06 10:18 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-12-06 10:14 [RFC PATCH v4 0/9] mm: Hot page tracking and promotion infrastructure Bharata B Rao
2025-12-06 10:14 ` [RFC PATCH v4 1/9] mm: migrate: Allow misplaced migration without VMA too Bharata B Rao
2025-12-06 10:14 ` [RFC PATCH v4 2/9] migrate: implement migrate_misplaced_folios_batch Bharata B Rao
2025-12-06 10:14 ` [RFC PATCH v4 3/9] mm: Hot page tracking and promotion Bharata B Rao
2025-12-06 10:14 ` [RFC PATCH v4 4/9] x86: ibs: In-kernel IBS driver for memory access profiling Bharata B Rao
2025-12-06 10:14 ` [RFC PATCH v4 5/9] x86: ibs: Enable IBS profiling for memory accesses Bharata B Rao
2025-12-06 10:14 ` [RFC PATCH v4 6/9] mm: mglru: generalize page table walk Bharata B Rao
2025-12-06 10:14 ` Bharata B Rao [this message]
2025-12-06 10:14 ` [RFC PATCH v4 8/9] mm: sched: Move hot page promotion from NUMAB=2 to pghot tracking Bharata B Rao
2025-12-06 10:14 ` [RFC PATCH v4 9/9] mm: pghot: Add folio_mark_accessed() as hotness source Bharata B Rao
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20251206101423.5004-8-bharata@amd.com \
--to=bharata@amd.com \
--cc=Jonathan.Cameron@huawei.com \
--cc=akpm@linux-foundation.org \
--cc=alok.rathore@samsung.com \
--cc=balbirs@nvidia.com \
--cc=byungchul@sk.com \
--cc=dave.hansen@intel.com \
--cc=dave@stgolabs.net \
--cc=david@redhat.com \
--cc=gourry@gourry.net \
--cc=joshua.hahnjy@gmail.com \
--cc=kinseyho@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mgorman@techsingularity.net \
--cc=mingo@redhat.com \
--cc=nifan.cxl@gmail.com \
--cc=peterz@infradead.org \
--cc=raghavendra.kt@amd.com \
--cc=riel@surriel.com \
--cc=rientjes@google.com \
--cc=shivankg@amd.com \
--cc=sj@kernel.org \
--cc=weixugc@google.com \
--cc=willy@infradead.org \
--cc=xuezhengchu@huawei.com \
--cc=yiannis@zptcorp.com \
--cc=ying.huang@linux.alibaba.com \
--cc=yuanchu@google.com \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox