From: Jonathan Cameron <jonathan.cameron@huawei.com>
To: Bharata B Rao <bharata@amd.com>, <akpm@linux-foundation.org>,
<david@redhat.com>
Cc: <linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>,
<dave.hansen@intel.com>, <gourry@gourry.net>,
<hannes@cmpxchg.org>, <mgorman@techsingularity.net>,
<mingo@redhat.com>, <peterz@infradead.org>,
<raghavendra.kt@amd.com>, <riel@surriel.com>,
<rientjes@google.com>, <sj@kernel.org>, <weixugc@google.com>,
<willy@infradead.org>, <ying.huang@linux.alibaba.com>,
<ziy@nvidia.com>, <dave@stgolabs.net>, <nifan.cxl@gmail.com>,
<xuezhengchu@huawei.com>, <yiannis@zptcorp.com>,
<byungchul@sk.com>, <kinseyho@google.com>,
<joshua.hahnjy@gmail.com>, <yuanchu@google.com>,
<balbirs@nvidia.com>, <alok.rathore@samsung.com>
Subject: Re: [RFC PATCH v2 7/8] mm: klruscand: use mglru scanning for page promotion
Date: Fri, 3 Oct 2025 13:30:25 +0100 [thread overview]
Message-ID: <20251003133025.00006f4b@huawei.com> (raw)
In-Reply-To: <20250910144653.212066-8-bharata@amd.com>
On Wed, 10 Sep 2025 20:16:52 +0530
Bharata B Rao <bharata@amd.com> wrote:
> From: Kinsey Ho <kinseyho@google.com>
>
> Introduce a new kernel daemon, klruscand, that periodically invokes the
> MGLRU page table walk. It leverages the new callbacks to gather access
> information and forwards it to the pghot hot page tracking sub-system
> for promotion decisions.
>
> This benefits from reusing the existing MGLRU page table walk
> infrastructure, which is optimized with features such as hierarchical
> scanning and bloom filters to reduce CPU overhead.
>
> As an additional optimization to be added in the future, we can tune
> the scan intervals for each memcg.
>
> Signed-off-by: Kinsey Ho <kinseyho@google.com>
> Signed-off-by: Yuanchu Xie <yuanchu@google.com>
> Signed-off-by: Bharata B Rao <bharata@amd.com>
> [Reduced the scan interval to 100ms, pfn_t to unsigned long]
Some very minor comments inline. I know even less about the stuff this
is using than IBS (and I don't know much about that ;)
J
> ---
> mm/Kconfig | 8 ++++
> mm/Makefile | 1 +
> mm/klruscand.c | 118 +++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 127 insertions(+)
> create mode 100644 mm/klruscand.c
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 8b236eb874cf..6d53c1208729 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1393,6 +1393,14 @@ config PGHOT
> by various sources. Asynchronous promotion is done by per-node
> kernel threads.
>
> +config KLRUSCAND
> + bool "Kernel lower tier access scan daemon"
> + default y
Why default to y? That's very rarely done for new features.
> + depends on PGHOT && LRU_GEN_WALKS_MMU
> + help
> + Scan for accesses from lower tiers by invoking MGLRU to perform
> + page table walks.
> diff --git a/mm/klruscand.c b/mm/klruscand.c
> new file mode 100644
> index 000000000000..1a51aab29bd9
> --- /dev/null
> +++ b/mm/klruscand.c
> @@ -0,0 +1,118 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +#include <linux/memcontrol.h>
Probably pick some ordering scheme for includes.
I'm not spotting what is currently used here.
> +#include <linux/kthread.h>
> +#include <linux/module.h>
> +#include <linux/vmalloc.h>
> +#include <linux/random.h>
> +#include <linux/migrate.h>
> +#include <linux/mm_inline.h>
> +#include <linux/slab.h>
> +#include <linux/sched/clock.h>
> +#include <linux/memory-tiers.h>
> +#include <linux/sched/mm.h>
> +#include <linux/sched.h>
> +#include <linux/pghot.h>
> +
> +#include "internal.h"
> +
> +#define KLRUSCAND_INTERVAL_MS 100
> +#define BATCH_SIZE (2 << 16)
> +
> +static struct task_struct *scan_thread;
> +static unsigned long pfn_batch[BATCH_SIZE];
> +static int batch_index;
> +
> +static void flush_cb(void)
> +{
> + int i = 0;
> +
> + for (; i < batch_index; i++) {
> + u64 pfn = pfn_batch[i];
Why dance through types? pfn_batch is unsigned long and it is
cast back to that below.
> +
> + pghot_record_access((unsigned long)pfn, NUMA_NO_NODE,
> + PGHOT_PGTABLE_SCAN, jiffies);
> +
> + if (i % 16 == 0)
No problem with this, but maybe a comment on why 16?
> + cond_resched();
> + }
> + batch_index = 0;
> +}
> +static int klruscand_run(void *unused)
> +{
> + struct lru_gen_mm_walk *walk;
> +
> + walk = kzalloc(sizeof(*walk),
> + __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
Maybe use __free() magic so we can forget about having to clear this up on exit.
Entirely up to you though as doesn't simplify code much in this case.
> + if (!walk)
> + return -ENOMEM;
> +
> + while (!kthread_should_stop()) {
> + unsigned long next_wake_time;
> + long sleep_time;
> + struct mem_cgroup *memcg;
> + int flags;
> + int nid;
> +
> + next_wake_time = jiffies + msecs_to_jiffies(KLRUSCAND_INTERVAL_MS);
> +
> + for_each_node_state(nid, N_MEMORY) {
> + pg_data_t *pgdat = NODE_DATA(nid);
> + struct reclaim_state rs = { 0 };
> +
> + if (node_is_toptier(nid))
> + continue;
> +
> + rs.mm_walk = walk;
> + set_task_reclaim_state(current, &rs);
> + flags = memalloc_noreclaim_save();
> +
> + memcg = mem_cgroup_iter(NULL, NULL, NULL);
> + do {
> + struct lruvec *lruvec =
> + mem_cgroup_lruvec(memcg, pgdat);
> + unsigned long max_seq =
> + READ_ONCE((lruvec)->lrugen.max_seq);
> +
> + lru_gen_scan_lruvec(lruvec, max_seq,
> + accessed_cb, flush_cb);
> + cond_resched();
> + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
> +
> + memalloc_noreclaim_restore(flags);
> + set_task_reclaim_state(current, NULL);
> + memset(walk, 0, sizeof(*walk));
> + }
> +
> + sleep_time = next_wake_time - jiffies;
> + if (sleep_time > 0 && sleep_time != MAX_SCHEDULE_TIMEOUT)
> + schedule_timeout_idle(sleep_time);
> + }
> + kfree(walk);
> + return 0;
> +}
next prev parent reply other threads:[~2025-10-03 12:30 UTC|newest]
Thread overview: 53+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-10 14:46 [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 1/8] mm: migrate: Allow misplaced migration without VMA too Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 2/8] migrate: implement migrate_misplaced_folios_batch Bharata B Rao
2025-10-03 10:36 ` Jonathan Cameron
2025-10-03 11:02 ` Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 3/8] mm: Hot page tracking and promotion Bharata B Rao
2025-10-03 11:17 ` Jonathan Cameron
2025-10-06 4:13 ` Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 4/8] x86: ibs: In-kernel IBS driver for memory access profiling Bharata B Rao
2025-10-03 12:19 ` Jonathan Cameron
2025-10-06 4:28 ` Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 5/8] x86: ibs: Enable IBS profiling for memory accesses Bharata B Rao
2025-10-03 12:22 ` Jonathan Cameron
2025-09-10 14:46 ` [RFC PATCH v2 6/8] mm: mglru: generalize page table walk Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 7/8] mm: klruscand: use mglru scanning for page promotion Bharata B Rao
2025-10-03 12:30 ` Jonathan Cameron [this message]
2025-09-10 14:46 ` [RFC PATCH v2 8/8] mm: sched: Move hot page promotion from NUMAB=2 to kpromoted Bharata B Rao
2025-10-03 12:38 ` Jonathan Cameron
2025-10-06 5:57 ` Bharata B Rao
2025-10-06 9:53 ` Jonathan Cameron
2025-09-10 15:39 ` [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure Matthew Wilcox
2025-09-10 16:01 ` Gregory Price
2025-09-16 19:45 ` David Rientjes
2025-09-16 22:02 ` Gregory Price
2025-09-17 0:30 ` Wei Xu
2025-09-17 3:20 ` Balbir Singh
2025-09-17 4:15 ` Bharata B Rao
2025-09-17 16:49 ` Jonathan Cameron
2025-09-25 14:03 ` Yiannis Nikolakopoulos
2025-09-25 14:41 ` Gregory Price
2025-10-16 11:48 ` Yiannis Nikolakopoulos
2025-09-25 15:00 ` Jonathan Cameron
2025-09-25 15:08 ` Gregory Price
2025-09-25 15:18 ` Gregory Price
2025-09-25 15:24 ` Jonathan Cameron
2025-09-25 16:06 ` Gregory Price
2025-09-25 17:23 ` Jonathan Cameron
2025-09-25 19:02 ` Gregory Price
2025-10-01 7:22 ` Gregory Price
2025-10-17 9:53 ` Yiannis Nikolakopoulos
2025-10-17 14:15 ` Gregory Price
2025-10-17 14:36 ` Jonathan Cameron
2025-10-17 14:59 ` Gregory Price
2025-10-20 14:05 ` Jonathan Cameron
2025-10-21 18:52 ` Gregory Price
2025-10-21 18:57 ` Gregory Price
2025-10-22 9:09 ` Jonathan Cameron
2025-10-22 15:05 ` Gregory Price
2025-10-23 15:29 ` Jonathan Cameron
2025-10-16 16:16 ` Yiannis Nikolakopoulos
2025-10-20 14:23 ` Jonathan Cameron
2025-10-20 15:05 ` Gregory Price
2025-10-08 17:59 ` Vinicius Petrucci
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20251003133025.00006f4b@huawei.com \
--to=jonathan.cameron@huawei.com \
--cc=akpm@linux-foundation.org \
--cc=alok.rathore@samsung.com \
--cc=balbirs@nvidia.com \
--cc=bharata@amd.com \
--cc=byungchul@sk.com \
--cc=dave.hansen@intel.com \
--cc=dave@stgolabs.net \
--cc=david@redhat.com \
--cc=gourry@gourry.net \
--cc=hannes@cmpxchg.org \
--cc=joshua.hahnjy@gmail.com \
--cc=kinseyho@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mgorman@techsingularity.net \
--cc=mingo@redhat.com \
--cc=nifan.cxl@gmail.com \
--cc=peterz@infradead.org \
--cc=raghavendra.kt@amd.com \
--cc=riel@surriel.com \
--cc=rientjes@google.com \
--cc=sj@kernel.org \
--cc=weixugc@google.com \
--cc=willy@infradead.org \
--cc=xuezhengchu@huawei.com \
--cc=yiannis@zptcorp.com \
--cc=ying.huang@linux.alibaba.com \
--cc=yuanchu@google.com \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox