Re: [RFC PATCH v2 7/8] mm: klruscand: use mglru scanning for page promotion

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Jonathan Cameron <jonathan.cameron@huawei.com>
To: Bharata B Rao <bharata@amd.com>, <akpm@linux-foundation.org>,
	<david@redhat.com>
Cc: <linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>,
	<dave.hansen@intel.com>, <gourry@gourry.net>,
	<hannes@cmpxchg.org>, <mgorman@techsingularity.net>,
	<mingo@redhat.com>, <peterz@infradead.org>,
	<raghavendra.kt@amd.com>, <riel@surriel.com>,
	<rientjes@google.com>, <sj@kernel.org>, <weixugc@google.com>,
	<willy@infradead.org>, <ying.huang@linux.alibaba.com>,
	<ziy@nvidia.com>, <dave@stgolabs.net>, <nifan.cxl@gmail.com>,
	<xuezhengchu@huawei.com>, <yiannis@zptcorp.com>,
	<byungchul@sk.com>, <kinseyho@google.com>,
	<joshua.hahnjy@gmail.com>, <yuanchu@google.com>,
	<balbirs@nvidia.com>, <alok.rathore@samsung.com>
Subject: Re: [RFC PATCH v2 7/8] mm: klruscand: use mglru scanning for page promotion
Date: Fri, 3 Oct 2025 13:30:25 +0100	[thread overview]
Message-ID: <20251003133025.00006f4b@huawei.com> (raw)
In-Reply-To: <20250910144653.212066-8-bharata@amd.com>

On Wed, 10 Sep 2025 20:16:52 +0530
Bharata B Rao <bharata@amd.com> wrote:

> From: Kinsey Ho <kinseyho@google.com>
> 
> Introduce a new kernel daemon, klruscand, that periodically invokes the
> MGLRU page table walk. It leverages the new callbacks to gather access
> information and forwards it to the pghot hot page tracking sub-system
> for promotion decisions.
> 
> This benefits from reusing the existing MGLRU page table walk
> infrastructure, which is optimized with features such as hierarchical
> scanning and bloom filters to reduce CPU overhead.
> 
> As an additional optimization to be added in the future, we can tune
> the scan intervals for each memcg.
> 
> Signed-off-by: Kinsey Ho <kinseyho@google.com>
> Signed-off-by: Yuanchu Xie <yuanchu@google.com>
> Signed-off-by: Bharata B Rao <bharata@amd.com>
> 	[Reduced the scan interval to 100ms, pfn_t to unsigned long]
Some very minor comments inline.  I know even less about the stuff this
is using than IBS (and I don't know much about that ;)

J
> ---
>  mm/Kconfig     |   8 ++++
>  mm/Makefile    |   1 +
>  mm/klruscand.c | 118 +++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 127 insertions(+)
>  create mode 100644 mm/klruscand.c
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 8b236eb874cf..6d53c1208729 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1393,6 +1393,14 @@ config PGHOT
>  	  by various sources. Asynchronous promotion is done by per-node
>  	  kernel threads.
>  
> +config KLRUSCAND
> +	bool "Kernel lower tier access scan daemon"
> +	default y

Why default to y? That's very rarely done for new features.

> +	depends on PGHOT && LRU_GEN_WALKS_MMU
> +	help
> +	  Scan for accesses from lower tiers by invoking MGLRU to perform
> +	  page table walks.

> diff --git a/mm/klruscand.c b/mm/klruscand.c
> new file mode 100644
> index 000000000000..1a51aab29bd9
> --- /dev/null
> +++ b/mm/klruscand.c
> @@ -0,0 +1,118 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +#include <linux/memcontrol.h>

Probably pick some ordering scheme for includes.
I'm not spotting what is currently used here.

> +#include <linux/kthread.h>
> +#include <linux/module.h>
> +#include <linux/vmalloc.h>
> +#include <linux/random.h>
> +#include <linux/migrate.h>
> +#include <linux/mm_inline.h>
> +#include <linux/slab.h>
> +#include <linux/sched/clock.h>
> +#include <linux/memory-tiers.h>
> +#include <linux/sched/mm.h>
> +#include <linux/sched.h>
> +#include <linux/pghot.h>
> +
> +#include "internal.h"
> +
> +#define KLRUSCAND_INTERVAL_MS 100
> +#define BATCH_SIZE (2 << 16)
> +
> +static struct task_struct *scan_thread;
> +static unsigned long pfn_batch[BATCH_SIZE];
> +static int batch_index;
> +
> +static void flush_cb(void)
> +{
> +	int i = 0;
> +
> +	for (; i < batch_index; i++) {
> +		u64 pfn = pfn_batch[i];

Why dance through types?  pfn_batch is unsigned long and it is
cast back to that below.

> +
> +		pghot_record_access((unsigned long)pfn, NUMA_NO_NODE,
> +					PGHOT_PGTABLE_SCAN, jiffies);
> +
> +		if (i % 16 == 0)

No problem with this, but maybe a comment on why 16?

> +			cond_resched();
> +	}
> +	batch_index = 0;
> +}

> +static int klruscand_run(void *unused)
> +{
> +	struct lru_gen_mm_walk *walk;
> +
> +	walk = kzalloc(sizeof(*walk),
> +		       __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);

Maybe use __free() magic so we can forget about having to clear this up on exit.
Entirely up to you though as doesn't simplify code much in this case.

> +	if (!walk)
> +		return -ENOMEM;
> +
> +	while (!kthread_should_stop()) {
> +		unsigned long next_wake_time;
> +		long sleep_time;
> +		struct mem_cgroup *memcg;
> +		int flags;
> +		int nid;
> +
> +		next_wake_time = jiffies + msecs_to_jiffies(KLRUSCAND_INTERVAL_MS);
> +
> +		for_each_node_state(nid, N_MEMORY) {
> +			pg_data_t *pgdat = NODE_DATA(nid);
> +			struct reclaim_state rs = { 0 };
> +
> +			if (node_is_toptier(nid))
> +				continue;
> +
> +			rs.mm_walk = walk;
> +			set_task_reclaim_state(current, &rs);
> +			flags = memalloc_noreclaim_save();
> +
> +			memcg = mem_cgroup_iter(NULL, NULL, NULL);
> +			do {
> +				struct lruvec *lruvec =
> +					mem_cgroup_lruvec(memcg, pgdat);
> +				unsigned long max_seq =
> +					READ_ONCE((lruvec)->lrugen.max_seq);
> +
> +				lru_gen_scan_lruvec(lruvec, max_seq,
> +						    accessed_cb, flush_cb);
> +				cond_resched();
> +			} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
> +
> +			memalloc_noreclaim_restore(flags);
> +			set_task_reclaim_state(current, NULL);
> +			memset(walk, 0, sizeof(*walk));
> +		}
> +
> +		sleep_time = next_wake_time - jiffies;
> +		if (sleep_time > 0 && sleep_time != MAX_SCHEDULE_TIMEOUT)
> +			schedule_timeout_idle(sleep_time);
> +	}
> +	kfree(walk);
> +	return 0;
> +}

next prev parent reply	other threads:[~2025-10-03 12:30 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-10 14:46 [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 1/8] mm: migrate: Allow misplaced migration without VMA too Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 2/8] migrate: implement migrate_misplaced_folios_batch Bharata B Rao
2025-10-03 10:36   ` Jonathan Cameron
2025-10-03 11:02     ` Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 3/8] mm: Hot page tracking and promotion Bharata B Rao
2025-10-03 11:17   ` Jonathan Cameron
2025-10-06  4:13     ` Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 4/8] x86: ibs: In-kernel IBS driver for memory access profiling Bharata B Rao
2025-10-03 12:19   ` Jonathan Cameron
2025-10-06  4:28     ` Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 5/8] x86: ibs: Enable IBS profiling for memory accesses Bharata B Rao
2025-10-03 12:22   ` Jonathan Cameron
2025-09-10 14:46 ` [RFC PATCH v2 6/8] mm: mglru: generalize page table walk Bharata B Rao
2025-09-10 14:46 ` [RFC PATCH v2 7/8] mm: klruscand: use mglru scanning for page promotion Bharata B Rao
2025-10-03 12:30   ` Jonathan Cameron [this message]
2025-09-10 14:46 ` [RFC PATCH v2 8/8] mm: sched: Move hot page promotion from NUMAB=2 to kpromoted Bharata B Rao
2025-10-03 12:38   ` Jonathan Cameron
2025-10-06  5:57     ` Bharata B Rao
2025-10-06  9:53       ` Jonathan Cameron
2025-09-10 15:39 ` [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure Matthew Wilcox
2025-09-10 16:01   ` Gregory Price
2025-09-16 19:45     ` David Rientjes
2025-09-16 22:02       ` Gregory Price
2025-09-17  0:30       ` Wei Xu
2025-09-17  3:20         ` Balbir Singh
2025-09-17  4:15           ` Bharata B Rao
2025-09-17 16:49         ` Jonathan Cameron
2025-09-25 14:03           ` Yiannis Nikolakopoulos
2025-09-25 14:41             ` Gregory Price
2025-10-16 11:48               ` Yiannis Nikolakopoulos
2025-09-25 15:00             ` Jonathan Cameron
2025-09-25 15:08               ` Gregory Price
2025-09-25 15:18                 ` Gregory Price
2025-09-25 15:24                 ` Jonathan Cameron
2025-09-25 16:06                   ` Gregory Price
2025-09-25 17:23                     ` Jonathan Cameron
2025-09-25 19:02                       ` Gregory Price
2025-10-01  7:22                         ` Gregory Price
2025-10-17  9:53                           ` Yiannis Nikolakopoulos
2025-10-17 14:15                             ` Gregory Price
2025-10-17 14:36                               ` Jonathan Cameron
2025-10-17 14:59                                 ` Gregory Price
2025-10-20 14:05                                   ` Jonathan Cameron
2025-10-21 18:52                                     ` Gregory Price
2025-10-21 18:57                                       ` Gregory Price
2025-10-22  9:09                                         ` Jonathan Cameron
2025-10-22 15:05                                           ` Gregory Price
2025-10-23 15:29                                             ` Jonathan Cameron
2025-10-16 16:16               ` Yiannis Nikolakopoulos
2025-10-20 14:23                 ` Jonathan Cameron
2025-10-20 15:05                   ` Gregory Price
2025-10-08 17:59       ` Vinicius Petrucci

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251003133025.00006f4b@huawei.com \
    --to=jonathan.cameron@huawei.com \
    --cc=akpm@linux-foundation.org \
    --cc=alok.rathore@samsung.com \
    --cc=balbirs@nvidia.com \
    --cc=bharata@amd.com \
    --cc=byungchul@sk.com \
    --cc=dave.hansen@intel.com \
    --cc=dave@stgolabs.net \
    --cc=david@redhat.com \
    --cc=gourry@gourry.net \
    --cc=hannes@cmpxchg.org \
    --cc=joshua.hahnjy@gmail.com \
    --cc=kinseyho@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@techsingularity.net \
    --cc=mingo@redhat.com \
    --cc=nifan.cxl@gmail.com \
    --cc=peterz@infradead.org \
    --cc=raghavendra.kt@amd.com \
    --cc=riel@surriel.com \
    --cc=rientjes@google.com \
    --cc=sj@kernel.org \
    --cc=weixugc@google.com \
    --cc=willy@infradead.org \
    --cc=xuezhengchu@huawei.com \
    --cc=yiannis@zptcorp.com \
    --cc=ying.huang@linux.alibaba.com \
    --cc=yuanchu@google.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox