From: Jonathan Cameron <jonathan.cameron@huawei.com>
To: Bharata B Rao <bharata@amd.com>
Cc: <linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>,
<dave.hansen@intel.com>, <gourry@gourry.net>,
<hannes@cmpxchg.org>, <mgorman@techsingularity.net>,
<mingo@redhat.com>, <peterz@infradead.org>,
<raghavendra.kt@amd.com>, <riel@surriel.com>,
<rientjes@google.com>, <sj@kernel.org>, <weixugc@google.com>,
<willy@infradead.org>, <ying.huang@linux.alibaba.com>,
<ziy@nvidia.com>, <dave@stgolabs.net>, <nifan.cxl@gmail.com>,
<xuezhengchu@huawei.com>, <yiannis@zptcorp.com>,
<akpm@linux-foundation.org>, <david@redhat.com>,
<byungchul@sk.com>, <kinseyho@google.com>,
<joshua.hahnjy@gmail.com>, <yuanchu@google.com>,
<balbirs@nvidia.com>, <alok.rathore@samsung.com>
Subject: Re: [RFC PATCH v2 8/8] mm: sched: Move hot page promotion from NUMAB=2 to kpromoted
Date: Fri, 3 Oct 2025 13:38:18 +0100
Message-ID: <20251003133818.000017af@huawei.com>
In-Reply-To: <20250910144653.212066-9-bharata@amd.com>
On Wed, 10 Sep 2025 20:16:53 +0530
Bharata B Rao <bharata@amd.com> wrote:
> Currently hot page promotion (NUMA_BALANCING_MEMORY_TIERING
> mode of NUMA Balancing) does hot page detection (via hint faults),
> hot page classification and eventual promotion, all by itself and
> sits within the scheduler.
>
> With the new hot page tracking and promotion mechanism being
> available, NUMA Balancing can limit itself to detection of
> hot pages (via hint faults) and off-load rest of the
> functionality to the common hot page tracking system.
>
> pghot_record_access(PGHOT_HINT_FAULT) API is used to feed the
> hot page info. In addition, the migration rate limiting and
> dynamic threshold logic are moved to kpromoted so that the same
> can be used for hot pages reported by other sources too.
>
> Signed-off-by: Bharata B Rao <bharata@amd.com>
Making a direct replacement without any fallback to the previous method
is going to need a lot of data to show there are no important regressions.
So a bold move, if that's the intent!
J
> ---
> include/linux/pghot.h | 2 +
> kernel/sched/fair.c | 149 ++----------------------------------------
> mm/memory.c | 32 ++-------
> mm/pghot.c | 132 +++++++++++++++++++++++++++++++++++--
> 4 files changed, 142 insertions(+), 173 deletions(-)
>
> diff --git a/mm/pghot.c b/mm/pghot.c
> index 9f7581818b8f..9f5746892bce 100644
> --- a/mm/pghot.c
> +++ b/mm/pghot.c
> @@ -9,6 +9,9 @@
> *
> * kpromoted is a kernel thread that runs on each toptier node and
> * promotes pages from max_heap.
> + *
> + * Migration rate-limiting and dynamic threshold logic implementations
> + * were moved from NUMA Balancing mode 2.
> */
> #include <linux/pghot.h>
> #include <linux/kthread.h>
> @@ -34,6 +37,9 @@ static bool kpromoted_started __ro_after_init;
>
> static unsigned int sysctl_pghot_freq_window = KPROMOTED_FREQ_WINDOW;
>
> +/* Restrict the NUMA promotion throughput (MB/s) for each target node. */
> +static unsigned int sysctl_pghot_promote_rate_limit = 65536;
If the comment correlates with the value, this is 64 GiB/s? That seems
unlikely, though I guess it's possible.
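For anyone else checking the arithmetic on that default, here is a quick
standalone sketch of how a MB/s sysctl value becomes a per-second page
budget (assuming kpromoted keeps the NUMAB=2 conversion; 4 KiB pages
assumed for the shift):

```c
#include <assert.h>

#define PAGE_SHIFT 12	/* assuming 4 KiB pages for the arithmetic */

/*
 * MB/s -> pages/s: MB to bytes is << 20, bytes to pages is >> PAGE_SHIFT,
 * so the combined shift is (20 - PAGE_SHIFT).
 */
static unsigned long promote_rate_limit_pages(unsigned int mbps)
{
	return (unsigned long)mbps << (20 - PAGE_SHIFT);
}
```

So the default of 65536 works out at 16M pages per second, i.e. 64 GiB/s,
which reads more like "effectively unlimited" than a tuned rate.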
> +
> #ifdef CONFIG_SYSCTL
> static const struct ctl_table pghot_sysctls[] = {
> {
> @@ -44,8 +50,17 @@ static const struct ctl_table pghot_sysctls[] = {
> .proc_handler = proc_dointvec_minmax,
> .extra1 = SYSCTL_ZERO,
> },
> + {
> + .procname = "pghot_promote_rate_limit_MBps",
> + .data = &sysctl_pghot_promote_rate_limit,
> + .maxlen = sizeof(unsigned int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + .extra1 = SYSCTL_ZERO,
> + },
> };
> #endif
> +
Put that in an earlier patch to reduce the noise here.
> static bool phi_heap_less(const void *lhs, const void *rhs, void *args)
> {
> return (*(struct pghot_info **)lhs)->frequency >
> @@ -94,11 +109,99 @@ static bool phi_heap_insert(struct max_heap *phi_heap, struct pghot_info *phi)
> return true;
> }
>
> +/*
> + * For memory tiering mode, if there are enough free pages (more than
> + * enough watermark defined here) in fast memory node, to take full
I'd use enough_wmark here, just because "more than enough" is a common
English phrase and I at least tripped over that sentence as a result!
> + * advantage of fast memory capacity, all recently accessed slow
> + * memory pages will be migrated to fast memory node without
> + * considering hot threshold.
> + */
> +static bool pgdat_free_space_enough(struct pglist_data *pgdat)
> +{
> + int z;
> + unsigned long enough_wmark;
> +
> + enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT,
> + pgdat->node_present_pages >> 4);
> + for (z = pgdat->nr_zones - 1; z >= 0; z--) {
> + struct zone *zone = pgdat->node_zones + z;
> +
> + if (!populated_zone(zone))
> + continue;
> +
> + if (zone_watermark_ok(zone, 0,
> + promo_wmark_pages(zone) + enough_wmark,
> + ZONE_MOVABLE, 0))
> + return true;
> + }
> + return false;
> +}
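For reference, the watermark the function above adds on top of the
promo watermark is the larger of 1 GiB and 1/16th of the node's present
memory, in pages. A standalone sketch of just that computation (4 KiB
pages assumed):

```c
#include <assert.h>

#define PAGE_SHIFT 12	/* assuming 4 KiB pages */

/* max(1 GiB in pages, node_present_pages / 16), as in pgdat_free_space_enough() */
static unsigned long enough_wmark_pages(unsigned long node_present_pages)
{
	unsigned long one_gib = (1UL * 1024 * 1024 * 1024) >> PAGE_SHIFT;
	unsigned long one_sixteenth = node_present_pages >> 4;

	return one_gib > one_sixteenth ? one_gib : one_sixteenth;
}
```

The 1 GiB floor dominates on nodes up to 16 GiB; beyond that the 1/16th
term takes over.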
> +
> +static void kpromoted_promotion_adjust_threshold(struct pglist_data *pgdat,
Needs documentation of the algorithm and the reasons for the various choices.
I see it is a code move though, so maybe that's a job for another day.
> + unsigned long rate_limit,
> + unsigned int ref_th,
> + unsigned long now)
> +{
> + unsigned int start, th_period, unit_th, th;
> + unsigned long nr_cand, ref_cand, diff_cand;
> +
> + now = jiffies_to_msecs(now);
> + th_period = KPROMOTED_PROMOTION_THRESHOLD_WINDOW;
> + start = pgdat->nbp_th_start;
> + if (now - start > th_period &&
> + cmpxchg(&pgdat->nbp_th_start, start, now) == start) {
> + ref_cand = rate_limit *
> + KPROMOTED_PROMOTION_THRESHOLD_WINDOW / MSEC_PER_SEC;
> + nr_cand = node_page_state(pgdat, PGPROMOTE_CANDIDATE);
> + diff_cand = nr_cand - pgdat->nbp_th_nr_cand;
> + unit_th = ref_th * 2 / KPROMOTED_MIGRATION_ADJUST_STEPS;
> + th = pgdat->nbp_threshold ? : ref_th;
> + if (diff_cand > ref_cand * 11 / 10)
> + th = max(th - unit_th, unit_th);
> + else if (diff_cand < ref_cand * 9 / 10)
> + th = min(th + unit_th, ref_th * 2);
> + pgdat->nbp_th_nr_cand = nr_cand;
> + pgdat->nbp_threshold = th;
> + }
> +}
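To my point above about documenting the algorithm, my reading of what is
being moved is: once per window, compare the number of promotion
candidates seen against what the rate limit would allow, and nudge the
latency threshold down (stricter) if we saw more than 110% of the
budget, up (more permissive) if fewer than 90%. A standalone sketch of
one adjustment step (names mirror the patch; the step count of 16 is
assumed to carry over from the NUMAB=2 code):

```c
#include <assert.h>

#define ADJUST_STEPS 16	/* assuming KPROMOTED_MIGRATION_ADJUST_STEPS == 16 */

/*
 * One window's worth of the dynamic threshold adjustment: more than
 * 110% of the candidate budget lowers (tightens) the threshold by one
 * step, fewer than 90% raises (relaxes) it, clamped to
 * [unit_th, 2 * ref_th].  cur_th == 0 means "not yet initialised".
 */
static unsigned int adjust_threshold(unsigned long diff_cand,
				     unsigned long ref_cand,
				     unsigned int ref_th,
				     unsigned int cur_th)
{
	unsigned int unit_th = ref_th * 2 / ADJUST_STEPS;
	unsigned int th = cur_th ? cur_th : ref_th;

	if (diff_cand > ref_cand * 11 / 10)
		th = (th > 2 * unit_th) ? th - unit_th : unit_th;
	else if (diff_cand < ref_cand * 9 / 10)
		th = (th + unit_th < ref_th * 2) ? th + unit_th : ref_th * 2;
	return th;
}
```

Within the +/-10% band the threshold is left alone, which gives the
control loop some hysteresis.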
> +
> static bool phi_is_pfn_hot(struct pghot_info *phi)
> {
> struct page *page = pfn_to_online_page(phi->pfn);
> - unsigned long now = jiffies;
> struct folio *folio;
> + struct pglist_data *pgdat;
> + unsigned long rate_limit;
> + unsigned int latency, th, def_th;
> + unsigned long now = jiffies;
>
Avoid the reorder. Just put it here in the first place if you prefer this ordering.