From: Bharata B Rao <bharata@amd.com>
To: <linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>
Cc: <Jonathan.Cameron@huawei.com>, <dave.hansen@intel.com>,
<gourry@gourry.net>, <mgorman@techsingularity.net>,
<mingo@redhat.com>, <peterz@infradead.org>,
<raghavendra.kt@amd.com>, <riel@surriel.com>,
<rientjes@google.com>, <sj@kernel.org>, <weixugc@google.com>,
<willy@infradead.org>, <ying.huang@linux.alibaba.com>,
<ziy@nvidia.com>, <dave@stgolabs.net>, <nifan.cxl@gmail.com>,
<xuezhengchu@huawei.com>, <yiannis@zptcorp.com>,
<akpm@linux-foundation.org>, <david@redhat.com>,
<byungchul@sk.com>, <kinseyho@google.com>,
<joshua.hahnjy@gmail.com>, <yuanchu@google.com>,
<balbirs@nvidia.com>, <alok.rathore@samsung.com>,
<shivankg@amd.com>
Subject: Re: [RFC PATCH v6 5/5] mm: sched: move NUMA balancing tiering promotion to pghot
Date: Mon, 30 Mar 2026 10:16:19 +0530 [thread overview]
Message-ID: <12112823-f4b7-4854-a32c-c40985c65521@amd.com> (raw)
In-Reply-To: <20260323095104.238982-6-bharata@amd.com>
Replies to review from Sashiko
[https://sashiko.dev/#/patchset/20260323095104.238982-1-bharata%40amd.com]
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index b298cba853ab..fe957ff91df9 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>[ ... ]
>> @@ -2217,11 +2218,26 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>> goto out_map;
>>
>> nid = folio_nid(folio);
>> + needs_promotion = folio_is_promo_candidate(folio);
>>
>> target_nid = numa_migrate_check(folio, vmf, haddr, &flags, writable,
>> &last_cpupid);
>> if (target_nid == NUMA_NO_NODE)
>> goto out_map;
> Does this code inadvertently record pages forbidden from migrating in pghot?
> If numa_migrate_check() returns NUMA_NO_NODE due to memory policy, the code
> jumps to out_map. At that point, nid retains its initialized value of
> folio_nid(folio). Since nid != NUMA_NO_NODE evaluates to true, it looks like
> pghot_record_access() will still be called at the end of the function, which
> could pollute the tracking map and trigger useless daemon sweeps.
While for NUMAB=1 we need to call task_numa_fault() just to record the fault
stats, its equivalent isn't necessary for NUMAB=2. I will make the call to
pghot_record_access() conditional so that it is skipped in this scenario.
>> +
>> + if (needs_promotion) {
>> + /*
>> + * Hot page promotion, mode=NUMA_BALANCING_MEMORY_TIERING.
>> + * Isolation and migration are handled by pghot.
>> + *
>> + * TODO: mode2 check
>> + */
>> + writable = false;
> Can this cause a regression involving spurious write-protect faults? By forcing
> writable = false here before jumping to out_map, the write permission appears
> to be stripped from the restored mapping even if the page was previously
> writable. If the application access was a write, it might immediately suffer a
> spurious write-protect fault when it retries the instruction.
>> + nid = target_nid;
>> + goto out_map;
>> + }
Since there is no migration from the fault context for NUMAB=2, we don't
release the PTL at all. Hence I realize that I don't have to modify
writable and ignore_writable for the hot page promotion case.
>[ ... ]
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 2f815a34d924..289fa6c07a42 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>[ ... ]
>> @@ -6086,16 +6087,31 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>> goto out_map;
>>
>> nid = folio_nid(folio);
>> + needs_promotion = folio_is_promo_candidate(folio);
>> nr_pages = folio_nr_pages(folio);
>>
>> target_nid = numa_migrate_check(folio, vmf, vmf->address, &flags,
>> writable, &last_cpupid);
>> if (target_nid == NUMA_NO_NODE)
>> goto out_map;
> Similarly to do_huge_pmd_numa_page(), could this also inadvertently record
> pages forbidden from migrating when jumping to out_map with an unmodified
> nid?
Same as above.
>> - if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
>> +
>> + if (needs_promotion) {
>> + /*
>> + * Hot page promotion, mode=NUMA_BALANCING_MEMORY_TIERING.
>> + * Isolation and migration are handled by pghot.
>> + */
>> + writable = false;
>> + ignore_writable = true;
> Would forcing writable = false and ignore_writable = true here also strip
> permissions and cause a performance regression from spurious write-protect
> faults for previously writable pages?
Same as above.
>> + nid = target_nid;
>> + goto out_map;
>> + }
>> +
>> + /* Balancing b/n toptier nodes, mode=NUMA_BALANCING_NORMAL */
>> + if (migrate_misplaced_folio_prepare(folio, vmf->vma, target_nid)) {
>> flags |= TNF_MIGRATE_FAIL;
>> goto out_map;
>> }
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index a5f48984ed3e..db6832b4b95b 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>[ ... ]
>> @@ -2796,6 +2808,8 @@ int migrate_misplaced_folios_batch(struct list_head *folio_list, int node)
>> #ifdef CONFIG_NUMA_BALANCING
>> count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
>> count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded);
>> +#endif
>> +#ifdef CONFIG_PGHOT
>> + mod_node_page_state(NODE_DATA(node), PGPROMOTE_SUCCESS, nr_succeeded);
>> #endif
>> }
> Should this use mod_lruvec_state() instead of mod_node_page_state() to
> accurately update both the node and memcg counters? PGPROMOTE_SUCCESS is
> defined in memcg_node_stat_items, so it is exported to the cgroup
> memory.stat interface. Using mod_node_page_state() bypasses memcg accounting,
> which might cause the cgroup-level promotion statistics to silently underflow
> or remain at zero.
Replied to this in one of the previous threads.
>> diff --git a/mm/pghot.c b/mm/pghot.c
>> index 7d7ef0800ae2..3c0ba254ad4c 100644
>> --- a/mm/pghot.c
>> +++ b/mm/pghot.c
>[ ... ]
>> @@ -141,6 +166,110 @@ int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
>> return 0;
>> }
>[ ... ]
>> +static bool kmigrated_should_migrate_memory(unsigned long nr_pages, int nid,
>> + unsigned long time)
>> +{
>> + struct pglist_data *pgdat;
>> + unsigned long rate_limit;
>> + unsigned int th, def_th;
>> + unsigned long now_ms = jiffies_to_msecs(jiffies); /* Based on full-width jiffies */
>> + unsigned long now = jiffies;
> Could this cause spurious threshold adjustments during millisecond
> wrap-arounds? jiffies_to_msecs() returns a 32-bit unsigned int which is
> zero-extended into the 64-bit unsigned long now_ms. When computing time
> differences like now_ms - start (where start is an unsigned int) in
> kmigrated_promotion_adjust_threshold(), the subtraction is performed in 64-bit arithmetic.
> When the 32-bit counter wraps around, the subtraction could yield a very large
> value that falsely satisfies conditions like > MSEC_PER_SEC or > th_period,
> which could cause a regression in rate-limiting.
It could wrap around, but since cmpxchg updates pgdat->nbp_rl_start to the
post-wrap value, it should be self-correcting. However, I can change now_ms
to unsigned int so the subtraction wraps correctly.
>> +
>> + pgdat = NODE_DATA(nid);
>> + if (pgdat_free_space_enough(pgdat)) {
>[ ... ]
>> +
>> + return !kmigrated_promotion_rate_limit(pgdat, rate_limit, nr_pages, now_ms);
>> +}
>[ ... ]
>> @@ -218,6 +347,11 @@ static void kmigrated_walk_zone(unsigned long start_pfn, unsigned long end_pfn,
>> goto out_next;
>> }
>>
>> + if (!kmigrated_should_migrate_memory(nr, nid, time)) {
>> + folio_put(folio);
>> + goto out_next;
>> + }
>> +
> Does this correctly advance the PFN when encountering large folio tail pages?
> Looking at the rest of kmigrated_walk_zone(), the loop iterator pfn is
> advanced by nr = folio_nr_pages(folio) at the out_next label.
> If the loop lands on a tail page of a large folio (for example, if a
> previous iteration failed a check and incremented by 1), folio_nr_pages()
> returns the size of the entire large folio. Adding the full folio size to a
> tail page's PFN overshoots the end of the folio, potentially skipping valid
> pages of subsequent allocations.
> Would it be safer to advance by
> folio_nr_pages(folio) - folio_page_idx(folio, page)?
We could end up on tail pages, leading to some folios being skipped, but I
think they will be reached in the next pass. Anyway, I will check whether your
suggestion can be incorporated without additional overhead.
Regards,
Bharata.