From: Bharata B Rao <bharata@amd.com>
To: <linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>
Cc: <Jonathan.Cameron@huawei.com>, <dave.hansen@intel.com>,
<gourry@gourry.net>, <mgorman@techsingularity.net>,
<mingo@redhat.com>, <peterz@infradead.org>,
<raghavendra.kt@amd.com>, <riel@surriel.com>,
<rientjes@google.com>, <sj@kernel.org>, <weixugc@google.com>,
<willy@infradead.org>, <ying.huang@linux.alibaba.com>,
<ziy@nvidia.com>, <dave@stgolabs.net>, <nifan.cxl@gmail.com>,
<xuezhengchu@huawei.com>, <yiannis@zptcorp.com>,
<akpm@linux-foundation.org>, <david@redhat.com>,
<byungchul@sk.com>, <kinseyho@google.com>,
<joshua.hahnjy@gmail.com>, <yuanchu@google.com>,
<balbirs@nvidia.com>, <alok.rathore@samsung.com>,
<shivankg@amd.com>
Subject: Re: [RFC PATCH v6 5/5] mm: sched: move NUMA balancing tiering promotion to pghot
Date: Mon, 30 Mar 2026 10:16:19 +0530 [thread overview]
Message-ID: <12112823-f4b7-4854-a32c-c40985c65521@amd.com> (raw)
In-Reply-To: <20260323095104.238982-6-bharata@amd.com>
Replies to review from Sashiko
[https://sashiko.dev/#/patchset/20260323095104.238982-1-bharata%40amd.com]
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index b298cba853ab..fe957ff91df9 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>[ ... ]
>> @@ -2217,11 +2218,26 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>> goto out_map;
>>
>> nid = folio_nid(folio);
>> + needs_promotion = folio_is_promo_candidate(folio);
>>
>> target_nid = numa_migrate_check(folio, vmf, haddr, &flags, writable,
>> &last_cpupid);
>> if (target_nid == NUMA_NO_NODE)
>> goto out_map;
> Does this code inadvertently record pages forbidden from migrating in pghot?
> If numa_migrate_check() returns NUMA_NO_NODE due to memory policy, the code
> jumps to out_map. At that point, nid retains its initialized value of
> folio_nid(folio). Since nid != NUMA_NO_NODE evaluates to true, it looks like
> pghot_record_access() will still be called at the end of the function, which
> could pollute the tracking map and trigger useless daemon sweeps.
While for NUMAB=1 we need to call task_numa_fault() just to record the fault
stats, its equivalent isn't necessary for NUMAB=2. I will make the call to
pghot_record_access() conditional so that it is skipped in this scenario.
>> +
>> + if (needs_promotion) {
>> + /*
>> + * Hot page promotion, mode=NUMA_BALANCING_MEMORY_TIERING.
>> + * Isolation and migration are handled by pghot.
>> + *
>> + * TODO: mode2 check
>> + */
>> + writable = false;
> Can this cause a regression involving spurious write-protect faults? By forcing
> writable = false here before jumping to out_map, the write permission appears
> to be stripped from the restored mapping even if the page was previously
> writable. If the application access was a write, it might immediately suffer a
> spurious write-protect fault when it retries the instruction.
>> + nid = target_nid;
>> + goto out_map;
>> + }
Since there is no migration from the fault context for NUMAB=2, we don't
release the PTL at all. Hence I realize that I don't have to modify
writable and ignore_writable for the hot page promotion case.
>[ ... ]
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 2f815a34d924..289fa6c07a42 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>[ ... ]
>> @@ -6086,16 +6087,31 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>> goto out_map;
>>
>> nid = folio_nid(folio);
>> + needs_promotion = folio_is_promo_candidate(folio);
>> nr_pages = folio_nr_pages(folio);
>>
>> target_nid = numa_migrate_check(folio, vmf, vmf->address, &flags,
>> writable, &last_cpupid);
>> if (target_nid == NUMA_NO_NODE)
>> goto out_map;
> Similarly to do_huge_pmd_numa_page(), could this also inadvertently record
> pages forbidden from migrating when jumping to out_map with an unmodified
> nid?
Same as above.
>> - if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
>> +
>> + if (needs_promotion) {
>> + /*
>> + * Hot page promotion, mode=NUMA_BALANCING_MEMORY_TIERING.
>> + * Isolation and migration are handled by pghot.
>> + */
>> + writable = false;
>> + ignore_writable = true;
> Would forcing writable = false and ignore_writable = true here also strip
> permissions and cause a performance regression from spurious write-protect
> faults for previously writable pages?
Same as above.
>> + nid = target_nid;
>> + goto out_map;
>> + }
>> +
>> + /* Balancing b/n toptier nodes, mode=NUMA_BALANCING_NORMAL */
>> + if (migrate_misplaced_folio_prepare(folio, vmf->vma, target_nid)) {
>> flags |= TNF_MIGRATE_FAIL;
>> goto out_map;
>> }
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index a5f48984ed3e..db6832b4b95b 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>[ ... ]
>> @@ -2796,6 +2808,8 @@ int migrate_misplaced_folios_batch(struct list_head *folio_list, int node)
>> #ifdef CONFIG_NUMA_BALANCING
>> count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
>> count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded);
>> +#endif
>> +#ifdef CONFIG_PGHOT
>> + mod_node_page_state(NODE_DATA(node), PGPROMOTE_SUCCESS, nr_succeeded);
>> #endif
>> }
> Should this use mod_lruvec_state() instead of mod_node_page_state() to
> accurately update both the node and memcg counters? PGPROMOTE_SUCCESS is
> defined in memcg_node_stat_items, so it is exported to the cgroup
> memory.stat interface. Using mod_node_page_state() bypasses memcg accounting,
> which might cause the cgroup-level promotion statistics to silently underflow
> or remain at zero.
Replied to this in one of the previous threads.
>> diff --git a/mm/pghot.c b/mm/pghot.c
>> index 7d7ef0800ae2..3c0ba254ad4c 100644
>> --- a/mm/pghot.c
>> +++ b/mm/pghot.c
>[ ... ]
>> @@ -141,6 +166,110 @@ int pghot_record_access(unsigned long pfn, int nid, int src, unsigned long now)
>> return 0;
>> }
>[ ... ]
>> +static bool kmigrated_should_migrate_memory(unsigned long nr_pages, int nid,
>> + unsigned long time)
>> +{
>> + struct pglist_data *pgdat;
>> + unsigned long rate_limit;
>> + unsigned int th, def_th;
>> + unsigned long now_ms = jiffies_to_msecs(jiffies); /* Based on full-width jiffies */
>> + unsigned long now = jiffies;
> Could this cause spurious threshold adjustments during millisecond
> wrap-arounds? jiffies_to_msecs() returns a 32-bit unsigned int which is
> zero-extended into the 64-bit unsigned long now_ms. When computing time
> differences like now_ms - start (where start is an unsigned int) in
> kmigrated_promotion_adjust_threshold(), the subtraction is performed in 64-bit arithmetic.
> When the 32-bit counter wraps around, the subtraction could yield a very large
> value that falsely satisfies conditions like > MSEC_PER_SEC or > th_period,
> which could cause a regression in rate-limiting.
It could wrap around, but since cmpxchg updates pgdat->nbp_rl_start to the
post-wrap value, it should be self-correcting. However, I can change now_ms
to unsigned int so the subtraction wraps correctly.
>> +
>> + pgdat = NODE_DATA(nid);
>> + if (pgdat_free_space_enough(pgdat)) {
>[ ... ]
>> +
>> + return !kmigrated_promotion_rate_limit(pgdat, rate_limit, nr_pages, now_ms);
>> +}
>[ ... ]
>> @@ -218,6 +347,11 @@ static void kmigrated_walk_zone(unsigned long start_pfn, unsigned long end_pfn,
>> goto out_next;
>> }
>>
>> + if (!kmigrated_should_migrate_memory(nr, nid, time)) {
>> + folio_put(folio);
>> + goto out_next;
>> + }
>> +
> Does this correctly advance the PFN when encountering large folio tail pages?
> Looking at the rest of kmigrated_walk_zone(), the loop iterator pfn is
> advanced by nr = folio_nr_pages(folio) at the out_next label.
> If the loop lands on a tail page of a large folio (for example, if a
> previous iteration failed a check and incremented by 1), folio_nr_pages()
> returns the size of the entire large folio. Adding the full folio size to a
> tail page's PFN overshoots the end of the folio, potentially skipping valid
> pages of subsequent allocations.
> Would it be safer to advance by
> folio_nr_pages(folio) - folio_page_idx(folio, page)?
We could end up on tail pages, leading to some folios being skipped, but I
think they will be reached in the next pass. Anyway, I will check whether your
suggestion can be incorporated without additional overhead.
Regards,
Bharata.