linux-mm.kvack.org archive mirror

* [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
@ 2025-06-19  7:52 Li Zhijian
  2025-06-20  6:28 ` Huang, Ying
  0 siblings, 1 reply; 4+ messages in thread
From: Li Zhijian @ 2025-06-19  7:52 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, linux-kernel, y-goto, Li Zhijian, Huang Ying, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider

Goto-san reported confusing pgpromote statistics where
the pgpromote_success count significantly exceeded pgpromote_candidate.
The issue manifests under specific memory pressure conditions:
when top-tier memory (DRAM) is exhausted by memhog and allocation begins
in lower-tier memory (CXL). After terminating memhog, the stats show:

$ grep -e pgpromote /proc/vmstat
pgpromote_success 2579
pgpromote_candidate 1

This update increments PGPROMOTE_CANDIDATE within the free space branch
when a promotion decision is made, which may alter the mechanism of the
rate limit. Consequently, it becomes easier to reach the rate limit than
it was previously.

For example:
Rate Limit = 100 pages/sec
Scenario:
  T0: 90 free-space migrations
  T0+100ms: 20-page migration request

Before:
  Rate limit is *not* reached: 0 + 20 = 20 < 100
  PGPROMOTE_CANDIDATE: 20
After:
  Rate limit is reached: 90 + 20 = 110 > 100
  PGPROMOTE_CANDIDATE: 110
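
The arithmetic above can be replayed with a small userspace model. This
is only a sketch, not kernel code: struct node_model, model_rate_limit(),
model_free_space_promotion() and replay_example() are hypothetical names,
and the one-second window, the ">= rate_limit" comparison and the
assumption that the window opened at T0 with a zero baseline are
simplifications made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MSEC_PER_SEC 1000u

/* Hypothetical stand-in for the per-node counters; not struct pglist_data. */
struct node_model {
	unsigned long candidate;   /* models PGPROMOTE_CANDIDATE */
	unsigned long rl_nr_cand;  /* candidate count when the window opened */
	unsigned int rl_start_ms;  /* time at which the window opened */
};

/* Rate-limited path: account nr candidates, then compare against the limit. */
static bool model_rate_limit(struct node_model *n, unsigned long rate_limit,
			     unsigned int now_ms, long nr)
{
	assert(nr > 0);
	n->candidate += (unsigned long)nr;
	if (now_ms - n->rl_start_ms > MODEL_MSEC_PER_SEC) {
		/* Window expired: open a new one with the current count as baseline. */
		n->rl_start_ms = now_ms;
		n->rl_nr_cand = n->candidate;
	}
	return n->candidate - n->rl_nr_cand >= rate_limit;
}

/* Free-space path: only the patched variant touches the candidate counter. */
static void model_free_space_promotion(struct node_model *n, long nr, bool patched)
{
	assert(nr > 0);
	if (patched)
		n->candidate += (unsigned long)nr;
}

/* Replays the scenario: 90 free-space pages at T0, a 20-page request at T0+100ms. */
static bool replay_example(bool patched, unsigned long *candidate_out)
{
	struct node_model n = { .candidate = 0, .rl_nr_cand = 0, .rl_start_ms = 5000 };
	bool limited;

	model_free_space_promotion(&n, 90, patched);   /* T0 */
	limited = model_rate_limit(&n, 100, 5100, 20); /* T0 + 100ms */
	*candidate_out = n.candidate;
	return limited;
}
```

Calling replay_example(false, &c) models the current behaviour and
replay_example(true, &c) models the behaviour with this patch applied;
the return value says whether the 20-page request hits the rate limit.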


Reported-by: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
---

This is marked as RFC because I am uncertain whether this behavior was
originally intended or simply overlooked.

However, the current situation where pgpromote_candidate < pgpromote_success
is indeed confusing when interpreted literally.

Cc: Huang Ying <ying.huang@linux.alibaba.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
---
 kernel/sched/fair.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a14da5396fb..4715cd4fa248 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1940,11 +1940,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
 		struct pglist_data *pgdat;
 		unsigned long rate_limit;
 		unsigned int latency, th, def_th;
+		long nr = folio_nr_pages(folio)
 
 		pgdat = NODE_DATA(dst_nid);
 		if (pgdat_free_space_enough(pgdat)) {
 			/* workload changed, reset hot threshold */
 			pgdat->nbp_threshold = 0;
+			mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
 			return true;
 		}
 
@@ -1958,8 +1960,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
 		if (latency >= th)
 			return false;
 
-		return !numa_promotion_rate_limit(pgdat, rate_limit,
-						  folio_nr_pages(folio));
+		return !numa_promotion_rate_limit(pgdat, rate_limit, nr);
 	}
 
 	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
-- 
2.43.5




* Re: [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
  2025-06-19  7:52 [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting Li Zhijian
@ 2025-06-20  6:28 ` Huang, Ying
  2025-06-23  8:54   ` Zhijian Li (Fujitsu)
  0 siblings, 1 reply; 4+ messages in thread
From: Huang, Ying @ 2025-06-20  6:28 UTC (permalink / raw)
  To: Li Zhijian
  Cc: linux-mm, akpm, linux-kernel, y-goto, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider

Li Zhijian <lizhijian@fujitsu.com> writes:

> Goto-san reported confusing pgpromote statistics where
> the pgpromote_success count significantly exceeded pgpromote_candidate.
> The issue manifests under specific memory pressure conditions:
> when top-tier memory (DRAM) is exhausted by memhog and allocation begins
> in lower-tier memory (CXL). After terminating memhog, the stats show:

The above description is confusing.  The page promotion occurs when the
size of the top-tier free space is large enough (after killing the
memhog above).  The accessed lower-tier memory will be promoted upon
access to take full advantage of the more expensive top-tier memory.

> $ grep -e pgpromote /proc/vmstat
> pgpromote_success 2579
> pgpromote_candidate 1
>
> This update increments PGPROMOTE_CANDIDATE within the free space branch
> when a promotion decision is made, which may alter the mechanism of the
> rate limit. Consequently, it becomes easier to reach the rate limit than
> it was previously.
>
> For example:
> Rate Limit = 100 pages/sec
> Scenario:
>   T0: 90 free-space migrations
>   T0+100ms: 20-page migration request
>
> Before:
>   Rate limit is *not* reached: 0 + 20 = 20 < 100
>   PGPROMOTE_CANDIDATE: 20
> After:
>   Rate limit is reached: 90 + 20 = 110 > 100
>   PGPROMOTE_CANDIDATE: 110

Yes.  The rate limit will be influenced by the change.  So, more tests
may be needed to verify it will not incur regressions.

>
> Reported-by: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>
> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
> ---
>
> This is marked as RFC because I am uncertain whether this behavior was
> originally intended or simply overlooked.
>
> However, the current situation where pgpromote_candidate < pgpromote_success
> is indeed confusing when interpreted literally.
>
> Cc: Huang Ying <ying.huang@linux.alibaba.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Valentin Schneider <vschneid@redhat.com>
> ---
>  kernel/sched/fair.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7a14da5396fb..4715cd4fa248 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1940,11 +1940,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>  		struct pglist_data *pgdat;
>  		unsigned long rate_limit;
>  		unsigned int latency, th, def_th;
> +		long nr = folio_nr_pages(folio)
>  
>  		pgdat = NODE_DATA(dst_nid);
>  		if (pgdat_free_space_enough(pgdat)) {
>  			/* workload changed, reset hot threshold */
>  			pgdat->nbp_threshold = 0;
> +			mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
>  			return true;
>  		}
>  
> @@ -1958,8 +1960,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>  		if (latency >= th)
>  			return false;
>  
> -		return !numa_promotion_rate_limit(pgdat, rate_limit,
> -						  folio_nr_pages(folio));
> +		return !numa_promotion_rate_limit(pgdat, rate_limit, nr);
>  	}
>  
>  	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);

---
Best Regards,
Huang, Ying



* Re: [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
  2025-06-20  6:28 ` Huang, Ying
@ 2025-06-23  8:54   ` Zhijian Li (Fujitsu)
  2025-06-24  2:46     ` Huang, Ying
  0 siblings, 1 reply; 4+ messages in thread
From: Zhijian Li (Fujitsu) @ 2025-06-23  8:54 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, akpm, linux-kernel, Yasunori Gotou (Fujitsu),
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, kernel test robot



On 20/06/2025 14:28, Huang, Ying wrote:
> Li Zhijian <lizhijian@fujitsu.com> writes:
> 
>> Goto-san reported confusing pgpromote statistics where
>> the pgpromote_success count significantly exceeded pgpromote_candidate.
>> The issue manifests under specific memory pressure conditions:
>> when top-tier memory (DRAM) is exhausted by memhog and allocation begins
>> in lower-tier memory (CXL). After terminating memhog, the stats show:
> 
> The above description is confusing.  The page promotion occurs when the
> size of the top-tier free space is large enough (after killing the
> memhog above).  The accessed lower-tier memory will be promoted upon
> access to take full advantage of the more expensive top-tier memory.

Yeah, that's what the promotion does.

Let's clarify the reproducer steps specifically (thanks to Goto-san for the reproducer).
On a system with three nodes (nodes 0-1: DRAM 4GB, node 2: NVDIMM 4GB):

# Enable demotion only
echo 1 > /sys/kernel/mm/numa/demotion_enabled
numactl -m 0-1 memhog -r200 3500M >/dev/null &
pid=$!
sleep 2
numactl memhog -r100 2500M >/dev/null &
sleep 10
kill -9 $pid
# Enable promotion
echo 2 > /proc/sys/kernel/numa_balancing

# After a few seconds, we observe `pgpromote_candidate < pgpromote_success`

In this scenario, after terminating the first memhog, the conditions for
pgdat_free_space_enough() are quickly met, triggering promotion. However,
these migrated pages are only accounted for in PGPROMOTE_SUCCESS, not in
PGPROMOTE_CANDIDATE.
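
To check for this condition after running the reproducer, the two
counters can be pulled out of /proc/vmstat-style text and compared. The
following is a minimal sketch; vmstat_lookup() and
promote_counters_inconsistent() are hypothetical helper names, and the
input is assumed to be "name value" pairs, one per line, as in the grep
output quoted in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Find "key <value>" in vmstat-style text; returns false if key is absent. */
static bool vmstat_lookup(const char *text, const char *key,
			  unsigned long long *val)
{
	char name[64];
	unsigned long long v;
	const char *line = text;

	assert(text != NULL && key != NULL && val != NULL);
	while (line && *line) {
		if (sscanf(line, "%63s %llu", name, &v) == 2 &&
		    strcmp(name, key) == 0) {
			*val = v;
			return true;
		}
		/* Advance to the character after the next newline, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return false;
}

/*
 * Returns 1 if pgpromote_success exceeds pgpromote_candidate,
 * 0 if it does not, and -1 if either counter is missing.
 */
static int promote_counters_inconsistent(const char *text)
{
	unsigned long long success, candidate;

	if (!vmstat_lookup(text, "pgpromote_success", &success) ||
	    !vmstat_lookup(text, "pgpromote_candidate", &candidate))
		return -1;
	return success > candidate ? 1 : 0;
}
```

Fed the contents of /proc/vmstat read into a buffer, a return value of 1
corresponds to the `pgpromote_candidate < pgpromote_success` observation
described above.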


> 
>> $ grep -e pgpromote /proc/vmstat
>> pgpromote_success 2579
>> pgpromote_candidate 1
>>
>> This update increments PGPROMOTE_CANDIDATE within the free space branch
>> when a promotion decision is made, which may alter the mechanism of the
>> rate limit. Consequently, it becomes easier to reach the rate limit than
>> it was previously.
>>
>> For example:
>> Rate Limit = 100 pages/sec
>> Scenario:
>>    T0: 90 free-space migrations
>>    T0+100ms: 20-page migration request
>>
>> Before:
>>    Rate limit is *not* reached: 0 + 20 = 20 < 100
>>    PGPROMOTE_CANDIDATE: 20
>> After:
>>    Rate limit is reached: 90 + 20 = 110 > 100
>>    PGPROMOTE_CANDIDATE: 110
> 
> Yes.  The rate limit will be influenced by the change.  So, more tests
> may be needed to verify it will not incur regressions.


Testing this might be challenging due to workload dependencies. Do you have
any recommended workloads for evaluation? Alternatively, could we rely on the
LKP project for impact assessment? (The current patch has not really been
tested by LKP due to a compile error; I will post a V2 soon.)

However, regarding the rate limit change itself, I consider this patch
logically correct. As stated in the numa_promotion_rate_limit() comment:
> "For memory tiering mode, too high promotion/demotion throughput may hurt application latency."
There seems to be no justification for excluding promotions triggered by
pgdat_free_space_enough() from the rate limiting mechanism.



> 
>>
>> Reported-by: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>
>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
>> ---
>>
>> This is marked as RFC because I am uncertain whether this behavior was
>> originally intended or simply overlooked.
>>
>> However, the current situation where pgpromote_candidate < pgpromote_success
>> is indeed confusing when interpreted literally.
>>
>> Cc: Huang Ying <ying.huang@linux.alibaba.com>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Juri Lelli <juri.lelli@redhat.com>
>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
>> Cc: Steven Rostedt <rostedt@goodmis.org>
>> Cc: Ben Segall <bsegall@google.com>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Valentin Schneider <vschneid@redhat.com>
>> ---
>>   kernel/sched/fair.c | 5 +++--
>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7a14da5396fb..4715cd4fa248 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1940,11 +1940,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>>   		struct pglist_data *pgdat;
>>   		unsigned long rate_limit;
>>   		unsigned int latency, th, def_th;
>> +		long nr = folio_nr_pages(folio)


Cc LKP

There is a compilation error which I overlooked at the time due to several ongoing refactors in
my local code. I appreciate LKP for detecting this issue.


Thanks
Zhijian


>>   
>>   		pgdat = NODE_DATA(dst_nid);
>>   		if (pgdat_free_space_enough(pgdat)) {
>>   			/* workload changed, reset hot threshold */
>>   			pgdat->nbp_threshold = 0;
>> +			mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
>>   			return true;
>>   		}
>>   
>> @@ -1958,8 +1960,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>>   		if (latency >= th)
>>   			return false;
>>   
>> -		return !numa_promotion_rate_limit(pgdat, rate_limit,
>> -						  folio_nr_pages(folio));
>> +		return !numa_promotion_rate_limit(pgdat, rate_limit, nr);
>>   	}
>>   
>>   	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
> 
> ---
> Best Regards,
> Huang, Ying


* Re: [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
  2025-06-23  8:54   ` Zhijian Li (Fujitsu)
@ 2025-06-24  2:46     ` Huang, Ying
  0 siblings, 0 replies; 4+ messages in thread
From: Huang, Ying @ 2025-06-24  2:46 UTC (permalink / raw)
  To: Zhijian Li (Fujitsu)
  Cc: linux-mm, akpm, linux-kernel, Yasunori Gotou (Fujitsu),
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, kernel test robot

"Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com> writes:

> On 20/06/2025 14:28, Huang, Ying wrote:
>> Li Zhijian <lizhijian@fujitsu.com> writes:
>> 
>>> Goto-san reported confusing pgpromote statistics where
>>> the pgpromote_success count significantly exceeded pgpromote_candidate.
>>> The issue manifests under specific memory pressure conditions:
>>> when top-tier memory (DRAM) is exhausted by memhog and allocation begins
>>> in lower-tier memory (CXL). After terminating memhog, the stats show:
>> 
>> The above description is confusing.  The page promotion occurs when the
>> size of the top-tier free space is large enough (after killing the
>> memhog above).  The accessed lower-tier memory will be promoted upon
>> access to take full advantage of the more expensive top-tier memory.
>
> Yeah, that's what the promotion does.
>
> Let's clarify the reproducer steps specifically (thanks to Goto-san for the reproducer).
> On a system with three nodes (nodes 0-1: DRAM 4GB, node 2: NVDIMM 4GB):
>
> # Enable demotion only
> echo 1 > /sys/kernel/mm/numa/demotion_enabled
> numactl -m 0-1 memhog -r200 3500M >/dev/null &
> pid=$!
> sleep 2
> numactl memhog -r100 2500M >/dev/null &
> sleep 10
> kill -9 $pid
> # Enable promotion
> echo 2 > /proc/sys/kernel/numa_balancing
>
> # After a few seconds, we observe `pgpromote_candidate < pgpromote_success`
>
> In this scenario, after terminating the first memhog, the conditions
> for pgdat_free_space_enough() are quickly met, triggering promotion.
> However, these migrated pages are only accounted for in PGPROMOTE_SUCCESS, not in PGPROMOTE_CANDIDATE.

Yes.  This is the expected behavior of the current implementation.

>
>> 
>>> $ grep -e pgpromote /proc/vmstat
>>> pgpromote_success 2579
>>> pgpromote_candidate 1
>>>
>>> This update increments PGPROMOTE_CANDIDATE within the free space branch
>>> when a promotion decision is made, which may alter the mechanism of the
>>> rate limit. Consequently, it becomes easier to reach the rate limit than
>>> it was previously.
>>>
>>> For example:
>>> Rate Limit = 100 pages/sec
>>> Scenario:
>>>    T0: 90 free-space migrations
>>>    T0+100ms: 20-page migration request
>>>
>>> Before:
>>>    Rate limit is *not* reached: 0 + 20 = 20 < 100
>>>    PGPROMOTE_CANDIDATE: 20
>>> After:
>>>    Rate limit is reached: 90 + 20 = 110 > 100
>>>    PGPROMOTE_CANDIDATE: 110
>> 
>> Yes.  The rate limit will be influenced by the change.  So, more tests
>> may be needed to verify it will not incur regressions.
>
>
> Testing this might be challenging due to workload dependencies. Do you
> have any recommended workloads for evaluation?

Some in-memory databases should be good workloads, for example, redis.

> Alternatively, could we rely on the LKP project for impact assessment?
> (The current patch has not really been tested by LKP due to a compile
> error; I will post a V2 soon.)

LKP has some basic workloads to test this, for example, pmbench with
the Gauss-ih access pattern.

> However, regarding the rate limit change itself, I consider this patch
> logically correct. As stated in the numa_promotion_rate_limit()
> comment:
>> "For memory tiering mode, too high promotion/demotion throughput may hurt application latency."
> It seems there is no justification for excluding
> pgdat_free_space_enough() triggered promotions from the rate limiting
> mechanism.

In fact, we don't rate limit promotion when there is enough free space
in fast memory, so that the fast memory fills up quickly.  I think it's
necessary to stop the fast memory from being under-utilized ASAP.
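
The ordering of checks described here, as it appears in the hunk quoted
earlier in this thread, can be summarized with a small decision model.
This is an illustrative sketch only: enum promote_decision, struct
promote_input and classify() are hypothetical names, and the results of
pgdat_free_space_enough() and numa_promotion_rate_limit() are passed in
as plain flags rather than computed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum promote_decision {
	PROMOTE_FREE_SPACE,   /* promoted without consulting the rate limit */
	PROMOTE_HOT,          /* hot enough and under the rate limit */
	REJECT_COLD,          /* access latency at or above the hot threshold */
	REJECT_RATE_LIMITED   /* hot enough, but the rate limit was reached */
};

/* Inputs to the model; each field stands in for a value the kernel computes. */
struct promote_input {
	bool free_space_enough;  /* models pgdat_free_space_enough() */
	unsigned int latency;    /* hint page fault latency */
	unsigned int th;         /* hot threshold */
	bool rate_limited;       /* models numa_promotion_rate_limit() */
};

/* Apply the checks in the same order as the quoted hunk. */
static enum promote_decision classify(const struct promote_input *in)
{
	assert(in != NULL);
	if (in->free_space_enough)
		return PROMOTE_FREE_SPACE;
	if (in->latency >= in->th)
		return REJECT_COLD;
	if (in->rate_limited)
		return REJECT_RATE_LIMITED;
	return PROMOTE_HOT;
}
```

The first branch is the one under discussion: it returns before the
rate limit is consulted, which is why those promotions are not counted
as candidates in the current implementation.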

>
>
>> 
>>>
>>> Reported-by: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>
>>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>

[snip]

---
Best Regards,
Huang, Ying

