* [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
From: Li Zhijian @ 2025-06-19 7:52 UTC
To: linux-mm
Cc: akpm, linux-kernel, y-goto, Li Zhijian, Huang Ying, Ingo Molnar,
    Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
    Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider

Goto-san reported confusing pgpromote statistics where
the pgpromote_success count significantly exceeded pgpromote_candidate.
The issue manifests under specific memory pressure conditions:
when top-tier memory (DRAM) is exhausted by memhog and allocation begins
in lower-tier memory (CXL). After terminating memhog, the stats show:

$ grep -e pgpromote /proc/vmstat
pgpromote_success 2579
pgpromote_candidate 1

This update increments PGPROMOTE_CANDIDATE within the free space branch
when a promotion decision is made, which may alter the mechanism of the
rate limit. Consequently, it becomes easier to reach the rate limit than
it was previously.

For example:
Rate Limit = 100 pages/sec
Scenario:
  T0: 90 free-space migrations
  T0+100ms: 20-page migration request

Before:
  Rate limit is *not* reached: 0 + 20 = 20 < 100
  PGPROMOTE_CANDIDATE: 20
After:
  Rate limit is reached: 90 + 20 = 110 > 100
  PGPROMOTE_CANDIDATE: 110

Reported-by: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
---

This is marked as RFC because I am uncertain whether we originally
intended for this or if it was overlooked.

However, the current situation where pgpromote_candidate < pgpromote_success
is indeed confusing when interpreted literally.

Cc: Huang Ying <ying.huang@linux.alibaba.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
---
 kernel/sched/fair.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a14da5396fb..4715cd4fa248 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1940,11 +1940,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
 		struct pglist_data *pgdat;
 		unsigned long rate_limit;
 		unsigned int latency, th, def_th;
+		long nr = folio_nr_pages(folio)
 
 		pgdat = NODE_DATA(dst_nid);
 		if (pgdat_free_space_enough(pgdat)) {
 			/* workload changed, reset hot threshold */
 			pgdat->nbp_threshold = 0;
+			mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
 			return true;
 		}
 
@@ -1958,8 +1960,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
 		if (latency >= th)
 			return false;
 
-		return !numa_promotion_rate_limit(pgdat, rate_limit,
-						  folio_nr_pages(folio));
+		return !numa_promotion_rate_limit(pgdat, rate_limit, nr);
 	}
 
 	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
-- 
2.43.5

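The before/after arithmetic in the commit message above can be replayed with a small userspace model. This is only a sketch under stated assumptions: `struct node_model`, `model_free_space_promote()`, `model_rate_limited()` and `scenario_limited()` are hypothetical names invented for illustration, not kernel symbols, and the one-second window handling is a simplification rather than a copy of numa_promotion_rate_limit().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of one node; not a kernel structure. */
struct node_model {
	unsigned long candidate;       /* models PGPROMOTE_CANDIDATE */
	unsigned long window_base;     /* candidate count when the window began */
	unsigned long window_start_ms; /* start of the current 1 s window */
};

/* Free-space promotion of nr pages; count_candidate selects old/new accounting. */
static void model_free_space_promote(struct node_model *n, unsigned long nr,
				     bool count_candidate)
{
	assert(n);
	if (count_candidate)
		n->candidate += nr;
}

/* Returns true when the request pushes the window to or past rate_limit. */
static bool model_rate_limited(struct node_model *n, unsigned long now_ms,
			       unsigned long rate_limit, unsigned long nr)
{
	assert(n);
	n->candidate += nr;
	if (now_ms - n->window_start_ms > 1000) {
		/* Assumed simplification: a new window starts from the current count. */
		n->window_start_ms = now_ms;
		n->window_base = n->candidate;
	}
	return n->candidate - n->window_base >= rate_limit;
}

/* Replays the commit-message scenario with a limit of 100 pages/sec. */
static bool scenario_limited(bool count_free_space)
{
	struct node_model n = { 0, 0, 0 };

	model_free_space_promote(&n, 90, count_free_space); /* T0 */
	return model_rate_limited(&n, 100, 100, 20);        /* T0+100ms */
}
```

In this model, scenario_limited(false) stands for the accounting before the patch and scenario_limited(true) for the accounting after it; the only difference between the two runs is whether the 90 free-space pages are added to the candidate counter.
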
* Re: [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
From: Huang, Ying @ 2025-06-20 6:28 UTC
To: Li Zhijian
Cc: linux-mm, akpm, linux-kernel, y-goto, Ingo Molnar, Peter Zijlstra,
    Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
    Ben Segall, Mel Gorman, Valentin Schneider

Li Zhijian <lizhijian@fujitsu.com> writes:

> Goto-san reported confusing pgpromote statistics where
> the pgpromote_success count significantly exceeded pgpromote_candidate.
> The issue manifests under specific memory pressure conditions:
> when top-tier memory (DRAM) is exhausted by memhog and allocation begins
> in lower-tier memory (CXL). After terminating memhog, the stats show:

The above description is confusing. The page promotion occurs when the
size of the top-tier free space is large enough (after killing the
memhog above). The accessed lower-tier memory will be promoted upon
accessing to take full advantage of the more expensive top-tier memory.

> $ grep -e pgpromote /proc/vmstat
> pgpromote_success 2579
> pgpromote_candidate 1
>
> This update increments PGPROMOTE_CANDIDATE within the free space branch
> when a promotion decision is made, which may alter the mechanism of the
> rate limit. Consequently, it becomes easier to reach the rate limit than
> it was previously.
>
> For example:
> Rate Limit = 100 pages/sec
> Scenario:
>   T0: 90 free-space migrations
>   T0+100ms: 20-page migration request
>
> Before:
>   Rate limit is *not* reached: 0 + 20 = 20 < 100
>   PGPROMOTE_CANDIDATE: 20
> After:
>   Rate limit is reached: 90 + 20 = 110 > 100
>   PGPROMOTE_CANDIDATE: 110

Yes. The rate limit will be influenced by the change. So, more tests
may be needed to verify it will not incur regressions.

>
> Reported-by: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>
> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
> ---
>
> This is marked as RFC because I am uncertain whether we originally
> intended for this or if it was overlooked.
>
> However, the current situation where pgpromote_candidate < pgpromote_success
> is indeed confusing when interpreted literally.
>
> Cc: Huang Ying <ying.huang@linux.alibaba.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Valentin Schneider <vschneid@redhat.com>
> ---
>  kernel/sched/fair.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7a14da5396fb..4715cd4fa248 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1940,11 +1940,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>  		struct pglist_data *pgdat;
>  		unsigned long rate_limit;
>  		unsigned int latency, th, def_th;
> +		long nr = folio_nr_pages(folio)
>
>  		pgdat = NODE_DATA(dst_nid);
>  		if (pgdat_free_space_enough(pgdat)) {
>  			/* workload changed, reset hot threshold */
>  			pgdat->nbp_threshold = 0;
> +			mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
>  			return true;
>  		}
>
> @@ -1958,8 +1960,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>  		if (latency >= th)
>  			return false;
>
> -		return !numa_promotion_rate_limit(pgdat, rate_limit,
> -						  folio_nr_pages(folio));
> +		return !numa_promotion_rate_limit(pgdat, rate_limit, nr);
>  	}
>
>  	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);

---
Best Regards,
Huang, Ying

* Re: [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
From: Zhijian Li (Fujitsu) @ 2025-06-23 8:54 UTC
To: Huang, Ying
Cc: linux-mm, akpm, linux-kernel, Yasunori Gotou (Fujitsu), Ingo Molnar,
    Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
    Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
    kernel test robot

On 20/06/2025 14:28, Huang, Ying wrote:
> Li Zhijian <lizhijian@fujitsu.com> writes:
>
>> Goto-san reported confusing pgpromote statistics where
>> the pgpromote_success count significantly exceeded pgpromote_candidate.
>> The issue manifests under specific memory pressure conditions:
>> when top-tier memory (DRAM) is exhausted by memhog and allocation begins
>> in lower-tier memory (CXL). After terminating memhog, the stats show:
>
> The above description is confusing. The page promotion occurs when the
> size of the top-tier free space is large enough (after killing the
> memhog above). The accessed lower-tier memory will be promoted upon
> accessing to take full advantage of the more expensive top-tier memory.

Yeah, that's what the promotion does.

Let's clarify the reproducer steps specifically (thanks Goto-san for the reproducer):
On a system with three nodes (nodes 0-1: DRAM 4GB, node 2: NVDIMM 4GB):

# Enable demotion only
echo 1 > /sys/kernel/mm/numa/demotion_enabled
numactl -m 0-1 memhog -r200 3500M >/dev/null &
pid=$!
sleep 2
numactl memhog -r100 2500M >/dev/null &
sleep 10
kill -9 $pid
# Enable promotion
echo 2 > /proc/sys/kernel/numa_balancing

# After a few seconds, we observe `pgpromote_candidate < pgpromote_success`

In this scenario, after terminating the first memhog, the conditions
for pgdat_free_space_enough() are quickly met, triggering promotion.
However, these migrated pages are only accounted for in PGPROMOTE_SUCCESS, not in PGPROMOTE_CANDIDATE.

>
>> $ grep -e pgpromote /proc/vmstat
>> pgpromote_success 2579
>> pgpromote_candidate 1
>>
>> This update increments PGPROMOTE_CANDIDATE within the free space branch
>> when a promotion decision is made, which may alter the mechanism of the
>> rate limit. Consequently, it becomes easier to reach the rate limit than
>> it was previously.
>>
>> For example:
>> Rate Limit = 100 pages/sec
>> Scenario:
>>   T0: 90 free-space migrations
>>   T0+100ms: 20-page migration request
>>
>> Before:
>>   Rate limit is *not* reached: 0 + 20 = 20 < 100
>>   PGPROMOTE_CANDIDATE: 20
>> After:
>>   Rate limit is reached: 90 + 20 = 110 > 100
>>   PGPROMOTE_CANDIDATE: 110
>
> Yes. The rate limit will be influenced by the change. So, more tests
> may be needed to verify it will not incur regressions.

Testing this might be challenging due to workload dependencies. Do you
have any recommended workloads for evaluation?
Alternatively, could we rely on the LKP project for impact assessment?
(The current patch has not really been tested by LKP due to a compile
error; I will post a V2 soon.)

However, regarding the rate limit change itself, I consider this patch
logically correct. As stated in the numa_promotion_rate_limit() comment:
> "For memory tiering mode, too high promotion/demotion throughput may hurt application latency."

It seems there is no justification for excluding
pgdat_free_space_enough() triggered promotions from the rate limiting
mechanism.

>
>>
>> Reported-by: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>
>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
>> ---
>>
>> This is marked as RFC because I am uncertain whether we originally
>> intended for this or if it was overlooked.
>>
>> However, the current situation where pgpromote_candidate < pgpromote_success
>> is indeed confusing when interpreted literally.
>>
>> Cc: Huang Ying <ying.huang@linux.alibaba.com>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Juri Lelli <juri.lelli@redhat.com>
>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
>> Cc: Steven Rostedt <rostedt@goodmis.org>
>> Cc: Ben Segall <bsegall@google.com>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Valentin Schneider <vschneid@redhat.com>
>> ---
>>  kernel/sched/fair.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7a14da5396fb..4715cd4fa248 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1940,11 +1940,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>>  		struct pglist_data *pgdat;
>>  		unsigned long rate_limit;
>>  		unsigned int latency, th, def_th;
>> +		long nr = folio_nr_pages(folio)

Cc LKP

There is a compilation error which I overlooked at the time due to
several ongoing refactors in my local code.
I appreciate LKP for detecting this issue.

Thanks
Zhijian

>>
>>  		pgdat = NODE_DATA(dst_nid);
>>  		if (pgdat_free_space_enough(pgdat)) {
>>  			/* workload changed, reset hot threshold */
>>  			pgdat->nbp_threshold = 0;
>> +			mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
>>  			return true;
>>  		}
>>
>> @@ -1958,8 +1960,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>>  		if (latency >= th)
>>  			return false;
>>
>> -		return !numa_promotion_rate_limit(pgdat, rate_limit,
>> -						  folio_nr_pages(folio));
>> +		return !numa_promotion_rate_limit(pgdat, rate_limit, nr);
>>  	}
>>
>>  	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
>
> ---
> Best Regards,
> Huang, Ying

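The final step of the reproducer in this message, comparing the two counters after promotion is enabled, could be automated along the following lines. This is a minimal sketch, not part of the patch: `vmstat_lookup()` and `promote_counters_inverted()` are hypothetical helpers, the input is assumed to be `name value` lines as in the grep output quoted earlier, and the caller is assumed to have read /proc/vmstat into a NUL-terminated buffer first.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Finds the line "key value" in vmstat-style text and stores value in *out.
 * Returns 0 on success, -1 if the key is missing or its value is malformed.
 */
static int vmstat_lookup(const char *text, const char *key,
			 unsigned long long *out)
{
	size_t klen;
	const char *line = text;

	assert(text && key && out);
	klen = strlen(key);
	while (line && *line) {
		/* Require a space after the key so longer names do not match. */
		if (strncmp(line, key, klen) == 0 && line[klen] == ' ')
			return sscanf(line + klen, "%llu", out) == 1 ? 0 : -1;
		line = strchr(line, '\n');
		if (line)
			line++; /* step past the newline to the next line */
	}
	return -1;
}

/*
 * Returns 1 if pgpromote_success exceeds pgpromote_candidate,
 * 0 if it does not, and -1 if either counter is absent.
 */
static int promote_counters_inverted(const char *text)
{
	unsigned long long success, candidate;

	if (vmstat_lookup(text, "pgpromote_success", &success) ||
	    vmstat_lookup(text, "pgpromote_candidate", &candidate))
		return -1;
	return success > candidate;
}
```

Keeping the parser separate from file I/O means the same helper can be fed saved vmstat snapshots taken before and after a test run.
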
* Re: [PATCH RFC] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
From: Huang, Ying @ 2025-06-24 2:46 UTC
To: Zhijian Li (Fujitsu)
Cc: linux-mm, akpm, linux-kernel, Yasunori Gotou (Fujitsu), Ingo Molnar,
    Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
    Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
    kernel test robot

"Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com> writes:

> On 20/06/2025 14:28, Huang, Ying wrote:
>> Li Zhijian <lizhijian@fujitsu.com> writes:
>>
>>> Goto-san reported confusing pgpromote statistics where
>>> the pgpromote_success count significantly exceeded pgpromote_candidate.
>>> The issue manifests under specific memory pressure conditions:
>>> when top-tier memory (DRAM) is exhausted by memhog and allocation begins
>>> in lower-tier memory (CXL). After terminating memhog, the stats show:
>>
>> The above description is confusing. The page promotion occurs when the
>> size of the top-tier free space is large enough (after killing the
>> memhog above). The accessed lower-tier memory will be promoted upon
>> accessing to take full advantage of the more expensive top-tier memory.
>
> Yeah, that's what the promotion does.
>
> Let's clarify the reproducer steps specifically (thanks Goto-san for the reproducer):
> On a system with three nodes (nodes 0-1: DRAM 4GB, node 2: NVDIMM 4GB):
>
> # Enable demotion only
> echo 1 > /sys/kernel/mm/numa/demotion_enabled
> numactl -m 0-1 memhog -r200 3500M >/dev/null &
> pid=$!
> sleep 2
> numactl memhog -r100 2500M >/dev/null &
> sleep 10
> kill -9 $pid
> # Enable promotion
> echo 2 > /proc/sys/kernel/numa_balancing
>
> # After a few seconds, we observe `pgpromote_candidate < pgpromote_success`
>
> In this scenario, after terminating the first memhog, the conditions
> for pgdat_free_space_enough() are quickly met, triggering promotion.
> However, these migrated pages are only accounted for in PGPROMOTE_SUCCESS, not in PGPROMOTE_CANDIDATE.

Yes. This is the expected behavior of the current implementation.

>
>>
>>> $ grep -e pgpromote /proc/vmstat
>>> pgpromote_success 2579
>>> pgpromote_candidate 1
>>>
>>> This update increments PGPROMOTE_CANDIDATE within the free space branch
>>> when a promotion decision is made, which may alter the mechanism of the
>>> rate limit. Consequently, it becomes easier to reach the rate limit than
>>> it was previously.
>>>
>>> For example:
>>> Rate Limit = 100 pages/sec
>>> Scenario:
>>>   T0: 90 free-space migrations
>>>   T0+100ms: 20-page migration request
>>>
>>> Before:
>>>   Rate limit is *not* reached: 0 + 20 = 20 < 100
>>>   PGPROMOTE_CANDIDATE: 20
>>> After:
>>>   Rate limit is reached: 90 + 20 = 110 > 100
>>>   PGPROMOTE_CANDIDATE: 110
>>
>> Yes. The rate limit will be influenced by the change. So, more tests
>> may be needed to verify it will not incur regressions.
>
>
> Testing this might be challenging due to workload dependencies. Do you
> have any recommended workloads for evaluation?

Some in-memory databases should be good workloads, for example, redis.

> Alternatively, could we rely on the LKP project for impact assessment?
> (The current patch has not really been tested by LKP due to a compile
> error; I will post a V2 soon.)

LKP has some basic workloads to test this, for example, pmbench with
Gauss-ih access pattern.

> However, regarding the rate limit change itself, I consider this patch
> logically correct. As stated in the numa_promotion_rate_limit()
> comment:
>> "For memory tiering mode, too high promotion/demotion throughput may hurt application latency."
> It seems there is no justification for excluding
> pgdat_free_space_enough() triggered promotions from the rate limiting
> mechanism.

In fact, we don't rate limit promotion when there is enough free space
on fast memory, so that the fast memory can be filled quickly.
I think that it's necessary to stop the fast memory from being
under-utilized ASAP.

>
>
>>
>>>
>>> Reported-by: Yasunori Gotou (Fujitsu) <y-goto@fujitsu.com>
>>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>

[snip]

---
Best Regards,
Huang, Ying
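The ordering of checks described in this reply (free space first, then hotness, then the rate limit) can be summarized in a small decision function. This is a sketch only: `struct tier_node`, `enum promote_decision` and `decide()` are hypothetical names, the fallback from a zero per-node threshold to a default threshold is an assumption made for illustration, and the rate-limit result is passed in as a plain flag instead of being computed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-node state used by the tiering check. */
struct tier_node {
	bool free_space_enough;        /* stands in for pgdat_free_space_enough() */
	unsigned int hot_threshold_ms; /* stands in for pgdat->nbp_threshold */
};

enum promote_decision {
	PROMOTE_FREE_SPACE, /* promoted without consulting the rate limit */
	PROMOTE_HOT,        /* hot enough and under the rate limit */
	SKIP_COLD,          /* access latency at or above the hot threshold */
	SKIP_RATE_LIMITED   /* hot enough, but the rate limit was reached */
};

/* Mirrors the order of checks discussed in the thread, in simplified form. */
static enum promote_decision decide(struct tier_node *n,
				    unsigned int latency_ms,
				    unsigned int default_threshold_ms,
				    bool over_rate_limit)
{
	unsigned int th;

	assert(n);
	if (n->free_space_enough) {
		/* workload changed, reset hot threshold */
		n->hot_threshold_ms = 0;
		return PROMOTE_FREE_SPACE;
	}

	/* Assumption: an unset (zero) threshold falls back to the default. */
	th = n->hot_threshold_ms ? n->hot_threshold_ms : default_threshold_ms;
	if (latency_ms >= th)
		return SKIP_COLD;

	return over_rate_limit ? SKIP_RATE_LIMITED : PROMOTE_HOT;
}
```

In this sketch the rate-limit flag is never read on the free-space path, which is the behavior the reply above describes as intentional; the RFC changes only which pages are counted, not this ordering.
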