* [PATCH linux-mm v2] mm: make pcp_decay_high working better with NOHZ full
@ 2024-10-18 7:57 mengensun88
2024-10-21 8:40 ` Huang, Ying
0 siblings, 1 reply; 4+ messages in thread
From: mengensun88 @ 2024-10-18 7:57 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, linux-kernel, alexjlzheng, MengEn Sun
From: MengEn Sun <mengensun@tencent.com>
When a cpu entring NOHZ full, quiet_vmstat may flush percpu
zonestats and nodestats.
The vmstat_shepherd only check percpu zonestats and nodestats
to determine whether it is necessary to fire vmstat_update on
the target cpu for now.
If a process on a certain CPU allocates a large amount of memory,
then frees that memory, and subsequently the CPU enters NOHZ, and
the process not freeing and allocating memory anymore,the
vmstat_update not being executed on the cpu. Because
vmstat_shepherd may not see zonestats and nodestats of the cpu
changed, so may resulting in vmstat_update on the cpu not fired
for a long time.
While, This seems to be fine:
- if freeing and allocating memory occur later, it may the
high_max may be adjust automatically
- If memory is tight, the memory reclamation process will
release the pcp
Whatever, we make vmstat_shepherd to checking whether we need
decay pcp high_max, and fire pcp_decay_high early if we need.
Fixes: 51a755c56dc0 ("mm: tune PCP high automatically")
Reviewed-by: Jinliang Zheng <alexjlzheng@tencent.com>
Signed-off-by: MengEn Sun <mengensun@tencent.com>
---
changelog:
v1: https://lore.kernel.org/lkml/20241012154328.015f57635566485ad60712f3@linux-foundation.org/T/#t
v2: Make the commit message clearer by adding some comments.
---
mm/vmstat.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1917c034c045..07b494b06872 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -2024,8 +2024,17 @@ static bool need_update(int cpu)
for_each_populated_zone(zone) {
struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
+ struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
struct per_cpu_nodestat *n;
+ /* per_cpu_nodestats and per_cpu_zonestats maybe flush when cpu
+ * entering NOHZ full, see quiet_vmstat. so, we check pcp
+ * high_{min,max} to determine whether it is necessary to run
+ * decay_pcp_high on the corresponding CPU
+ */
+ if (pcp->high_max > pcp->high_min)
+ return true;
+
/*
* The fast way of checking if there are any vmstat diffs.
*/
--
2.43.5
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH linux-mm v2] mm: make pcp_decay_high working better with NOHZ full
2024-10-18 7:57 [PATCH linux-mm v2] mm: make pcp_decay_high working better with NOHZ full mengensun88
@ 2024-10-21 8:40 ` Huang, Ying
2024-10-22 5:14 ` MengEn Sun
0 siblings, 1 reply; 4+ messages in thread
From: Huang, Ying @ 2024-10-21 8:40 UTC (permalink / raw)
To: mengensun88; +Cc: akpm, linux-mm, linux-kernel, alexjlzheng, MengEn Sun
Hi, Mengen,
mengensun88@gmail.com writes:
> From: MengEn Sun <mengensun@tencent.com>
>
> When a cpu entring NOHZ full, quiet_vmstat may flush percpu
> zonestats and nodestats.
>
> The vmstat_shepherd only check percpu zonestats and nodestats
> to determine whether it is necessary to fire vmstat_update on
> the target cpu for now.
>
> If a process on a certain CPU allocates a large amount of memory,
> then frees that memory, and subsequently the CPU enters NOHZ, and
> the process not freeing and allocating memory anymore,the
> vmstat_update not being executed on the cpu. Because
> vmstat_shepherd may not see zonestats and nodestats of the cpu
> changed, so may resulting in vmstat_update on the cpu not fired
> for a long time.
The issue description is too long, can you make it a little shorter?
And can you correct your grammar with some tool? Something like chatgpt
is good for that, e.g., "fix the grammar of the following text: ...".
To make variable and function name distinct, I personally prefer to add
"()" after the function name.
Have verified the issue with some test? If not, I suggest you to do
that.
> While, This seems to be fine:
> - if freeing and allocating memory occur later, it may the
> high_max may be adjust automatically
> - If memory is tight, the memory reclamation process will
> release the pcp
This could be a real issue for me.
> Whatever, we make vmstat_shepherd to checking whether we need
> decay pcp high_max, and fire pcp_decay_high early if we need.
>
> Fixes: 51a755c56dc0 ("mm: tune PCP high automatically")
> Reviewed-by: Jinliang Zheng <alexjlzheng@tencent.com>
> Signed-off-by: MengEn Sun <mengensun@tencent.com>
> ---
> changelog:
> v1: https://lore.kernel.org/lkml/20241012154328.015f57635566485ad60712f3@linux-foundation.org/T/#t
> v2: Make the commit message clearer by adding some comments.
> ---
> mm/vmstat.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 1917c034c045..07b494b06872 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -2024,8 +2024,17 @@ static bool need_update(int cpu)
>
> for_each_populated_zone(zone) {
> struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
> + struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
> struct per_cpu_nodestat *n;
>
> + /* per_cpu_nodestats and per_cpu_zonestats maybe flush when cpu
> + * entering NOHZ full, see quiet_vmstat. so, we check pcp
> + * high_{min,max} to determine whether it is necessary to run
> + * decay_pcp_high on the corresponding CPU
> + */
Please follow the comments coding style.
/*
* comments line 1
* comments line 2
*/
> + if (pcp->high_max > pcp->high_min)
> + return true;
> +
We don't tune pcp->high_max/min in fact. Instead, we tune pcp->high.
Your code may make need_update() return true in most cases.
> /*
> * The fast way of checking if there are any vmstat diffs.
> */
--
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH linux-mm v2] mm: make pcp_decay_high working better with NOHZ full
2024-10-21 8:40 ` Huang, Ying
@ 2024-10-22 5:14 ` MengEn Sun
2024-10-22 6:36 ` Huang, Ying
0 siblings, 1 reply; 4+ messages in thread
From: MengEn Sun @ 2024-10-22 5:14 UTC (permalink / raw)
To: ying.huang
Cc: akpm, alexjlzheng, linux-kernel, linux-mm, mengensun88, mengensun
Thank you for your suggestion. I understand and am ready to make
some changes
>
> Have verified the issue with some test? If not, I suggest you to do
> that.
>
I have conducted tests:
Applying this patch or not does not have a significant impact on the test results.
perhaps my testing was not thorough enough. #^_^
But, the logic of the code is like following:
CPU0 CPUx
---- -----
T0: vmstat_work is pending
T1: vmstat_shepherd
check vmstat_work
and do nothing
T2: vmstat_work is in unpending state.
T3: alloc many pages
T4: free all the pages allocated at T3
T5: entry NOHZ, flushing all zonestats
and nodestats
T6: next vmstat_shepherd fired
In my opinion, there are indeed some issues. I'm not sure if there's
something I haven't understood?
By the way,
There are two other questions for me:
Q1:
Vmstat_work is a **deferreable work** So, It may be delayed for a long time
by NOHZ. As a result, "vmstat_update() may not be executed once every
second in the above scenario. Therefore, I'm not sure if using a deferrable
work to reduce pcp->high is appropriate. In my tests, if I don't use
deferrable work, it takes about a minute to reduce high to high_min, but
using deferrable work may take several minutes to reduce high to high_min.
Q2:
On a big machine, for example, with 1TB of memory, the default maximum
memory on PCP can be 1TB * 0.125.
This portion of memory is not accounted for in MemFree in /proc/meminfo.
Users can see this portion of memory from /proc/zoneinfo, but the memory
reported by the `free` command is reduced.
can we include the PCP memory in the MemFree statistic in /proc/meminfo?
> > While, This seems to be fine:
> > - if freeing and allocating memory occur later, it may the
> > high_max may be adjust automatically
> > - If memory is tight, the memory reclamation process will
> > release the pcp
>
> This could be a real issue for me.
Thanks, I will test more carefully for those issue
>
> > Whatever, we make vmstat_shepherd to checking whether we need
> > decay pcp high_max, and fire pcp_decay_high early if we need.
> >
> > Fixes: 51a755c56dc0 ("mm: tune PCP high automatically")
> > Reviewed-by: Jinliang Zheng <alexjlzheng@tencent.com>
> > Signed-off-by: MengEn Sun <mengensun@tencent.com>
> > ---
> > changelog:
> > v1: https://lore.kernel.org/lkml/20241012154328.015f57635566485ad60712f3@linux-foundation.org/T/#t
> > v2: Make the commit message clearer by adding some comments.
> > ---
> > mm/vmstat.c | 9 +++++++++
> > 1 file changed, 9 insertions(+)
> >
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 1917c034c045..07b494b06872 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -2024,8 +2024,17 @@ static bool need_update(int cpu)
> >
> > for_each_populated_zone(zone) {
> > struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
> > + struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
> > struct per_cpu_nodestat *n;
> >
> > + /* per_cpu_nodestats and per_cpu_zonestats maybe flush when cpu
> > + * entering NOHZ full, see quiet_vmstat. so, we check pcp
> > + * high_{min,max} to determine whether it is necessary to run
> > + * decay_pcp_high on the corresponding CPU
> > + */
>
> Please follow the comments coding style.
>
> /*
> * comments line 1
> * comments line 2
> */
>
Thank you for your suggestion. I understand and am ready to make
some changes
> > + if (pcp->high_max > pcp->high_min)
> > + return true;
> > +
>
> We don't tune pcp->high_max/min in fact. Instead, we tune pcp->high.
> Your code may make need_update() return true in most cases.
You are right, using high_max is incorrect. May i use pcp->high > pcp->high_min?
>
> > /*
> > * The fast way of checking if there are any vmstat diffs.
> > */
>
> --
> Best Regards,
> Huang, Ying
Best Regards,
MengEn, Sun
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH linux-mm v2] mm: make pcp_decay_high working better with NOHZ full
2024-10-22 5:14 ` MengEn Sun
@ 2024-10-22 6:36 ` Huang, Ying
0 siblings, 0 replies; 4+ messages in thread
From: Huang, Ying @ 2024-10-22 6:36 UTC (permalink / raw)
To: MengEn Sun; +Cc: akpm, alexjlzheng, linux-kernel, linux-mm, mengensun
MengEn Sun <mengensun88@gmail.com> writes:
> Thank you for your suggestion. I understand and am ready to make
> some changes
>
>>
>> Have verified the issue with some test? If not, I suggest you to do
>> that.
>>
>
> I have conducted tests:
> Applying this patch or not does not have a significant impact on the
> test results.
I don't expect some measurable performance difference with the patch.
If we can observe that the PCP size isn't tuned down to high_min before
and is after, that should be a valid test result to show the value of
the patch. Can you try that? The PCP size can be observed via
/proc/zoneinfo.
> perhaps my testing was not thorough enough. #^_^
>
> But, the logic of the code is like following:
> CPU0 CPUx
> ---- -----
> T0: vmstat_work is pending
> T1: vmstat_shepherd
> check vmstat_work
> and do nothing
> T2: vmstat_work is in unpending state.
>
> T3: alloc many pages
> T4: free all the pages allocated at T3
> T5: entry NOHZ, flushing all zonestats
> and nodestats
> T6: next vmstat_shepherd fired
>
> In my opinion, there are indeed some issues. I'm not sure if there's
> something I haven't understood?
>
>
> By the way,
> There are two other questions for me:
> Q1:
> Vmstat_work is a **deferreable work** So, It may be delayed for a long time
> by NOHZ. As a result, "vmstat_update() may not be executed once every
> second in the above scenario. Therefore, I'm not sure if using a deferrable
> work to reduce pcp->high is appropriate. In my tests, if I don't use
> deferrable work, it takes about a minute to reduce high to high_min, but
> using deferrable work may take several minutes to reduce high to high_min.
It's not a big issue to take longer time to decay pcp->high.
> Q2:
> On a big machine, for example, with 1TB of memory, the default maximum
> memory on PCP can be 1TB * 0.125.
> This portion of memory is not accounted for in MemFree in /proc/meminfo.
> Users can see this portion of memory from /proc/zoneinfo, but the memory
> reported by the `free` command is reduced.
> can we include the PCP memory in the MemFree statistic in /proc/meminfo?
This has been discussed before.
https://lore.kernel.org/linux-mm/20220816084426.135528-1-wangkefeng.wang@huawei.com/
https://lore.kernel.org/linux-mm/20240830014453.3070909-1-mawupeng1@huawei.com/
>> > While, This seems to be fine:
>> > - if freeing and allocating memory occur later, it may the
>> > high_max may be adjust automatically
>> > - If memory is tight, the memory reclamation process will
>> > release the pcp
>>
>> This could be a real issue for me.
>
> Thanks, I will test more carefully for those issue
>
>>
>> > Whatever, we make vmstat_shepherd to checking whether we need
>> > decay pcp high_max, and fire pcp_decay_high early if we need.
>> >
>> > Fixes: 51a755c56dc0 ("mm: tune PCP high automatically")
>> > Reviewed-by: Jinliang Zheng <alexjlzheng@tencent.com>
>> > Signed-off-by: MengEn Sun <mengensun@tencent.com>
>> > ---
>> > changelog:
>> > v1: https://lore.kernel.org/lkml/20241012154328.015f57635566485ad60712f3@linux-foundation.org/T/#t
>> > v2: Make the commit message clearer by adding some comments.
>> > ---
>> > mm/vmstat.c | 9 +++++++++
>> > 1 file changed, 9 insertions(+)
>> >
>> > diff --git a/mm/vmstat.c b/mm/vmstat.c
>> > index 1917c034c045..07b494b06872 100644
>> > --- a/mm/vmstat.c
>> > +++ b/mm/vmstat.c
>> > @@ -2024,8 +2024,17 @@ static bool need_update(int cpu)
>> >
>> > for_each_populated_zone(zone) {
>> > struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
>> > + struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
>> > struct per_cpu_nodestat *n;
>> >
>> > + /* per_cpu_nodestats and per_cpu_zonestats maybe flush when cpu
>> > + * entering NOHZ full, see quiet_vmstat. so, we check pcp
>> > + * high_{min,max} to determine whether it is necessary to run
>> > + * decay_pcp_high on the corresponding CPU
>> > + */
>>
>> Please follow the comments coding style.
>>
>> /*
>> * comments line 1
>> * comments line 2
>> */
>>
>
> Thank you for your suggestion. I understand and am ready to make
> some changes
>
>> > + if (pcp->high_max > pcp->high_min)
>> > + return true;
>> > +
>>
>> We don't tune pcp->high_max/min in fact. Instead, we tune pcp->high.
>> Your code may make need_update() return true in most cases.
>
> You are right, using high_max is incorrect. May i use pcp->high > pcp->high_min?
>
>>
>> > /*
>> > * The fast way of checking if there are any vmstat diffs.
>> > */
--
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2024-10-22 6:40 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-18 7:57 [PATCH linux-mm v2] mm: make pcp_decay_high working better with NOHZ full mengensun88
2024-10-21 8:40 ` Huang, Ying
2024-10-22 5:14 ` MengEn Sun
2024-10-22 6:36 ` Huang, Ying
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox