* [Question] About the PCP free_high heuristic
From: 史嘉成 @ 2025-07-29 8:08 UTC
To: ying.huang; +Cc: linux-mm
Hi,
I ran the bw_unix benchmark in lmbench on my test machine (EPYC-7T83, 32 vCPUs,
64 GB of memory):
bin/x86_64-linux-gnu/bw_unix -P 16
The bandwidth result was 30511.63 MB/s when percpu_pagelist_high_fraction was
set to 8; however, it dropped to 21595.98 MB/s when
percpu_pagelist_high_fraction was set to 0 (enabling PCP high auto-tuning).
I first inspected the auto-tuning code, but the root cause of the performance
degradation lies in the triggering threshold of the free_high heuristic:
pcp->free_count >= (batch + pcp->high_min / 2)
I noticed that commit c544a95 increases this threshold, but pcp->high_min is
relatively small when auto-tuning is enabled, and the PCP draining leads to
the performance degradation.
The problem went away when I increased the threshold to (batch + pcp->high / 2).
Is it intended to use high_min instead of high in the threshold? Would it be
more adaptive to introduce some new tunables for the free_high threshold?
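To make the difference between the two thresholds concrete, here is a minimal user-space sketch. It is not the kernel code: the helper names are hypothetical, and "limit" simply stands for whichever of pcp->high_min or pcp->high is plugged into the expression above.

```c
#include <stdbool.h>

/*
 * Hypothetical user-space model of the threshold expression quoted above.
 * "limit" is either pcp->high_min (the current check) or pcp->high (the
 * variant I tested). This is an illustration, not the kernel implementation.
 */
static int free_high_threshold(int batch, int limit)
{
    /* Number of recently freed pages at which the check starts to fire. */
    return batch + limit / 2;
}

static bool free_high_threshold_met(int free_count, int batch, int limit)
{
    return free_count >= free_high_threshold(batch, limit);
}
```

With made-up values of batch = 63, high_min = 128 and high = 1024 (illustrative only, not measured on my machine), the thresholds come out as 127 and 575 pages, so a burst of 200 frees would trip the check only in the high_min case.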
Best,
Shi, Jiacheng
* Re: [Question] About the PCP free_high heuristic
From: Huang, Ying @ 2025-07-29 9:59 UTC
To: 史嘉成; +Cc: linux-mm
Hi, Jiacheng,
史嘉成 <billsjc@sjtu.edu.cn> writes:
> Hi,
>
> I ran the bw_unix benchmark in lmbench on my test machine (EPYC-7T83, 32 vCPUs,
> 64 GB of memory):
> bin/x86_64-linux-gnu/bw_unix -P 16
> The bandwidth result was 30511.63 MB/s when percpu_pagelist_high_fraction was
> set to 8; however, it dropped to 21595.98 MB/s when
> percpu_pagelist_high_fraction was set to 0 (enabling PCP high auto-tuning).
>
> I first inspected the auto-tuning code, but the root cause of the performance
> degradation lies in the triggering threshold of the free_high heuristic:
> pcp->free_count >= (batch + pcp->high_min / 2)
The free_high heuristic is meant to improve last-level (shared) cache
hotness by letting one core allocate cache-hot pages that another core
has just freed. The target use case is network workloads.
It appears that the free_high heuristic hurts your performance. One
possible reason is that the last-level cache isn't always shared on
AMD CPUs. Can you try binding the workload to one CCX to verify whether
this is the root cause?
> I noticed that commit c544a95 increases this threshold, but pcp->high_min is
> relatively small when auto-tuning is enabled, and the PCP draining leads to
> the performance degradation.
>
> The problem went away when I increased the threshold to (batch + pcp->high / 2).
> Is it intended to use high_min instead of high in the threshold? Would it be
> more adaptive to introduce some new tunables for the free_high threshold?
In general, new knobs aren't welcomed by the community, because it's
already hard for users to tune so many knobs.
---
Best Regards,
Huang, Ying
* Re: [Question] About the PCP free_high heuristic
From: Shi, Jiacheng @ 2025-07-29 11:29 UTC
To: Huang, Ying; +Cc: linux-mm
Hi, Huang, Ying,
You are right. Using high_min is better when the workload is located on a single CCX.
By the way, I'm wondering why the free_high heuristic is only applied to
high-order pages. Would there also be cache misses if cache-hot order-0 pages
are not reused?
Best,
Jiacheng
* Re: [Question] About the PCP free_high heuristic
From: Huang, Ying @ 2025-07-30 1:26 UTC
To: Shi, Jiacheng; +Cc: linux-mm
"Shi, Jiacheng" <billsjc@sjtu.edu.cn> writes:
> Hi, Huang, Ying,
>
> You are right. Using high_min is better when the workload is located on a single CCX.
>
> By the way, I'm wondering why the free_high heuristic is only applied to
> high-order pages. Would there also be cache misses if cache-hot order-0 pages
> are not reused?
The heuristic is mainly for network workloads, which use high-order
pages. You can use `git blame` to try to find the commit that
introduced the heuristic, though that is not trivial work.
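As a rough illustration of the order restriction: as far as I recall, the check is gated on the page order being non-zero and no larger than PAGE_ALLOC_COSTLY_ORDER (3). Please treat that as an assumption and verify it against your tree. The helper below is a hypothetical user-space model with a made-up name, not a copy of mm/page_alloc.c.

```c
#include <stdbool.h>

/* Assumed to match the kernel constant of the same name; verify in your tree. */
#define PAGE_ALLOC_COSTLY_ORDER 3

/*
 * Hypothetical helper: the free_high check is only considered for
 * high-order pages up to the costly order; order-0 frees skip it.
 */
static bool free_high_applies(unsigned int order)
{
    return order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER;
}
```

Under this model, an order-0 free never reaches the threshold test at all, whatever the value of pcp->free_count.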
---
Best Regards,
Huang, Ying
* Re: [Question] About the PCP free_high heuristic
From: 史嘉成 @ 2025-07-30 1:33 UTC
To: Huang, Ying; +Cc: linux-mm
Got it. It's commit f26b3fa. I have read it.
Thanks a lot for your reply!
Best Regards,
Jiacheng
----- Original Message -----
From: "Huang, Ying" <ying.huang@linux.alibaba.com>
To: "Shi, Jiacheng" <billsjc@sjtu.edu.cn>
Cc: linux-mm@kvack.org
Sent: Wednesday, July 30, 2025 9:26:32 AM
Subject: Re: [Question] About the PCP free_high heuristic
"Shi, Jiacheng" <billsjc@sjtu.edu.cn> writes:
> Hi, Huang, Ying,
>
> You are right. Using high_min is better when the workload is located on a single CCX.
>
> By the way, I'm wondering why the free_high heuristic is only applied to
> high-order pages. Would there also be cache misses if cache-hot order-0 pages
> are not reused?
The heuristic is mainly for network workloads, which use high-order
pages. You can use `git blame` to try to find the commit that
introduced the heuristic, though that is not trivial work.
---
Best Regards,
Huang, Ying