* [FIX PATCH] mm: pcp: fix pcp->free_count reduction on page allocation
@ 2025-01-07 9:17 Nikhil Dhama
2025-01-08 5:05 ` Andrew Morton
0 siblings, 1 reply; 12+ messages in thread
From: Nikhil Dhama @ 2025-01-07 9:17 UTC (permalink / raw)
To: akpm
Cc: Nikhil Dhama, Ying Huang, linux-mm, linux-kernel, Bharata B Rao,
Raghavendra
In the current PCP auto-tuning design, free_count was introduced to track
consecutive page freeing with a counter. This counter is incremented by
the exact number of pages that are freed, but halved on allocation. This
causes the network bandwidth of a 2-node iperf3 client-to-server run to
drop by 30% when we scale the number of client-server pairs from 32
(where we achieved peak network bandwidth) to 64.
To fix this issue, on allocation, reduce free_count by the exact number
of pages that are allocated instead of halving it.
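As a rough illustration of the difference (a standalone userspace sketch
with simplified names, not the kernel code), a short burst of order-0
allocations collapses the counter under halving but barely moves it under
exact subtraction:

#include <stdio.h>

struct pcp_model {
	long free_count;		/* models pcp->free_count */
};

/* current behaviour: halve on every allocation */
static void alloc_decay_halve(struct pcp_model *pcp)
{
	pcp->free_count >>= 1;
}

/* proposed behaviour: subtract exactly the pages allocated */
static void alloc_decay_exact(struct pcp_model *pcp, unsigned int order)
{
	pcp->free_count -= 1L << order;	/* can go negative without a clamp */
}

int main(void)
{
	struct pcp_model a = { .free_count = 512 }, b = { .free_count = 512 };
	int i;

	for (i = 0; i < 8; i++) {	/* eight order-0 allocations */
		alloc_decay_halve(&a);
		alloc_decay_exact(&b, 0);
	}
	printf("halving: %ld, exact subtraction: %ld\n",
	       a.free_count, b.free_count);	/* prints 2 and 504 */
	return 0;
}

Note that, without a clamp, the exact subtraction can drive the counter
negative.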
On a 2-node AMD server, with one node running the iperf3 clients and the
other the iperf3 server, this patch restores the lost performance.
Fixes: 6ccdcb6d3a74 ("mm, pcp: reduce detecting time of consecutive high order page freeing")
Signed-off-by: Nikhil Dhama <nikhil.dhama@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ying Huang <huang.ying.caritas@gmail.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: Bharata B Rao <bharata@amd.com>
Cc: Raghavendra <raghavendra.kodsarathimmappa@amd.com>
---
mm/page_alloc.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cae7b93864c2..e2a8ec5584f8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3037,10 +3037,10 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
/*
* On allocation, reduce the number of pages that are batch freed.
- * See nr_pcp_free() where free_factor is increased for subsequent
+ * See free_unref_page_commit() where free_count is increased for subsequent
* frees.
*/
- pcp->free_count >>= 1;
+ pcp->free_count -= (1 << order);
list = &pcp->lists[order_to_pindex(migratetype, order)];
page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
pcp_spin_unlock(pcp);
--
2.25.1
* Re: [FIX PATCH] mm: pcp: fix pcp->free_count reduction on page allocation
2025-01-07 9:17 [FIX PATCH] mm: pcp: fix pcp->free_count reduction on page allocation Nikhil Dhama
@ 2025-01-08 5:05 ` Andrew Morton
2025-01-09 11:42 ` Nikhil Dhama
2025-01-15 11:19 ` [FIX PATCH] mm: pcp: fix pcp->free_count reduction on page allocation, Huang, Ying
0 siblings, 2 replies; 12+ messages in thread
From: Andrew Morton @ 2025-01-08 5:05 UTC (permalink / raw)
To: Nikhil Dhama
Cc: Ying Huang, linux-mm, linux-kernel, Bharata B Rao, Raghavendra
On Tue, 7 Jan 2025 14:47:24 +0530 Nikhil Dhama <nikhil.dhama@amd.com> wrote:
> In current PCP auto-tuning desgin, free_count was introduced to track
> the consecutive page freeing with a counter, This counter is incremented
> by the exact amount of pages that are freed, but reduced by half on
> allocation. This is causing a 2-node iperf3 client to server's network
> bandwidth to drop by 30% if we scale number of client-server pairs from 32
> (where we achieved peak network bandwidth) to 64.
>
> To fix this issue, on allocation, reduce free_count by the exact number
> of pages that are allocated instead of halving it.
The present division by two appears to be somewhat randomly chosen.
And as far as I can tell, this patch proposes replacing that with
another somewhat random adjustment.
What's the actual design here? What are we attempting to do and why,
and why is the proposed design superior to the present one?
> On a 2-node AMD server, one running iperf3 clients and other iperf3
> sever, This patch restores the performance drop.
Nice, but might other workloads on other machines get slower?
* Re: [FIX PATCH] mm: pcp: fix pcp->free_count reduction on page allocation
2025-01-08 5:05 ` Andrew Morton
@ 2025-01-09 11:42 ` Nikhil Dhama
2025-01-15 11:06 ` Huang, Ying
2025-01-15 11:19 ` [FIX PATCH] mm: pcp: fix pcp->free_count reduction on page allocation, Huang, Ying
1 sibling, 1 reply; 12+ messages in thread
From: Nikhil Dhama @ 2025-01-09 11:42 UTC (permalink / raw)
To: akpm
Cc: raghavendra.kodsarathimmappa, bharata, huang.ying.caritas,
linux-mm, linux-kernel, subramaniam.kv, santosh.shukla, shivankg,
Nikhil Dhama
> The present division by two appears to be somewhat randomly chosen.
> And as far as I can tell, this patch proposes replacing that with
> another somewhat random adjustment.
>
> What's the actual design here? What are we attempting to do and why,
> and why is the proposed design superior to the present one?
We are further analyzing both designs and their impact on the pcp list.
> Nice, but might other workloads on other machines get slower?
We are studying the impact of this on other network workloads like netperf,
and will post those results soon.
* Re: [FIX PATCH] mm: pcp: fix pcp->free_count reduction on page allocation
2025-01-09 11:42 ` Nikhil Dhama
@ 2025-01-15 11:06 ` Huang, Ying
0 siblings, 0 replies; 12+ messages in thread
From: Huang, Ying @ 2025-01-15 11:06 UTC (permalink / raw)
To: Nikhil Dhama
Cc: akpm, raghavendra.kodsarathimmappa, bharata, huang.ying.caritas,
linux-mm, linux-kernel, subramaniam.kv, santosh.shukla, shivankg
Nikhil Dhama <nikhil.dhama@amd.com> writes:
>> The present division by two appears to be somewhat randomly chosen.
>> And as far as I can tell, this patch proposes replacing that with
>> another somewhat random adjustment.
>>
>> What's the actual design here? What are we attempting to do and why,
>> and why is the proposed design superior to the present one?
>
> We are further analyzing both the designs and their impact on pcp list.
>
>
>> Nice, but might other workloads on other machines get slower?
>
> We are studying the impact of this on other network workloads like netperf,
> and will post those results soon.
To tune this parameter, in addition to network workloads, other kinds of
workloads need to be evaluated too.
---
Best Regards,
Huang, Ying
* Re: [FIX PATCH] mm: pcp: fix pcp->free_count reduction on page allocation,
2025-01-08 5:05 ` Andrew Morton
2025-01-09 11:42 ` Nikhil Dhama
@ 2025-01-15 11:19 ` Huang, Ying
2025-01-29 4:31 ` Andrew Morton
1 sibling, 1 reply; 12+ messages in thread
From: Huang, Ying @ 2025-01-15 11:19 UTC (permalink / raw)
To: Andrew Morton
Cc: Nikhil Dhama, Ying Huang, linux-mm, linux-kernel, Bharata B Rao,
Raghavendra, Mel Gorman
Andrew Morton <akpm@linux-foundation.org> writes:
> On Tue, 7 Jan 2025 14:47:24 +0530 Nikhil Dhama <nikhil.dhama@amd.com> wrote:
>
>> In current PCP auto-tuning desgin, free_count was introduced to track
>> the consecutive page freeing with a counter, This counter is incremented
>> by the exact amount of pages that are freed, but reduced by half on
>> allocation. This is causing a 2-node iperf3 client to server's network
>> bandwidth to drop by 30% if we scale number of client-server pairs from 32
>> (where we achieved peak network bandwidth) to 64.
>>
>> To fix this issue, on allocation, reduce free_count by the exact number
>> of pages that are allocated instead of halving it.
>
> The present division by two appears to be somewhat randomly chosen.
> And as far as I can tell, this patch proposes replacing that with
> another somewhat random adjustment.
>
> What's the actual design here? What are we attempting to do and why,
> and why is the proposed design superior to the present one?
Cc Mel for the original design.
IIUC, pcp->free_count is used to identify a consecutive, pure, large-volume
page freeing pattern. For that pattern, a larger batch will be used to free
pages from the PCP to the buddy allocator to improve performance. A mixed
free/allocation pattern should not make pcp->free_count large, even if the
number of pages freed is much larger than the number of pages allocated in
the long run. So, pcp->free_count decreases rapidly on page allocation.
Hi, Mel, please correct me if my understanding isn't correct.
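As a rough sketch of that understanding (an illustrative userspace model
only; the identifiers and the 4x multiplier are placeholders, not the
kernel's actual scaling), consecutive frees build free_count up and enable
a larger flush batch, while interleaved allocations decay it quickly:

#include <stdio.h>

struct pcp_model {
	long free_count;
	long batch;			/* baseline flush batch */
};

static void on_free(struct pcp_model *pcp, long nr_pages)
{
	pcp->free_count += nr_pages;	/* consecutive frees build it up */
}

static void on_alloc(struct pcp_model *pcp)
{
	pcp->free_count >>= 1;		/* any allocation decays it quickly */
}

/* use a larger flush batch only once a sustained freeing run is detected */
static long flush_batch(const struct pcp_model *pcp)
{
	return pcp->free_count >= pcp->batch ? pcp->batch * 4 : pcp->batch;
}

int main(void)
{
	struct pcp_model pcp = { .free_count = 0, .batch = 64 };
	int i;

	for (i = 0; i < 16; i++)
		on_free(&pcp, 8);	/* pure freeing pattern */
	printf("pure freeing : batch %ld\n", flush_batch(&pcp));	/* 256 */

	for (i = 0; i < 4; i++)
		on_alloc(&pcp);		/* a few interleaved allocations */
	printf("after allocs : batch %ld\n", flush_batch(&pcp));	/* 64 */
	return 0;
}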
>> On a 2-node AMD server, one running iperf3 clients and other iperf3
>> sever, This patch restores the performance drop.
>
> Nice, but might other workloads on other machines get slower?
---
Best Regards,
Huang, Ying
* Re: [FIX PATCH] mm: pcp: fix pcp->free_count reduction on page allocation,
2025-01-15 11:19 ` [FIX PATCH] mm: pcp: fix pcp->free_count reduction on page allocation, Huang, Ying
@ 2025-01-29 4:31 ` Andrew Morton
2025-02-12 5:04 ` [FIX PATCH] mm: pcp: fix pcp->free_count reduction on page allocation Nikhil Dhama
0 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2025-01-29 4:31 UTC (permalink / raw)
To: Huang, Ying
Cc: Nikhil Dhama, Ying Huang, linux-mm, linux-kernel, Bharata B Rao,
Raghavendra, Mel Gorman
On Wed, 15 Jan 2025 19:19:02 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:
> Andrew Morton <akpm@linux-foundation.org> writes:
>
> > On Tue, 7 Jan 2025 14:47:24 +0530 Nikhil Dhama <nikhil.dhama@amd.com> wrote:
> >
> >> In current PCP auto-tuning desgin, free_count was introduced to track
> >> the consecutive page freeing with a counter, This counter is incremented
> >> by the exact amount of pages that are freed, but reduced by half on
> >> allocation. This is causing a 2-node iperf3 client to server's network
> >> bandwidth to drop by 30% if we scale number of client-server pairs from 32
> >> (where we achieved peak network bandwidth) to 64.
> >>
> >> To fix this issue, on allocation, reduce free_count by the exact number
> >> of pages that are allocated instead of halving it.
> >
> > The present division by two appears to be somewhat randomly chosen.
> > And as far as I can tell, this patch proposes replacing that with
> > another somewhat random adjustment.
> >
> > What's the actual design here? What are we attempting to do and why,
> > and why is the proposed design superior to the present one?
>
> Cc Mel for the original design.
>
> IIUC, pcp->free_count is used to identify the consecutive, pure, large
> number of page freeing pattern. For that pattern, larger batch will be
> used to free pages from PCP to buddy to improve the performance. Mixed
> free/allocation pattern should not make pcp->free_count large, even if
> the number of the pages freed is much larger than that of the pages
> allocated in the long run. So, pcp->free_count decreases rapidly for
> the page allocation.
>
> Hi, Mel, please correct me if my understanding isn't correct.
>
hm, no Mel.
Nikhil, please do continue to work on this - it seems that there will
be a significant benefit to retuning this.
* Re: [FIX PATCH] mm: pcp: fix pcp->free_count reduction on page allocation
2025-01-29 4:31 ` Andrew Morton
@ 2025-02-12 5:04 ` Nikhil Dhama
2025-02-12 8:40 ` Huang, Ying
0 siblings, 1 reply; 12+ messages in thread
From: Nikhil Dhama @ 2025-02-12 5:04 UTC (permalink / raw)
To: akpm
Cc: bharata, huang.ying.caritas, linux-kernel, linux-mm, mgorman,
nikhil.dhama, raghavendra.kodsarathimmappa, ying.huang
On 1/29/2025 10:01 AM, Andrew Morton wrote:
>
> On Wed, 15 Jan 2025 19:19:02 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:
>
>> Andrew Morton <akpm@linux-foundation.org> writes:
>>
>>> On Tue, 7 Jan 2025 14:47:24 +0530 Nikhil Dhama <nikhil.dhama@amd.com> wrote:
>>>
>>>> In current PCP auto-tuning desgin, free_count was introduced to track
>>>> the consecutive page freeing with a counter, This counter is incremented
>>>> by the exact amount of pages that are freed, but reduced by half on
>>>> allocation. This is causing a 2-node iperf3 client to server's network
>>>> bandwidth to drop by 30% if we scale number of client-server pairs from 32
>>>> (where we achieved peak network bandwidth) to 64.
>>>>
>>>> To fix this issue, on allocation, reduce free_count by the exact number
>>>> of pages that are allocated instead of halving it.
>>> The present division by two appears to be somewhat randomly chosen.
>>> And as far as I can tell, this patch proposes replacing that with
>>> another somewhat random adjustment.
>>>
>>> What's the actual design here? What are we attempting to do and why,
>>> and why is the proposed design superior to the present one?
>> Cc Mel for the original design.
>>
>> IIUC, pcp->free_count is used to identify the consecutive, pure, large
>> number of page freeing pattern. For that pattern, larger batch will be
>> used to free pages from PCP to buddy to improve the performance. Mixed
>> free/allocation pattern should not make pcp->free_count large, even if
>> the number of the pages freed is much larger than that of the pages
>> allocated in the long run. So, pcp->free_count decreases rapidly for
>> the page allocation.
>>
>> Hi, Mel, please correct me if my understanding isn't correct.
>>
> hm, no Mel.
>
> Nikhil, please do continue to work on this - it seems that there will
> be a significant benefit to retuning this.
Hi Andrew,
I have analyzed the performance of different memory-sensitive workloads for these
two different ways to decrement pcp->free_count. I compared the score amongst
v6.6 mainline, v6.7 mainline and v6.7 with our patch.
For all the benchmarks, I used a 2-socket AMD server with 382 logical CPUs.
Results I got are as follows:
All scores are normalized with respect to v6.6 (base).
For all the benchmarks below (iperf3, lmbench3 unix, netperf, redis, gups, xsbench),
a higher score is better.
                     iperf3   lmbench3 Unix    1-node netperf       2-node netperf
                              (AF_UNIX)        (SCTP_STREAM_MANY)   (SCTP_STREAM_MANY)
                     ------   --------------   ------------------   ------------------
v6.6 (base)           100       100              100                  100
v6.7                   69       113.2             99                   98.59
v6.7 with my patch    100       112.1            100.3                101.16
                     redis standard   redis core   redis L3 Heavy   Gups   xsbench
                     --------------   ----------   --------------   ----   -------
v6.6 (base)             100             100            100           100     100
v6.7                     99.45          101.66          99.47        100      98.14
v6.7 with my patch       99.76          101.12          99.75        100      99.56
and for graph500, hashjoin, pagerank and Kbuild, a lower score is better.
                     graph500   hashjoin       hashjoin      pagerank   Kbuild
                                (THP always)   (THP never)
                     --------   ------------   -----------   --------   ------
v6.6 (base)            100        100            100           100       100
v6.7                   101.08     101.3          101.9         100        98.8
v6.7 with my patch      99.73     100            101.66        100        99.6
From these results I can conclude that this patch performs better than,
or as well as, base v6.7 on almost all of these workloads.
* Re: [FIX PATCH] mm: pcp: fix pcp->free_count reduction on page allocation
2025-02-12 5:04 ` [FIX PATCH] mm: pcp: fix pcp->free_count reduction on page allocation Nikhil Dhama
@ 2025-02-12 8:40 ` Huang, Ying
2025-02-12 10:06 ` Nikhil Dhama
2025-03-19 8:14 ` [PATCH -V2] mm: pcp: scale batch to reduce number of high order pcp flushes on deallocation Nikhil Dhama
0 siblings, 2 replies; 12+ messages in thread
From: Huang, Ying @ 2025-02-12 8:40 UTC (permalink / raw)
To: Nikhil Dhama
Cc: akpm, bharata, huang.ying.caritas, linux-kernel, linux-mm,
mgorman, raghavendra.kodsarathimmappa
Nikhil Dhama <nikhil.dhama@amd.com> writes:
> On 1/29/2025 10:01 AM, Andrew Morton wrote:
>>
>> On Wed, 15 Jan 2025 19:19:02 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:
>>
>>> Andrew Morton <akpm@linux-foundation.org> writes:
>>>
>>>> On Tue, 7 Jan 2025 14:47:24 +0530 Nikhil Dhama <nikhil.dhama@amd.com> wrote:
>>>>
>>>>> In current PCP auto-tuning desgin, free_count was introduced to track
>>>>> the consecutive page freeing with a counter, This counter is incremented
>>>>> by the exact amount of pages that are freed, but reduced by half on
>>>>> allocation. This is causing a 2-node iperf3 client to server's network
>>>>> bandwidth to drop by 30% if we scale number of client-server pairs from 32
>>>>> (where we achieved peak network bandwidth) to 64.
>>>>>
>>>>> To fix this issue, on allocation, reduce free_count by the exact number
>>>>> of pages that are allocated instead of halving it.
>>>> The present division by two appears to be somewhat randomly chosen.
>>>> And as far as I can tell, this patch proposes replacing that with
>>>> another somewhat random adjustment.
>>>>
>>>> What's the actual design here? What are we attempting to do and why,
>>>> and why is the proposed design superior to the present one?
>>> Cc Mel for the original design.
>>>
>>> IIUC, pcp->free_count is used to identify the consecutive, pure, large
>>> number of page freeing pattern. For that pattern, larger batch will be
>>> used to free pages from PCP to buddy to improve the performance. Mixed
>>> free/allocation pattern should not make pcp->free_count large, even if
>>> the number of the pages freed is much larger than that of the pages
>>> allocated in the long run. So, pcp->free_count decreases rapidly for
>>> the page allocation.
>>>
>>> Hi, Mel, please correct me if my understanding isn't correct.
>>>
>> hm, no Mel.
>>
>> Nikhil, please do continue to work on this - it seems that there will
>> be a significant benefit to retuning this.
>
>
> Hi Andrew,
>
> I have analyzed the performance of different memory-sensitive workloads for these
> two different ways to decrement pcp->free_count. I compared the score amongst
> v6.6 mainline, v6.7 mainline and v6.7 with our patch.
>
> For all the benchmarks, I used a 2-socket AMD server with 382 logical CPUs.
>
> Results I got are as follows:
> All scores are normalized with respect to v6.6 (base).
>
>
> For all the benchmarks below (iperf3, lmbench3 unix, netperf, redis, gups, xsbench),
> a higher score is better.
>
> iperf3 lmbench3 Unix 1-node netperf 2-node netperf
> (AF_UNIX) (SCTP_STREAM_MANY) (SCTP_STREAM_MANY)
> ------- -------------- ------------------ ------------------
> v6.6 (base) 100 100 100 100
> v6.7 69 113.2 99 98.59
> v6.7 with my patch 100 112.1 100.3 101.16
>
>
> redis standard redis core redis L3 Heavy Gups xsbench
> -------------- ---------- -------------- ---- -------
> v6.6 (base) 100 100 100 100 100
> v6.7 99.45 101.66 99.47 100 98.14
> v6.7 with my patch 99.76 101.12 99.75 100 99.56
>
>
> and for graph500, hashjoin, pagerank and Kbuild, a lower score is better.
>
> graph500 hashjoin hashjoin pagerank Kbuild
> (THP always) (THP never)
> --------- ------------ ----------- -------- ------
> v6.6 (base) 100 100 100 100 100
> v6.7 101.08 101.3 101.9 100 98.8
> v6.7 with my patch 99.73 100 101.66 100 99.6
>
> from these result I can conclude that this patch is performing better
> or as good as base v6.7 on almost all of these workloads.
Sorry, this change doesn't make sense to me.
For example, if a large process exits on a CPU, pcp->free_count will
increase on this CPU. This is good, because the process can free pages
more quickly while exiting thanks to the larger batching. However, after
that, pcp->free_count may stay large for a long duration unless a large
number of page allocations (without a large number of page freeings) are
done on the CPU. So, the page freeing parameter may be influenced by some
unrelated workload for a long time. That doesn't sound good.
In effect, a larger pcp->free_count will increase the page freeing batch
size. That will improve page freeing throughput but hurt page freeing
latency. Please check the page freeing latency too. If a larger batch
number helps performance without regressions, just increase the batch
number directly instead of playing with pcp->free_count.
And, do you run network-related workloads on one machine? If so, please
try to run them on two machines instead, with clients and servers running
on different machines. At least, please use different sockets for clients
and servers, because a larger pcp->free_count will make it easier to
trigger the free_high heuristic. If that is the case, please try to
optimize the free_high heuristic directly too.
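For reference, the free_high check in question looks roughly like this (a
trimmed sketch of the test in free_unref_page_commit(), not a complete
quote of the source):

	/* only for 0 < order <= PAGE_ALLOC_COSTLY_ORDER */
	free_high = (pcp->free_count >= batch &&
		     (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) &&
		     (!(pcp->flags & PCPF_FREE_HIGH_BATCH) ||
		      pcp->count >= READ_ONCE(batch)));

so a larger pcp->free_count makes the first clause pass more often.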
---
Best Regards,
Huang, Ying
* Re: [FIX PATCH] mm: pcp: fix pcp->free_count reduction on page allocation
2025-02-12 8:40 ` Huang, Ying
@ 2025-02-12 10:06 ` Nikhil Dhama
2025-03-19 8:14 ` [PATCH -V2] mm: pcp: scale batch to reduce number of high order pcp flushes on deallocation Nikhil Dhama
1 sibling, 0 replies; 12+ messages in thread
From: Nikhil Dhama @ 2025-02-12 10:06 UTC (permalink / raw)
To: Huang, Ying, Nikhil Dhama
Cc: akpm, bharata, huang.ying.caritas, linux-kernel, linux-mm,
mgorman, raghavendra.kodsarathimmappa
On 2/12/2025 2:10 PM, Huang, Ying wrote:
>
> Nikhil Dhama <nikhil.dhama@amd.com> writes:
>
>> On 1/29/2025 10:01 AM, Andrew Morton wrote:
>>> On Wed, 15 Jan 2025 19:19:02 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:
>>>
>>>> Andrew Morton <akpm@linux-foundation.org> writes:
>>>>
>>>>> On Tue, 7 Jan 2025 14:47:24 +0530 Nikhil Dhama <nikhil.dhama@amd.com> wrote:
>>>>>
>>>>>> In current PCP auto-tuning desgin, free_count was introduced to track
>>>>>> the consecutive page freeing with a counter, This counter is incremented
>>>>>> by the exact amount of pages that are freed, but reduced by half on
>>>>>> allocation. This is causing a 2-node iperf3 client to server's network
>>>>>> bandwidth to drop by 30% if we scale number of client-server pairs from 32
>>>>>> (where we achieved peak network bandwidth) to 64.
>>>>>>
>>>>>> To fix this issue, on allocation, reduce free_count by the exact number
>>>>>> of pages that are allocated instead of halving it.
>>>>> The present division by two appears to be somewhat randomly chosen.
>>>>> And as far as I can tell, this patch proposes replacing that with
>>>>> another somewhat random adjustment.
>>>>>
>>>>> What's the actual design here? What are we attempting to do and why,
>>>>> and why is the proposed design superior to the present one?
>>>> Cc Mel for the original design.
>>>>
>>>> IIUC, pcp->free_count is used to identify the consecutive, pure, large
>>>> number of page freeing pattern. For that pattern, larger batch will be
>>>> used to free pages from PCP to buddy to improve the performance. Mixed
>>>> free/allocation pattern should not make pcp->free_count large, even if
>>>> the number of the pages freed is much larger than that of the pages
>>>> allocated in the long run. So, pcp->free_count decreases rapidly for
>>>> the page allocation.
>>>>
>>>> Hi, Mel, please correct me if my understanding isn't correct.
>>>>
>>> hm, no Mel.
>>>
>>> Nikhil, please do continue to work on this - it seems that there will
>>> be a significant benefit to retuning this.
>>
>> Hi Andrew,
>>
>> I have analyzed the performance of different memory-sensitive workloads for these
>> two different ways to decrement pcp->free_count. I compared the score amongst
>> v6.6 mainline, v6.7 mainline and v6.7 with our patch.
>>
>> For all the benchmarks, I used a 2-socket AMD server with 382 logical CPUs.
>>
>> Results I got are as follows:
>> All scores are normalized with respect to v6.6 (base).
>>
>>
>> For all the benchmarks below (iperf3, lmbench3 unix, netperf, redis, gups, xsbench),
>> a higher score is better.
>>
>> iperf3 lmbench3 Unix 1-node netperf 2-node netperf
>> (AF_UNIX) (SCTP_STREAM_MANY) (SCTP_STREAM_MANY)
>> ------- -------------- ------------------ ------------------
>> v6.6 (base) 100 100 100 100
>> v6.7 69 113.2 99 98.59
>> v6.7 with my patch 100 112.1 100.3 101.16
>>
>>
>> redis standard redis core redis L3 Heavy Gups xsbench
>> -------------- ---------- -------------- ---- -------
>> v6.6 (base) 100 100 100 100 100
>> v6.7 99.45 101.66 99.47 100 98.14
>> v6.7 with my patch 99.76 101.12 99.75 100 99.56
>>
>>
>> and for graph500, hashjoin, pagerank and Kbuild, a lower score is better.
>>
>> graph500 hashjoin hashjoin pagerank Kbuild
>> (THP always) (THP never)
>> --------- ------------ ----------- -------- ------
>> v6.6 (base) 100 100 100 100 100
>> v6.7 101.08 101.3 101.9 100 98.8
>> v6.7 with my patch 99.73 100 101.66 100 99.6
>>
>> from these result I can conclude that this patch is performing better
>> or as good as base v6.7 on almost all of these workloads.
> Sorry, this change doesn't make sense to me.
>
> For example, if a large size process exits on a CPU, pcp->free_count
> will increase on this CPU. This is good, because the process can free
> pages quicker during exiting with the larger batching. However, after
> that, pcp->free_count may be kept large for a long duration unless a
> large number of page allocation (without large number of page freeing)
> are done on the CPU. So, the page freeing parameter may be influenced
> by some unrelated workload for long time. That doesn't sound good.
>
> In effect, the larger pcp->free_count will increase page freeing batch
> size. That will improve the page freeing throughput but hurt page
> freeing latency. Please check the page freeing latency too. If larger
> batch number helps performance without regressions, just increase batch
> number directly instead of playing with pcp->free_count.
Okay, I will check the page freeing latency too, and will check whether a
larger batch number helps.
> And, do you run network related workloads on one machine? If so, please
> try to run them on two machines instead, with clients and servers run on
> different machines. At least, please use different sockets for clients
> and servers. Because larger pcp->free_count will make it easier to
> trigger free_high heuristics. If that is the case, please try to
> optimize free_high heuristics directly too.
I ran iperf3 and 2-node netperf on two machines, with clients and servers
running on different machines. And I ran 1-node netperf on a single
(2-socket) machine with clients and servers running on different sockets.
* [PATCH -V2] mm: pcp: scale batch to reduce number of high order pcp flushes on deallocation
2025-02-12 8:40 ` Huang, Ying
2025-02-12 10:06 ` Nikhil Dhama
@ 2025-03-19 8:14 ` Nikhil Dhama
2025-03-25 8:00 ` Raghavendra K T
1 sibling, 1 reply; 12+ messages in thread
From: Nikhil Dhama @ 2025-03-19 8:14 UTC (permalink / raw)
To: akpm, ying.huang
Cc: Nikhil Dhama, Ying Huang, linux-mm, linux-kernel, Bharata B Rao,
Raghavendra
On 2/12/2025 2:10 PM, Huang, Ying <ying.huang@linux.alibaba.com> wrote:
>
> Nikhil Dhama <nikhil.dhama@amd.com> writes:
>
>> On 1/29/2025 10:01 AM, Andrew Morton wrote:
>>> On Wed, 15 Jan 2025 19:19:02 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:
>>>
>>>> Andrew Morton <akpm@linux-foundation.org> writes:
>>>>
>>>>> On Tue, 7 Jan 2025 14:47:24 +0530 Nikhil Dhama <nikhil.dhama@amd.com> wrote:
>>>>>
>>>>>> In current PCP auto-tuning desgin, free_count was introduced to track
>>>>>> the consecutive page freeing with a counter, This counter is incremented
>>>>>> by the exact amount of pages that are freed, but reduced by half on
>>>>>> allocation. This is causing a 2-node iperf3 client to server's network
>>>>>> bandwidth to drop by 30% if we scale number of client-server pairs from 32
>>>>>> (where we achieved peak network bandwidth) to 64.
>>>>>>
>>>>>> To fix this issue, on allocation, reduce free_count by the exact number
>>>>>> of pages that are allocated instead of halving it.
>>>>> The present division by two appears to be somewhat randomly chosen.
>>>>> And as far as I can tell, this patch proposes replacing that with
>>>>> another somewhat random adjustment.
>>>>>
>>>>> What's the actual design here? What are we attempting to do and why,
>>>>> and why is the proposed design superior to the present one?
>>>> Cc Mel for the original design.
>>>>
>>>> IIUC, pcp->free_count is used to identify the consecutive, pure, large
>>>> number of page freeing pattern. For that pattern, larger batch will be
>>>> used to free pages from PCP to buddy to improve the performance. Mixed
>>>> free/allocation pattern should not make pcp->free_count large, even if
>>>> the number of the pages freed is much larger than that of the pages
>>>> allocated in the long run. So, pcp->free_count decreases rapidly for
>>>> the page allocation.
>>>>
>>>> Hi, Mel, please correct me if my understanding isn't correct.
>>>>
>>> hm, no Mel.
>>>
>>> Nikhil, please do continue to work on this - it seems that there will
>>> be a significant benefit to retuning this.
>>
>> Hi Andrew,
>>
>> I have analyzed the performance of different memory-sensitive workloads for these
>> two different ways to decrement pcp->free_count. I compared the score amongst
>> v6.6 mainline, v6.7 mainline and v6.7 with our patch.
>>
>> For all the benchmarks, I used a 2-socket AMD server with 382 logical CPUs.
>>
>> Results I got are as follows:
>> All scores are normalized with respect to v6.6 (base).
>>
>>
>> For all the benchmarks below (iperf3, lmbench3 unix, netperf, redis, gups, xsbench),
>> a higher score is better.
>>
>> iperf3 lmbench3 Unix 1-node netperf 2-node netperf
>> (AF_UNIX) (SCTP_STREAM_MANY) (SCTP_STREAM_MANY)
>> ------- -------------- ------------------ ------------------
>> v6.6 (base) 100 100 100 100
>> v6.7 69 113.2 99 98.59
>> v6.7 with my patch 100 112.1 100.3 101.16
>>
>>
>> redis standard redis core redis L3 Heavy Gups xsbench
>> -------------- ---------- -------------- ---- -------
>> v6.6 (base) 100 100 100 100 100
>> v6.7 99.45 101.66 99.47 100 98.14
>> v6.7 with my patch 99.76 101.12 99.75 100 99.56
>>
>>
>> and for graph500, hashjoin, pagerank and Kbuild, a lower score is better.
>>
>> graph500 hashjoin hashjoin pagerank Kbuild
>> (THP always) (THP never)
>> --------- ------------ ----------- -------- ------
>> v6.6 (base) 100 100 100 100 100
>> v6.7 101.08 101.3 101.9 100 98.8
>> v6.7 with my patch 99.73 100 101.66 100 99.6
>>
>> from these result I can conclude that this patch is performing better
>> or as good as base v6.7 on almost all of these workloads.
> Sorry, this change doesn't make sense to me.
>
> For example, if a large size process exits on a CPU, pcp->free_count
> will increase on this CPU. This is good, because the process can free
> pages quicker during exiting with the larger batching. However, after
> that, pcp->free_count may be kept large for a long duration unless a
> large number of page allocation (without large number of page freeing)
> are done on the CPU. So, the page freeing parameter may be influenced
> by some unrelated workload for long time. That doesn't sound good.
>
> In effect, the larger pcp->free_count will increase page freeing batch
> size. That will improve the page freeing throughput but hurt page
> freeing latency. Please check the page freeing latency too. If larger
> batch number helps performance without regressions, just increase batch
> number directly instead of playing with pcp->free_count.
> And, do you run network related workloads on one machine? If so, please
> try to run them on two machines instead, with clients and servers run on
> different machines. At least, please use different sockets for clients
> and servers. Because larger pcp->free_count will make it easier to
> trigger free_high heuristics. If that is the case, please try to
> optimize free_high heuristics directly too.
I agree with Ying Huang: the above change is not the best possible fix for
the issue. On further analysis I found that the root cause of the issue is
the frequent pcp high-order flushes. During a 20-second iperf3 run I
observed on average 5 pcp high-order flushes in kernel v6.6, whereas in
v6.7 I observed about 170 pcp high-order flushes.
Tracing pcp->free_count, I found that with patch v1 (the patch I suggested
earlier) free_count goes negative, which reduces the number of times the
free_high heuristic is triggered and hence reduces the high-order flushes.
As Ying Huang suggested, increasing the batch size for the free_high
heuristic helps performance. I tried different scaling factors to find the
most suitable batch value for the free_high heuristic:
                     score   # free_high
                     -----   -----------
v6.6 (base)           100          4
v6.12 (batch*1)        69        170
batch*2                69        150
batch*4                74        101
batch*5               100         53
batch*6               100         36
batch*8               100          3
Scaling the batch for the free_high heuristic by a factor of 5 restores
the performance.
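To put those factors in terms of pages: the scale factor sets how large a
consecutive freeing run must be, relative to the pcp batch, before
free_high can fire at all. A trivial sketch, assuming a batch of 63 pages
purely for illustration (the actual batch is configuration dependent):

#include <stdio.h>

int main(void)
{
	long batch = 63;	/* assumed example value */
	int scales[] = { 1, 2, 4, 5, 6, 8 };
	unsigned int i;

	for (i = 0; i < sizeof(scales) / sizeof(scales[0]); i++)
		printf("batch*%d -> free_high possible once free_count >= %ld pages\n",
		       scales[i], batch * scales[i]);
	return 0;
}

With a factor of 5, the heuristic only engages after a freeing run roughly
five times longer than before, which is consistent with the drop in
free_high counts measured above.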
On the 2-node AMD machine, scores for other benchmarks with patch v2 are
as follows:
                      iperf3   lmbench3    netperf              kbuild
                               (AF_UNIX)   (SCTP_STREAM_MANY)
                      ------   ---------   ------------------   ------
v6.6 (base)            100       100         100                 100
v6.12                   69       113          98.5                98.8
v6.12 with patch v2    100       112.5       100.1                99.6
For the network workloads, clients and servers run on different machines
connected via a Mellanox Connect-7 NIC.
number of free_high:
                      iperf3   lmbench3    netperf              kbuild
                               (AF_UNIX)   (SCTP_STREAM_MANY)
                      ------   ---------   ------------------   ------
v6.6 (base)              5        12          6                    2
v6.12                  170        11         92                    2
v6.12 with patch v2     58        11         34                    2
Signed-off-by: Nikhil Dhama <nikhil.dhama@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ying Huang <huang.ying.caritas@gmail.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: Bharata B Rao <bharata@amd.com>
Cc: Raghavendra <raghavendra.kodsarathimmappa@amd.com>
---
mm/page_alloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b6958333054d..326d5fbae353 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2617,7 +2617,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
* stops will be drained from vmstat refresh context.
*/
if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
- free_high = (pcp->free_count >= batch &&
+ free_high = (pcp->free_count >= (batch*5) &&
(pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) &&
(!(pcp->flags & PCPF_FREE_HIGH_BATCH) ||
pcp->count >= READ_ONCE(batch)));
--
2.25.1
* Re: [PATCH -V2] mm: pcp: scale batch to reduce number of high order pcp flushes on deallocation
2025-03-19 8:14 ` [PATCH -V2] mm: pcp: scale batch to reduce number of high order pcp flushes on deallocation Nikhil Dhama
@ 2025-03-25 8:00 ` Raghavendra K T
2025-03-25 17:23 ` Nikhil Dhama
0 siblings, 1 reply; 12+ messages in thread
From: Raghavendra K T @ 2025-03-25 8:00 UTC (permalink / raw)
To: Nikhil Dhama, akpm, ying.huang
Cc: Ying Huang, linux-mm, linux-kernel, Bharata B Rao, Raghavendra
On 3/19/2025 1:44 PM, Nikhil Dhama wrote:
[...]
>> And, do you run network related workloads on one machine? If so, please
>> try to run them on two machines instead, with clients and servers run on
>> different machines. At least, please use different sockets for clients
>> and servers. Because larger pcp->free_count will make it easier to
>> trigger free_high heuristics. If that is the case, please try to
>> optimize free_high heuristics directly too.
>
> I agree with Ying Huang, the above change is not the best possible fix for
> the issue. On futher analysis I figured that root cause of the issue is
> the frequent pcp high order flushes. During a 20sec iperf3 run
> I observed on avg 5 pcp high order flushes in kernel v6.6, whereas, in
> v6.7, I observed about 170 pcp high order flushes.
> Tracing pcp->free_count, I figured with the patch v1 (patch I suggested
> earlier) free_count is going into negatives which reduces the number of
> times free_high heuristics is triggered hence reducing the high order
> flushes.
>
> As Ying Huang Suggested, it helps the performance on increasing the batch size
> for free_high heuristics. I tried different scaling factors to find best
> suitable batch value for free_high heuristics,
>
>
> score # free_high
> ----------- ----- -----------
> v6.6 (base) 100 4
> v6.12 (batch*1) 69 170
> batch*2 69 150
> batch*4 74 101
> batch*5 100 53
> batch*6 100 36
> batch*8 100 3
>
> scaling batch for free_high heuristics with a factor of 5 restores the
> performance.
Hello Nikhil,
Thanks for looking further into this. But from a design standpoint, it is
not clear how a batch scale factor of 5 helps here (Andrew's original
question).
In any case, can you post the patch set in a new email so that the patch
below is not lost in the discussion thread?
>
> On AMD 2-node machine, score for other benchmarks with patch v2
> are as follows:
>
> iperf3 lmbench3 netperf kbuild
> (AF_UNIX) (SCTP_STREAM_MANY)
> ------- --------- ----------------- ------
> v6.6 (base) 100 100 100 100
> v6.12 69 113 98.5 98.8
> v6.12 with patch v2 100 112.5 100.1 99.6
>
> for network workloads, clients and server are running on different
> machines conneted via Mellanox Connect-7 NIC.
>
> number of free_high:
> iperf3 lmbench3 netperf kbuild
> (AF_UNIX) (SCTP_STREAM_MANY)
> ------- --------- ----------------- ------
> v6.6 (base) 5 12 6 2
> v6.12 170 11 92 2
> v6.12 with patch v2 58 11 34 2
>
>
> Signed-off-by: Nikhil Dhama <nikhil.dhama@amd.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Ying Huang <huang.ying.caritas@gmail.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Bharata B Rao <bharata@amd.com>
> Cc: Raghavendra <raghavendra.kodsarathimmappa@amd.com>
> ---
> mm/page_alloc.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b6958333054d..326d5fbae353 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2617,7 +2617,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
> * stops will be drained from vmstat refresh context.
> */
> if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
> - free_high = (pcp->free_count >= batch &&
> + free_high = (pcp->free_count >= (batch*5) &&
> (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) &&
> (!(pcp->flags & PCPF_FREE_HIGH_BATCH) ||
> pcp->count >= READ_ONCE(batch)));
* Re: [PATCH -V2] mm: pcp: scale batch to reduce number of high order pcp flushes on deallocation
2025-03-25 8:00 ` Raghavendra K T
@ 2025-03-25 17:23 ` Nikhil Dhama
0 siblings, 0 replies; 12+ messages in thread
From: Nikhil Dhama @ 2025-03-25 17:23 UTC (permalink / raw)
To: Raghavendra K T, Nikhil Dhama, akpm, ying.huang
Cc: Ying Huang, linux-mm, linux-kernel, Bharata B Rao, Raghavendra
On 3/25/2025 1:30 PM, Raghavendra K T wrote:
> On 3/19/2025 1:44 PM, Nikhil Dhama wrote:
> [...]
>>> And, do you run network related workloads on one machine? If so,
>>> please
>>> try to run them on two machines instead, with clients and servers
>>> run on
>>> different machines. At least, please use different sockets for clients
>>> and servers. Because larger pcp->free_count will make it easier to
>>> trigger free_high heuristics. If that is the case, please try to
>>> optimize free_high heuristics directly too.
>>
>> I agree with Ying Huang, the above change is not the best possible
>> fix for
>> the issue. On futher analysis I figured that root cause of the issue is
>> the frequent pcp high order flushes. During a 20sec iperf3 run
>> I observed on avg 5 pcp high order flushes in kernel v6.6, whereas, in
>> v6.7, I observed about 170 pcp high order flushes.
>> Tracing pcp->free_count, I figured with the patch v1 (patch I suggested
>> earlier) free_count is going into negatives which reduces the number of
>> times free_high heuristics is triggered hence reducing the high order
>> flushes.
>>
>> As Ying Huang Suggested, it helps the performance on increasing the
>> batch size
>> for free_high heuristics. I tried different scaling factors to find best
>> suitable batch value for free_high heuristics,
>>
>>
>> score # free_high
>> ----------- ----- -----------
>> v6.6 (base) 100 4
>> v6.12 (batch*1) 69 170
>> batch*2 69 150
>> batch*4 74 101
>> batch*5 100 53
>> batch*6 100 36
>> batch*8 100 3
>> scaling batch for free_high heuristics with a factor of 5 restores
>> the
>> performance.
>
> Hello Nikhil,
>
> Thanks for looking further on this. But from design standpoint,
> how a batch-size of 5 is helping here is not clear (Andrew's original
> question).
>
> Any case can you post the patch-set in a new email so that the below
> patch is not lost in discussion thread?
Hi Raghavendra,
Thanks, I have posted the patch set in a new email:
https://lore.kernel.org/linux-mm/20250325171915.14384-1-nikhil.dhama@amd.com/
with a better explanation of how scaling the batch helps here.
Thanks,
Nikhil