* [PATCH] mm: pcp: scale batch to reduce number of high order pcp flushes on deallocation
@ 2025-03-25 17:19 Nikhil Dhama
2025-03-30 6:52 ` Huang, Ying
2025-03-31 14:10 ` kernel test robot
0 siblings, 2 replies; 8+ messages in thread
From: Nikhil Dhama @ 2025-03-25 17:19 UTC (permalink / raw)
To: akpm, ying.huang
Cc: Nikhil Dhama, Ying Huang, linux-mm, linux-kernel, Bharata B Rao,
Raghavendra
In the old pcp design, pcp->free_factor was incremented in nr_pcp_free(),
which is invoked by free_pcppages_bulk(). So free_factor was increased by 1
only when we tried to reduce the size of the pcp list or flush it for high
order. free_high used to trigger only for order > 0, order < costly_order
and free_factor > 0, and free_factor was scaled down by a factor of 2 on
every successful allocation.
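
To make the old behaviour concrete, here is a small self-contained sketch of
the v6.6 heuristic as described above. It is paraphrased for illustration
only; the struct and helper names are stand-ins, not the actual kernel
definitions:

/* Illustrative sketch of the v6.6 pcp->free_factor behaviour (not verbatim). */
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER 3	/* stand-in for the kernel constant */

struct pcp_sketch {
	int count;		/* pages currently on the pcp list */
	int free_factor;	/* grows only when the list is actually trimmed */
	int batch;
};

/* nr_pcp_free() path: free_factor is bumped only when free_pcppages_bulk()
 * trims the pcp list (or flushes it for high order). */
static void v66_on_pcp_trim(struct pcp_sketch *pcp)
{
	pcp->free_factor++;
}

/* free_high fired only for 0 < order <= PAGE_ALLOC_COSTLY_ORDER while
 * free_factor was non-zero. */
static bool v66_free_high(const struct pcp_sketch *pcp, unsigned int order)
{
	return pcp->free_factor && order && order <= PAGE_ALLOC_COSTLY_ORDER;
}

/* Every successful allocation scaled free_factor down by a factor of 2. */
static void v66_on_alloc(struct pcp_sketch *pcp)
{
	pcp->free_factor >>= 1;
}
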
For iperf3 I noticed that with the older design in kernel v6.6, the pcp list
was drained mostly when pcp->count > high (most often when count went above
530), and most of the time free_factor was 0, triggering very few high-order
flushes.
In the current design, free_factor has been replaced by free_count to keep
track of the number of pages freed contiguously. With this design, for iperf3
the pcp list is flushed more frequently because the free_high heuristic is
triggered more often.
In the current design, free_count is incremented on every deallocation,
irrespective of whether the pcp list was reduced or not, and free_high is
triggered when free_count goes above batch (which is 63) and there are two
contiguous high-order frees without any allocation in between (subject to
the cache-slice optimisation).
With this design, I observed that the high-order pcp list is drained as soon
as both count and free_count go above 63. Due to this more aggressive
high-order flushing, applications doing contiguous high-order allocations
have to go to the global list more frequently.
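
For reference, the trigger described above is the condition checked in
free_unref_page_commit(); the unmodified check (the same condition appears in
the hunk at the end of this patch) reads as follows, with comments mapping
each clause back to the description:

	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
		free_high = (pcp->free_count >= batch &&		  /* enough pages freed back-to-back (batch = 63) */
			     (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) &&  /* previous free was also a high-order free     */
			     (!(pcp->flags & PCPF_FREE_HIGH_BATCH) ||	  /* cache-slice optimisation                     */
			      pcp->count >= READ_ONCE(batch)));
	}
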
On a 2-node AMD machine with 384 vCPUs on each node, connected via a Mellanox
ConnectX-7 NIC, I see a ~30% performance reduction when scaling the number of
iperf3 client/server pairs from 32 to 64.
So, although the new design detects the need for high-order flushes sooner,
for applications that allocate high-order pages more frequently it may flush
the high-order list prematurely. This motivates tuning how early or late we
flush the high-order lists. To that end, I tried scaling the batch threshold
used by the free_high heuristic, which delays the free_high flushes:

score # free_high
----------- ----- -----------
v6.6 (base) 100 4
v6.12 (batch*1) 69 170
batch*2 69 150
batch*4 74 101
batch*5 100 53
batch*6 100 36
batch*8 100 3
Scaling batch for the free_high heuristic by a factor of 5 or more restores
the performance, as it reduces the number of high-order flushes. On the
2-node AMD server with 384 vCPUs per node, scores for iperf3 and other
benchmarks with patch v2 are as follows:
iperf3 lmbench3 netperf kbuild
(AF_UNIX) (SCTP_STREAM_MANY)
------- --------- ----------------- ------
v6.6 (base) 100 100 100 100
v6.12 69 113 98.5 98.8
v6.12 with patch 100 112.5 100.1 99.6
For the network workloads, clients and servers run on different machines
connected via a Mellanox ConnectX-7 NIC.
Number of free_high flushes:
iperf3 lmbench3 netperf kbuild
(AF_UNIX) (SCTP_STREAM_MANY)
------- --------- ----------------- ------
v6.6 (base) 5 12 6 2
v6.12 170 11 92 2
v6.12 with patch 58 11 34 2
Signed-off-by: Nikhil Dhama <nikhil.dhama@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ying Huang <huang.ying.caritas@gmail.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: Bharata B Rao <bharata@amd.com>
Cc: Raghavendra <raghavendra.kodsarathimmappa@amd.com>
---
mm/page_alloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b6958333054d..326d5fbae353 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2617,7 +2617,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
* stops will be drained from vmstat refresh context.
*/
if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
- free_high = (pcp->free_count >= batch &&
+ free_high = (pcp->free_count >= (batch*5) &&
(pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) &&
(!(pcp->flags & PCPF_FREE_HIGH_BATCH) ||
pcp->count >= READ_ONCE(batch)));
--
2.25.1
^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] mm: pcp: scale batch to reduce number of high order pcp flushes on deallocation
  2025-03-25 17:19 [PATCH] mm: pcp: scale batch to reduce number of high order pcp flushes on deallocation Nikhil Dhama
@ 2025-03-30  6:52 ` Huang, Ying
  2025-03-31 14:10 ` kernel test robot
  1 sibling, 0 replies; 8+ messages in thread
From: Huang, Ying @ 2025-03-30 6:52 UTC (permalink / raw)
To: Nikhil Dhama
Cc: akpm, Ying Huang, linux-mm, linux-kernel, Bharata B Rao,
    Raghavendra, Mel Gorman

Hi, Nikhil,

Nikhil Dhama <nikhil.dhama@amd.com> writes:

> [...]
>
> Scaling batch for the free_high heuristic by a factor of 5 or more restores
> the performance, as it reduces the number of high-order flushes.
>
> On the 2-node AMD server with 384 vCPUs per node, scores for iperf3 and
> other benchmarks with patch v2 are as follows:

Em..., IIUC, this may disable the free_high optimizations.  free_high
optimization is introduced by Mel Gorman in commit f26b3fa04611
("mm/page_alloc: limit number of high-order pages on PCP during bulk
free").  So, this may trigger regression for the workloads in the
commit.  Can you try it too?

> [remaining quoted benchmark tables, sign-off and diff snipped]

---
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH] mm: pcp: scale batch to reduce number of high order pcp flushes on deallocation
  2025-03-25 17:19 [PATCH] mm: pcp: scale batch to reduce number of high order pcp flushes on deallocation Nikhil Dhama
  2025-03-30  6:52 ` Huang, Ying
@ 2025-03-31 14:10 ` kernel test robot
  2025-04-01 13:56 ` Nikhil Dhama
  1 sibling, 1 reply; 8+ messages in thread
From: kernel test robot @ 2025-03-31 14:10 UTC (permalink / raw)
To: Nikhil Dhama
Cc: oe-lkp, lkp, Andrew Morton, Ying Huang, Bharata B Rao,
    Raghavendra, linux-mm, ying.huang, Nikhil Dhama, linux-kernel,
    oliver.sang

Hello,

kernel test robot noticed a 32.2% improvement of
lmbench3.TCP.socket.bandwidth.10MB.MB/sec on:

commit: 6570c41610d1d2d3b143c253010b38ce9cbc0012 ("[PATCH] mm: pcp: scale batch to reduce number of high order pcp flushes on deallocation")
url: https://github.com/intel-lab-lkp/linux/commits/Nikhil-Dhama/mm-pcp-scale-batch-to-reduce-number-of-high-order-pcp-flushes-on-deallocation/20250326-012247
base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/all/20250325171915.14384-1-nikhil.dhama@amd.com/
patch subject: [PATCH] mm: pcp: scale batch to reduce number of high order pcp flushes on deallocation

testcase: lmbench3
config: x86_64-rhel-9.4
compiler: gcc-12
test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 512G memory
parameters:

	test_memory_size: 50%
	nr_threads: 100%
	mode: development
	test: TCP
	cpufreq_governor: performance

Details are as below:
-------------------------------------------------------------------------------------------------->

The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250331/202503312148.c74b0351-lkp@intel.com

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_threads/rootfs/tbox_group/test/test_memory_size/testcase:
  gcc-12/performance/x86_64-rhel-9.4/development/100%/debian-12-x86_64-20240206.cgz/lkp-spr-2sp4/TCP/50%/lmbench3

commit:
  7514d3cb91 ("foo")
  6570c41610 ("mm: pcp: scale batch to reduce number of high order pcp flushes on deallocation")

    7514d3cb916f9344   6570c41610d1d2d3b143c253010
    ----------------   ---------------------------
         %stddev     %change         %stddev
             \          |                \
    143.28 ± 38%     +49.0%     213.49 ± 20%  numa-vmstat.node1.nr_anon_transparent_hugepages
    118.00 ± 21%     +50.3%     177.33 ± 17%  perf-c2c.DRAM.local
    182485           +32.2%     241267        lmbench3.TCP.socket.bandwidth.10MB.MB/sec
  40582104 ±  6%    +114.4%   87026622 ±  2%  lmbench3.time.involuntary_context_switches
      0.46 ±  2%      +0.1        0.52 ±  3%  mpstat.cpu.all.irq%
      4.57 ± 11%      +1.4        5.96 ±  6%  mpstat.cpu.all.soft%
    291657 ± 38%     +49.6%     436355 ± 20%  numa-meminfo.node1.AnonHugePages
   4728254 ± 36%     +32.0%    6241931 ± 26%  numa-meminfo.node1.MemUsed
      0.40           -24.4%       0.30 ± 12%  perf-sched.wait_time.avg.ms.do_wait.kernel_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe
     13.88 ±  3%     -78.2%       3.03 ±157%  perf-sched.wait_time.max.ms.do_wait.kernel_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe
      1.50 ±  4%    +670.3%      11.58 ± 38%  perf-sched.wait_time.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
 1.209e+09 ±  3%      +6.5%  1.288e+09        proc-vmstat.numa_hit
 1.209e+09 ±  3%      +6.5%  1.287e+09        proc-vmstat.numa_local
 9.644e+09 ±  3%      +6.6%  1.028e+10        proc-vmstat.pgalloc_normal
 9.644e+09 ±  3%      +6.6%  1.028e+10        proc-vmstat.pgfree
  92870937 ± 14%     -17.9%   76271910 ±  8%  sched_debug.cfs_rq:/.avg_vruntime.avg
      2343 ± 10%     -17.3%       1938 ± 17%  sched_debug.cfs_rq:/.load.min
  92870938 ± 14%     -17.9%   76271910 ±  8%  sched_debug.cfs_rq:/.min_vruntime.avg
     13803 ± 10%     -22.2%      10740 ± 14%  sched_debug.cpu.curr->pid.min
      2.87 ±  9%     +69.1%       4.85 ±  4%  perf-stat.i.MPKI
      0.31 ±  6%      +0.0        0.34 ±  3%  perf-stat.i.branch-miss-rate%
     13.92            +1.1       15.06        perf-stat.i.cache-miss-rate%
 2.719e+08 ±  9%     +27.6%  3.469e+08 ±  4%  perf-stat.i.cache-misses
 5.658e+11            -2.5%  5.516e+11        perf-stat.i.cpu-cycles
 3.618e+11 ±  7%     +10.5%  3.996e+11 ±  4%  perf-stat.i.instructions
      1.64 ±  9%     -42.0%       0.95 ± 70%  perf-stat.overall.cpi
      2233 ± 11%     -50.7%       1100 ± 71%  perf-stat.overall.cycles-between-cache-misses
 5.691e+11           -35.0%  3.702e+11 ± 70%  perf-stat.ps.cpu-cycles

Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH] mm: pcp: scale batch to reduce number of high order pcp flushes on deallocation
  2025-03-31 14:10 ` kernel test robot
@ 2025-04-01 13:56 ` Nikhil Dhama
  2025-04-03  1:36 ` Huang, Ying
  0 siblings, 1 reply; 8+ messages in thread
From: Nikhil Dhama @ 2025-04-01 13:56 UTC (permalink / raw)
To: ying.huang, akpm
Cc: bharata, huang.ying.caritas, linux-kernel, linux-mm, mgorman,
    raghavendra.kodsarathimmappa, oe-lkp, lkp, Nikhil Dhama

On 3/30/2025 12:22 PM, Huang, Ying wrote:

> Hi, Nikhil,
>
> Nikhil Dhama <nikhil.dhama@amd.com> writes:
>
> [...]
>
> Em..., IIUC, this may disable the free_high optimizations.  free_high
> optimization is introduced by Mel Gorman in commit f26b3fa04611
> ("mm/page_alloc: limit number of high-order pages on PCP during bulk
> free").  So, this may trigger regression for the workloads in the
> commit.  Can you try it too?

Hi, I ran netperf-tcp as in commit f26b3fa04611 ("mm/page_alloc: limit
number of high-order pages on PCP during bulk free").

On a 2-node AMD server with 384 vCPUs, results I observed are as follows:

                          6.12                      6.12
                        vanilla           freehigh-heuristicsopt
Hmean     64         732.14 (   0.00%)      736.90 (   0.65%)
Hmean    128        1417.46 (   0.00%)     1421.54 (   0.29%)
Hmean    256        2679.67 (   0.00%)     2689.68 (   0.37%)
Hmean   1024        8328.52 (   0.00%)     8413.94 (   1.03%)
Hmean   2048       12716.98 (   0.00%)    12838.94 (   0.96%)
Hmean   3312       15787.79 (   0.00%)    15822.40 (   0.22%)
Hmean   4096       17311.91 (   0.00%)    17328.74 (   0.10%)
Hmean   8192       20310.73 (   0.00%)    20447.12 (   0.67%)

It is not regressing for netperf-tcp.

Thanks,
Nikhil Dhama

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH] mm: pcp: scale batch to reduce number of high order pcp flushes on deallocation
  2025-04-01 13:56 ` Nikhil Dhama
@ 2025-04-03  1:36 ` Huang, Ying
  2025-04-07  6:32 ` Nikhil Dhama
  0 siblings, 1 reply; 8+ messages in thread
From: Huang, Ying @ 2025-04-03 1:36 UTC (permalink / raw)
To: Nikhil Dhama
Cc: akpm, bharata, huang.ying.caritas, linux-kernel, linux-mm, mgorman,
    raghavendra.kodsarathimmappa, oe-lkp, lkp

Nikhil Dhama <nikhil.dhama@amd.com> writes:

> On 3/30/2025 12:22 PM, Huang, Ying wrote:
>
> [...]
>
>> Em..., IIUC, this may disable the free_high optimizations.  free_high
>> optimization is introduced by Mel Gorman in commit f26b3fa04611
>> ("mm/page_alloc: limit number of high-order pages on PCP during bulk
>> free").  So, this may trigger regression for the workloads in the
>> commit.  Can you try it too?
>
> Hi, I ran netperf-tcp as in commit f26b3fa04611 ("mm/page_alloc: limit
> number of high-order pages on PCP during bulk free").
>
> On a 2-node AMD server with 384 vCPUs, results I observed are as follows:
>
> [...]
>
> It is not regressing for netperf-tcp.

Thanks a lot for your data!

Think about this again.  Compared with the pcp->free_factor solution,
the pcp->free_count solution will trigger the free_high heuristics
earlier, and this causes the performance regression in your workloads.
So, it's reasonable to raise the bar to trigger free_high.  And, it's
also reasonable to use a stricter threshold, as you have done in this
patch.  However, "5 * batch" appears too magic and adapted to one type
of machine.

Let's step back to do some analysis.  In the original pcp->free_factor
solution, free_high is triggered for contiguous freeing with size
ranging from "batch" to "pcp->high + batch".  So, the average value is
about "batch + pcp->high / 2".  While in the pcp->free_count solution,
free_high will be triggered for contiguous freeing with size "batch".
So, to restore the original behavior, it seems that we can use the
threshold "batch + pcp->high_min / 2".  Do you think that this is
reasonable?  If so, can you give it a try?

---
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 8+ messages in thread
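
For illustration, applying the suggested "batch + pcp->high_min / 2"
threshold to the condition from the patch hunk would look roughly like the
sketch below; this is only a sketch of the idea, not the actual v3 patch:

	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
		/* Raise the bar: trigger free_high only after roughly as many
		 * contiguous frees as the old free_factor scheme needed on
		 * average, i.e. batch + pcp->high_min / 2. */
		free_high = (pcp->free_count >= batch + pcp->high_min / 2 &&
			     (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) &&
			     (!(pcp->flags & PCPF_FREE_HIGH_BATCH) ||
			      pcp->count >= READ_ONCE(batch)));
	}
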
* Re: [PATCH] mm: pcp: scale batch to reduce number of high order pcp flushes on deallocation
  2025-04-03  1:36 ` Huang, Ying
@ 2025-04-07  6:32 ` Nikhil Dhama
  2025-04-07  7:38 ` Huang, Ying
  0 siblings, 1 reply; 8+ messages in thread
From: Nikhil Dhama @ 2025-04-07 6:32 UTC (permalink / raw)
To: ying.huang
Cc: akpm, bharata, huang.ying.caritas, linux-kernel, linux-mm, mgorman,
    raghavendra.kodsarathimmappa, oe-lkp, lkp, Nikhil Dhama

On 4/3/2025 7:06 AM, Huang, Ying wrote:

> [...]
>
> Let's step back to do some analysis.  In the original pcp->free_factor
> solution, free_high is triggered for contiguous freeing with size
> ranging from "batch" to "pcp->high + batch".  So, the average value is
> about "batch + pcp->high / 2".  While in the pcp->free_count solution,
> free_high will be triggered for contiguous freeing with size "batch".
> So, to restore the original behavior, it seems that we can use the
> threshold "batch + pcp->high_min / 2".  Do you think that this is
> reasonable?  If so, can you give it a try?

Hi,

I have tried your suggestion, setting the threshold to
"batch + pcp->high_min / 2".  Scores for different benchmarks on the same
machine (2-node AMD server with 384 vCPUs each) are as follows:

                       iperf3   lmbench3        netperf        kbuild
                                (AF_UNIX)  (SCTP_STREAM_MANY)
                       ------   ---------  -----------------   ------
v6.6 vanilla (base)      100       100           100             100
v6.12 vanilla             69       113            98.5            98.8
v6.12 avg_threshold      100       110.3         100.2            99.3

and for netperf-tcp, it is as follows:

                          6.12                      6.12
                        vanilla         avg_free_high_threshold
Hmean     64         732.14 (   0.00%)      730.45 (  -0.23%)
Hmean    128        1417.46 (   0.00%)     1419.44 (   0.14%)
Hmean    256        2679.67 (   0.00%)     2676.45 (  -0.12%)
Hmean   1024        8328.52 (   0.00%)     8339.34 (   0.13%)
Hmean   2048       12716.98 (   0.00%)    12743.68 (   0.21%)
Hmean   3312       15787.79 (   0.00%)    15887.25 (   0.63%)
Hmean   4096       17311.91 (   0.00%)    17332.68 (   0.12%)
Hmean   8192       20310.73 (   0.00%)    20465.09 (   0.76%)

Thanks,
Nikhil Dhama

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH] mm: pcp: scale batch to reduce number of high order pcp flushes on deallocation
  2025-04-07  6:32 ` Nikhil Dhama
@ 2025-04-07  7:38 ` Huang, Ying
  2025-04-07 11:03 ` Nikhil Dhama
  0 siblings, 1 reply; 8+ messages in thread
From: Huang, Ying @ 2025-04-07 7:38 UTC (permalink / raw)
To: Nikhil Dhama
Cc: akpm, bharata, huang.ying.caritas, linux-kernel, linux-mm, mgorman,
    raghavendra.kodsarathimmappa, oe-lkp, lkp

Nikhil Dhama <nikhil.dhama@amd.com> writes:

> [...]
>
> I have tried your suggestion, setting the threshold to
> "batch + pcp->high_min / 2".  Scores for different benchmarks on the same
> machine (2-node AMD server with 384 vCPUs each) are as follows:
>
> [...]

Thanks a lot for test and results!

It looks good to me.  Can you submit a formal patch?

---
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH] mm: pcp: scale batch to reduce number of high order pcp flushes on deallocation
  2025-04-07  7:38 ` Huang, Ying
@ 2025-04-07 11:03 ` Nikhil Dhama
  0 siblings, 0 replies; 8+ messages in thread
From: Nikhil Dhama @ 2025-04-07 11:03 UTC (permalink / raw)
To: Huang, Ying, Nikhil Dhama
Cc: akpm, bharata, huang.ying.caritas, linux-kernel, linux-mm, mgorman,
    raghavendra.kodsarathimmappa, oe-lkp, lkp

On 4/7/2025 1:08 PM, Huang, Ying wrote:

> [...]
>
> Thanks a lot for test and results!
>
> It looks good to me.  Can you submit a formal patch?

Thank you, Huang Ying.  Yes, I have submitted a formal patch with this.

Patch v3: https://lore.kernel.org/linux-mm/20250407105219.55351-1-nikhil.dhama@amd.com/

---
Thanks,
Nikhil Dhama

^ permalink raw reply	[flat|nested] 8+ messages in thread
end of thread, other threads:[~2025-04-07 11:03 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-03-25 17:19 [PATCH] mm: pcp: scale batch to reduce number of high order pcp flushes on deallocation Nikhil Dhama
2025-03-30  6:52 ` Huang, Ying
2025-03-31 14:10 ` kernel test robot
2025-04-01 13:56 ` Nikhil Dhama
2025-04-03  1:36 ` Huang, Ying
2025-04-07  6:32 ` Nikhil Dhama
2025-04-07  7:38 ` Huang, Ying
2025-04-07 11:03 ` Nikhil Dhama
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox