[linus:master] [mm] d2136d749d: vm-scalability.throughput -7.1% regression

linux-mm.kvack.org archive mirror

* [linus:master] [mm]  d2136d749d:  vm-scalability.throughput -7.1% regression
@ 2024-06-20  2:39 kernel test robot
  2024-06-20  6:07 ` Baolin Wang
  0 siblings, 1 reply; 5+ messages in thread
From: kernel test robot @ 2024-06-20  2:39 UTC
  To: Baolin Wang
  Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, Huang, Ying,
	David Hildenbrand, John Hubbard, Kefeng Wang, Mel Gorman,
	Ryan Roberts, linux-mm, feng.tang, fengwei.yin, oliver.sang



Hello,

kernel test robot noticed a -7.1% regression of vm-scalability.throughput on:


commit: d2136d749d76af980b3accd72704eea4eab625bd ("mm: support multi-size THP numa balancing")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

[still regression on linus/master 92e5605a199efbaee59fb19e15d6cc2103a04ec2]


testcase: vm-scalability
test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
parameters:

	runtime: 300s
	size: 512G
	test: anon-cow-rand-hugetlb
	cpufreq_governor: performance



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202406201010.a1344783-oliver.sang@intel.com


Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240620/202406201010.a1344783-oliver.sang@intel.com

=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
  gcc-13/performance/x86_64-rhel-8.3/debian-12-x86_64-20240206.cgz/300s/512G/lkp-icl-2sp2/anon-cow-rand-hugetlb/vm-scalability

commit: 
  6b0ed7b3c7 ("mm: factor out the numa mapping rebuilding into a new helper")
  d2136d749d ("mm: support multi-size THP numa balancing")

6b0ed7b3c77547d2 d2136d749d76af980b3accd7270 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
     12.02            -1.3       10.72 ±  4%  mpstat.cpu.all.sys%
   1228757            +3.0%    1265679        proc-vmstat.pgfault
   7392513            -7.1%    6865649        vm-scalability.throughput
     17356            +9.4%      18986        vm-scalability.time.user_time
      0.32 ± 22%     -36.9%       0.20 ± 17%  sched_debug.cfs_rq:/.h_nr_running.stddev
     28657 ± 86%     -90.8%       2640 ± 19%  sched_debug.cfs_rq:/.load.stddev
      0.28 ± 35%     -52.1%       0.13 ± 29%  sched_debug.cfs_rq:/.nr_running.stddev
    299.88 ± 27%     -39.6%     181.04 ± 23%  sched_debug.cfs_rq:/.runnable_avg.stddev
    284.88 ± 32%     -44.0%     159.65 ± 27%  sched_debug.cfs_rq:/.util_avg.stddev
      0.32 ± 22%     -37.2%       0.20 ± 17%  sched_debug.cpu.nr_running.stddev
 1.584e+10 ±  2%      -6.9%  1.476e+10 ±  3%  perf-stat.i.branch-instructions
  11673151 ±  3%      -6.3%   10935072 ±  4%  perf-stat.i.branch-misses
      4.90            +3.5%       5.07        perf-stat.i.cpi
    333.40            +7.5%     358.32        perf-stat.i.cycles-between-cache-misses
 6.787e+10 ±  2%      -6.8%  6.324e+10 ±  3%  perf-stat.i.instructions
      0.25            -6.2%       0.24        perf-stat.i.ipc
      4.19            +7.5%       4.51        perf-stat.overall.cpi
    323.02            +7.4%     346.94        perf-stat.overall.cycles-between-cache-misses
      0.24            -7.0%       0.22        perf-stat.overall.ipc
 1.549e+10 ±  2%      -6.8%  1.444e+10 ±  3%  perf-stat.ps.branch-instructions
 6.634e+10 ±  2%      -6.7%  6.186e+10 ±  3%  perf-stat.ps.instructions
     17.33 ± 77%     -10.6        6.72 ±169%  perf-profile.calltrace.cycles-pp.asm_exc_page_fault.do_access
     17.30 ± 77%     -10.6        6.71 ±169%  perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault.do_access
     17.30 ± 77%     -10.6        6.71 ±169%  perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
     17.28 ± 77%     -10.6        6.70 ±169%  perf-profile.calltrace.cycles-pp.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
     17.27 ± 77%     -10.6        6.70 ±169%  perf-profile.calltrace.cycles-pp.hugetlb_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
     13.65 ± 76%      -8.4        5.29 ±168%  perf-profile.calltrace.cycles-pp.hugetlb_wp.hugetlb_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
     13.37 ± 76%      -8.2        5.18 ±168%  perf-profile.calltrace.cycles-pp.copy_user_large_folio.hugetlb_wp.hugetlb_fault.handle_mm_fault.do_user_addr_fault
     13.35 ± 76%      -8.2        5.18 ±168%  perf-profile.calltrace.cycles-pp.copy_subpage.copy_user_large_folio.hugetlb_wp.hugetlb_fault.handle_mm_fault
     13.23 ± 76%      -8.1        5.13 ±168%  perf-profile.calltrace.cycles-pp.copy_mc_enhanced_fast_string.copy_subpage.copy_user_large_folio.hugetlb_wp.hugetlb_fault
      3.59 ± 78%      -2.2        1.39 ±169%  perf-profile.calltrace.cycles-pp.__mutex_lock.hugetlb_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
     17.35 ± 77%     -10.6        6.73 ±169%  perf-profile.children.cycles-pp.asm_exc_page_fault
     17.32 ± 77%     -10.6        6.72 ±168%  perf-profile.children.cycles-pp.do_user_addr_fault
     17.32 ± 77%     -10.6        6.72 ±168%  perf-profile.children.cycles-pp.exc_page_fault
     17.30 ± 77%     -10.6        6.71 ±168%  perf-profile.children.cycles-pp.handle_mm_fault
     17.28 ± 77%     -10.6        6.70 ±169%  perf-profile.children.cycles-pp.hugetlb_fault
     13.65 ± 76%      -8.4        5.29 ±168%  perf-profile.children.cycles-pp.hugetlb_wp
     13.37 ± 76%      -8.2        5.18 ±168%  perf-profile.children.cycles-pp.copy_user_large_folio
     13.35 ± 76%      -8.2        5.18 ±168%  perf-profile.children.cycles-pp.copy_subpage
     13.34 ± 76%      -8.2        5.17 ±168%  perf-profile.children.cycles-pp.copy_mc_enhanced_fast_string
      3.59 ± 78%      -2.2        1.39 ±169%  perf-profile.children.cycles-pp.__mutex_lock
     13.24 ± 76%      -8.1        5.13 ±168%  perf-profile.self.cycles-pp.copy_mc_enhanced_fast_string
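
The %change column above can be recomputed from the two value columns. The
following is only a minimal sketch for checking the arithmetic; it is not part
of the LKP tooling, and the helper name pct_change is hypothetical.

```python
def pct_change(base, new):
    """Relative change of `new` versus `base`, in percent."""
    return (new - base) / base * 100.0


# Values copied from the comparison table above
# (left column: 6b0ed7b3c7, right column: d2136d749d).
throughput = pct_change(7392513, 6865649)
user_time = pct_change(17356, 18986)
pgfault = pct_change(1228757, 1265679)

print(f"vm-scalability.throughput     {throughput:+.1f}%")
print(f"vm-scalability.time.user_time {user_time:+.1f}%")
print(f"proc-vmstat.pgfault           {pgfault:+.1f}%")
```

Rows whose change column has no % sign (mpstat.cpu.all.sys% and the
perf-profile rows) appear to report an absolute difference in percentage
points instead, for example 10.72 - 12.02 = -1.3.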



Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki




* Re: [linus:master] [mm] d2136d749d: vm-scalability.throughput -7.1% regression
  2024-06-20  2:39 [linus:master] [mm] d2136d749d: vm-scalability.throughput -7.1% regression kernel test robot
@ 2024-06-20  6:07 ` Baolin Wang
  2024-06-20  7:38   ` Huang, Ying
  0 siblings, 1 reply; 5+ messages in thread
From: Baolin Wang @ 2024-06-20  6:07 UTC
  To: kernel test robot
  Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, Huang, Ying,
	David Hildenbrand, John Hubbard, Kefeng Wang, Mel Gorman,
	Ryan Roberts, linux-mm, feng.tang, fengwei.yin



On 2024/6/20 10:39, kernel test robot wrote:
> 
> 
> Hello,
> 
> kernel test robot noticed a -7.1% regression of vm-scalability.throughput on:
> 
> 
> commit: d2136d749d76af980b3accd72704eea4eab625bd ("mm: support multi-size THP numa balancing")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> 
> [still regression on linus/master 92e5605a199efbaee59fb19e15d6cc2103a04ec2]
> 
> 
> testcase: vm-scalability
> test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
> parameters:
> 
> 	runtime: 300s
> 	size: 512G
> 	test: anon-cow-rand-hugetlb
> 	cpufreq_governor: performance

Thanks for reporting. IIUC, NUMA balancing does not scan hugetlb VMAs, so
I'm not sure how this patch affects the performance of hugetlb CoW, but
let me try to reproduce it.


> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202406201010.a1344783-oliver.sang@intel.com
> 
> 
> Details are as below:
> -------------------------------------------------------------------------------------------------->
> 
> 
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20240620/202406201010.a1344783-oliver.sang@intel.com
> 
> =========================================================================================
> compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
>    gcc-13/performance/x86_64-rhel-8.3/debian-12-x86_64-20240206.cgz/300s/512G/lkp-icl-2sp2/anon-cow-rand-hugetlb/vm-scalability
> 
> commit:
>    6b0ed7b3c7 ("mm: factor out the numa mapping rebuilding into a new helper")
>    d2136d749d ("mm: support multi-size THP numa balancing")
> 
> 6b0ed7b3c77547d2 d2136d749d76af980b3accd7270
> ---------------- ---------------------------
>           %stddev     %change         %stddev
>               \          |                \
>       12.02            -1.3       10.72 ±  4%  mpstat.cpu.all.sys%
>     1228757            +3.0%    1265679        proc-vmstat.pgfault
>     7392513            -7.1%    6865649        vm-scalability.throughput
>       17356            +9.4%      18986        vm-scalability.time.user_time
>        0.32 ± 22%     -36.9%       0.20 ± 17%  sched_debug.cfs_rq:/.h_nr_running.stddev
>       28657 ± 86%     -90.8%       2640 ± 19%  sched_debug.cfs_rq:/.load.stddev
>        0.28 ± 35%     -52.1%       0.13 ± 29%  sched_debug.cfs_rq:/.nr_running.stddev
>      299.88 ± 27%     -39.6%     181.04 ± 23%  sched_debug.cfs_rq:/.runnable_avg.stddev
>      284.88 ± 32%     -44.0%     159.65 ± 27%  sched_debug.cfs_rq:/.util_avg.stddev
>        0.32 ± 22%     -37.2%       0.20 ± 17%  sched_debug.cpu.nr_running.stddev
>   1.584e+10 ±  2%      -6.9%  1.476e+10 ±  3%  perf-stat.i.branch-instructions
>    11673151 ±  3%      -6.3%   10935072 ±  4%  perf-stat.i.branch-misses
>        4.90            +3.5%       5.07        perf-stat.i.cpi
>      333.40            +7.5%     358.32        perf-stat.i.cycles-between-cache-misses
>   6.787e+10 ±  2%      -6.8%  6.324e+10 ±  3%  perf-stat.i.instructions
>        0.25            -6.2%       0.24        perf-stat.i.ipc
>        4.19            +7.5%       4.51        perf-stat.overall.cpi
>      323.02            +7.4%     346.94        perf-stat.overall.cycles-between-cache-misses
>        0.24            -7.0%       0.22        perf-stat.overall.ipc
>   1.549e+10 ±  2%      -6.8%  1.444e+10 ±  3%  perf-stat.ps.branch-instructions
>   6.634e+10 ±  2%      -6.7%  6.186e+10 ±  3%  perf-stat.ps.instructions
>       17.33 ± 77%     -10.6        6.72 ±169%  perf-profile.calltrace.cycles-pp.asm_exc_page_fault.do_access
>       17.30 ± 77%     -10.6        6.71 ±169%  perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault.do_access
>       17.30 ± 77%     -10.6        6.71 ±169%  perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
>       17.28 ± 77%     -10.6        6.70 ±169%  perf-profile.calltrace.cycles-pp.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
>       17.27 ± 77%     -10.6        6.70 ±169%  perf-profile.calltrace.cycles-pp.hugetlb_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
>       13.65 ± 76%      -8.4        5.29 ±168%  perf-profile.calltrace.cycles-pp.hugetlb_wp.hugetlb_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
>       13.37 ± 76%      -8.2        5.18 ±168%  perf-profile.calltrace.cycles-pp.copy_user_large_folio.hugetlb_wp.hugetlb_fault.handle_mm_fault.do_user_addr_fault
>       13.35 ± 76%      -8.2        5.18 ±168%  perf-profile.calltrace.cycles-pp.copy_subpage.copy_user_large_folio.hugetlb_wp.hugetlb_fault.handle_mm_fault
>       13.23 ± 76%      -8.1        5.13 ±168%  perf-profile.calltrace.cycles-pp.copy_mc_enhanced_fast_string.copy_subpage.copy_user_large_folio.hugetlb_wp.hugetlb_fault
>        3.59 ± 78%      -2.2        1.39 ±169%  perf-profile.calltrace.cycles-pp.__mutex_lock.hugetlb_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
>       17.35 ± 77%     -10.6        6.73 ±169%  perf-profile.children.cycles-pp.asm_exc_page_fault
>       17.32 ± 77%     -10.6        6.72 ±168%  perf-profile.children.cycles-pp.do_user_addr_fault
>       17.32 ± 77%     -10.6        6.72 ±168%  perf-profile.children.cycles-pp.exc_page_fault
>       17.30 ± 77%     -10.6        6.71 ±168%  perf-profile.children.cycles-pp.handle_mm_fault
>       17.28 ± 77%     -10.6        6.70 ±169%  perf-profile.children.cycles-pp.hugetlb_fault
>       13.65 ± 76%      -8.4        5.29 ±168%  perf-profile.children.cycles-pp.hugetlb_wp
>       13.37 ± 76%      -8.2        5.18 ±168%  perf-profile.children.cycles-pp.copy_user_large_folio
>       13.35 ± 76%      -8.2        5.18 ±168%  perf-profile.children.cycles-pp.copy_subpage
>       13.34 ± 76%      -8.2        5.17 ±168%  perf-profile.children.cycles-pp.copy_mc_enhanced_fast_string
>        3.59 ± 78%      -2.2        1.39 ±169%  perf-profile.children.cycles-pp.__mutex_lock
>       13.24 ± 76%      -8.1        5.13 ±168%  perf-profile.self.cycles-pp.copy_mc_enhanced_fast_string
> 
> 
> 
> Disclaimer:
> Results have been estimated based on internal Intel analysis and are provided
> for informational purposes only. Any difference in system hardware or software
> design or configuration may affect actual performance.
> 
> 



* Re: [linus:master] [mm] d2136d749d: vm-scalability.throughput -7.1% regression
  2024-06-20  6:07 ` Baolin Wang
@ 2024-06-20  7:38   ` Huang, Ying
  2024-06-20  8:44     ` Baolin Wang
  2024-06-20 11:13     ` Baolin Wang
  0 siblings, 2 replies; 5+ messages in thread
From: Huang, Ying @ 2024-06-20  7:38 UTC
  To: Baolin Wang
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Andrew Morton,
	David Hildenbrand, John Hubbard, Kefeng Wang, Mel Gorman,
	Ryan Roberts, linux-mm, feng.tang, fengwei.yin

Baolin Wang <baolin.wang@linux.alibaba.com> writes:

> On 2024/6/20 10:39, kernel test robot wrote:
>> Hello,
>> kernel test robot noticed a -7.1% regression of
>> vm-scalability.throughput on:
>> commit: d2136d749d76af980b3accd72704eea4eab625bd ("mm: support
>> multi-size THP numa balancing")
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>> [still regression on linus/master
>> 92e5605a199efbaee59fb19e15d6cc2103a04ec2]
>> testcase: vm-scalability
>> test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
>> parameters:
>> 	runtime: 300s
>> 	size: 512G
>> 	test: anon-cow-rand-hugetlb
>> 	cpufreq_governor: performance
>
> Thanks for reporting. IIUC numa balancing will not scan hugetlb VMA,
> I'm not sure how this patch affects the performance of hugetlb cow,
> but let me try to reproduce it.
>
>
>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>> the same patch/commit), kindly add following tags
>> | Reported-by: kernel test robot <oliver.sang@intel.com>
>> | Closes: https://lore.kernel.org/oe-lkp/202406201010.a1344783-oliver.sang@intel.com
>> Details are as below:
>> -------------------------------------------------------------------------------------------------->
>> The kernel config and materials to reproduce are available at:
>> https://download.01.org/0day-ci/archive/20240620/202406201010.a1344783-oliver.sang@intel.com
>> =========================================================================================
>> compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
>>    gcc-13/performance/x86_64-rhel-8.3/debian-12-x86_64-20240206.cgz/300s/512G/lkp-icl-2sp2/anon-cow-rand-hugetlb/vm-scalability
>> commit:
>>    6b0ed7b3c7 ("mm: factor out the numa mapping rebuilding into a new helper")
>>    d2136d749d ("mm: support multi-size THP numa balancing")
>> 6b0ed7b3c77547d2 d2136d749d76af980b3accd7270
>> ---------------- ---------------------------
>>           %stddev     %change         %stddev
>>               \          |                \
>>       12.02            -1.3       10.72 ±  4%  mpstat.cpu.all.sys%
>>     1228757            +3.0%    1265679        proc-vmstat.pgfault

Also from other proc-vmstat stats,

     21770 ± 36%      +6.1%      23098 ± 28%  proc-vmstat.numa_hint_faults
      6168 ±107%     +48.8%       9180 ± 18%  proc-vmstat.numa_hint_faults_local
    154537 ± 15%     +23.5%     190883 ± 17%  proc-vmstat.numa_pte_updates

After your patch, more hint page faults occur; I think this is expected.

Then, tasks may be moved between sockets because of that, so that some
hugetlb page accesses become remote?
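
The counters quoted above come from /proc/vmstat. As a rough sketch of how
they could be sampled on a test machine (assuming a Linux kernel built with
NUMA balancing, otherwise the counters are absent), the snippet below parses
the file and reports the share of hint faults that were local. The function
names parse_vmstat and local_fault_ratio are illustrative, not taken from any
existing tool.

```python
from pathlib import Path

# Counters discussed in the thread, plus numa_pages_migrated for context.
NUMA_KEYS = (
    "numa_pte_updates",
    "numa_hint_faults",
    "numa_hint_faults_local",
    "numa_pages_migrated",
)


def parse_vmstat(text):
    """Return the NUMA balancing counters found in /proc/vmstat-style text."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] in NUMA_KEYS:
            counters[fields[0]] = int(fields[1])
    return counters


def local_fault_ratio(counters):
    """Fraction of hint faults that hit the local node (0.0 if none)."""
    total = counters.get("numa_hint_faults", 0)
    if total == 0:
        return 0.0
    return counters.get("numa_hint_faults_local", 0) / total


if __name__ == "__main__":
    vmstat = Path("/proc/vmstat")
    if vmstat.exists():  # Linux only
        counters = parse_vmstat(vmstat.read_text())
        for name, value in sorted(counters.items()):
            print(f"{name:28s} {value}")
        print(f"local fault ratio: {local_fault_ratio(counters):.2f}")
```

The counters are cumulative since boot, so sampling once before and once
after a benchmark run and subtracting gives the per-run numbers.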

>>     7392513            -7.1%    6865649        vm-scalability.throughput
>>       17356            +9.4%      18986        vm-scalability.time.user_time
>>        0.32 ± 22%     -36.9%       0.20 ± 17%  sched_debug.cfs_rq:/.h_nr_running.stddev
>>       28657 ± 86%     -90.8%       2640 ± 19%  sched_debug.cfs_rq:/.load.stddev
>>        0.28 ± 35%     -52.1%       0.13 ± 29%  sched_debug.cfs_rq:/.nr_running.stddev
>>      299.88 ± 27%     -39.6%     181.04 ± 23%  sched_debug.cfs_rq:/.runnable_avg.stddev
>>      284.88 ± 32%     -44.0%     159.65 ± 27%  sched_debug.cfs_rq:/.util_avg.stddev
>>        0.32 ± 22%     -37.2%       0.20 ± 17%  sched_debug.cpu.nr_running.stddev
>>   1.584e+10 ±  2%      -6.9%  1.476e+10 ±  3%  perf-stat.i.branch-instructions
>>    11673151 ±  3%      -6.3%   10935072 ±  4%  perf-stat.i.branch-misses
>>        4.90            +3.5%       5.07        perf-stat.i.cpi
>>      333.40            +7.5%     358.32        perf-stat.i.cycles-between-cache-misses
>>   6.787e+10 ±  2%      -6.8%  6.324e+10 ±  3%  perf-stat.i.instructions
>>        0.25            -6.2%       0.24        perf-stat.i.ipc
>>        4.19            +7.5%       4.51        perf-stat.overall.cpi
>>      323.02            +7.4%     346.94        perf-stat.overall.cycles-between-cache-misses
>>        0.24            -7.0%       0.22        perf-stat.overall.ipc
>>   1.549e+10 ±  2%      -6.8%  1.444e+10 ±  3%  perf-stat.ps.branch-instructions
>>   6.634e+10 ±  2%      -6.7%  6.186e+10 ±  3%  perf-stat.ps.instructions
>>       17.33 ± 77%     -10.6        6.72 ±169%  perf-profile.calltrace.cycles-pp.asm_exc_page_fault.do_access
>>       17.30 ± 77%     -10.6        6.71 ±169%  perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault.do_access
>>       17.30 ± 77%     -10.6        6.71 ±169%  perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
>>       17.28 ± 77%     -10.6        6.70 ±169%  perf-profile.calltrace.cycles-pp.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
>>       17.27 ± 77%     -10.6        6.70 ±169%  perf-profile.calltrace.cycles-pp.hugetlb_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
>>       13.65 ± 76%      -8.4        5.29 ±168%  perf-profile.calltrace.cycles-pp.hugetlb_wp.hugetlb_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
>>       13.37 ± 76%      -8.2        5.18 ±168%  perf-profile.calltrace.cycles-pp.copy_user_large_folio.hugetlb_wp.hugetlb_fault.handle_mm_fault.do_user_addr_fault
>>       13.35 ± 76%      -8.2        5.18 ±168%  perf-profile.calltrace.cycles-pp.copy_subpage.copy_user_large_folio.hugetlb_wp.hugetlb_fault.handle_mm_fault
>>       13.23 ± 76%      -8.1        5.13 ±168%  perf-profile.calltrace.cycles-pp.copy_mc_enhanced_fast_string.copy_subpage.copy_user_large_folio.hugetlb_wp.hugetlb_fault
>>        3.59 ± 78%      -2.2        1.39 ±169%  perf-profile.calltrace.cycles-pp.__mutex_lock.hugetlb_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
>>       17.35 ± 77%     -10.6        6.73 ±169%  perf-profile.children.cycles-pp.asm_exc_page_fault
>>       17.32 ± 77%     -10.6        6.72 ±168%  perf-profile.children.cycles-pp.do_user_addr_fault
>>       17.32 ± 77%     -10.6        6.72 ±168%  perf-profile.children.cycles-pp.exc_page_fault
>>       17.30 ± 77%     -10.6        6.71 ±168%  perf-profile.children.cycles-pp.handle_mm_fault
>>       17.28 ± 77%     -10.6        6.70 ±169%  perf-profile.children.cycles-pp.hugetlb_fault
>>       13.65 ± 76%      -8.4        5.29 ±168%  perf-profile.children.cycles-pp.hugetlb_wp
>>       13.37 ± 76%      -8.2        5.18 ±168%  perf-profile.children.cycles-pp.copy_user_large_folio
>>       13.35 ± 76%      -8.2        5.18 ±168%  perf-profile.children.cycles-pp.copy_subpage
>>       13.34 ± 76%      -8.2        5.17 ±168%  perf-profile.children.cycles-pp.copy_mc_enhanced_fast_string
>>        3.59 ± 78%      -2.2        1.39 ±169%  perf-profile.children.cycles-pp.__mutex_lock
>>       13.24 ± 76%      -8.1        5.13 ±168%  perf-profile.self.cycles-pp.copy_mc_enhanced_fast_string
>> Disclaimer:
>> Results have been estimated based on internal Intel analysis and are provided
>> for informational purposes only. Any difference in system hardware or software
>> design or configuration may affect actual performance.
>> 

--
Best Regards,
Huang, Ying



* Re: [linus:master] [mm] d2136d749d: vm-scalability.throughput -7.1% regression
  2024-06-20  7:38   ` Huang, Ying
@ 2024-06-20  8:44     ` Baolin Wang
  2024-06-20 11:13     ` Baolin Wang
  1 sibling, 0 replies; 5+ messages in thread
From: Baolin Wang @ 2024-06-20  8:44 UTC
  To: Huang, Ying
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Andrew Morton,
	David Hildenbrand, John Hubbard, Kefeng Wang, Mel Gorman,
	Ryan Roberts, linux-mm, feng.tang, fengwei.yin



On 2024/6/20 15:38, Huang, Ying wrote:
> Baolin Wang <baolin.wang@linux.alibaba.com> writes:
> 
>> On 2024/6/20 10:39, kernel test robot wrote:
>>> Hello,
>>> kernel test robot noticed a -7.1% regression of
>>> vm-scalability.throughput on:
>>> commit: d2136d749d76af980b3accd72704eea4eab625bd ("mm: support
>>> multi-size THP numa balancing")
>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>> [still regression on linus/master
>>> 92e5605a199efbaee59fb19e15d6cc2103a04ec2]
>>> testcase: vm-scalability
>>> test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
>>> parameters:
>>> 	runtime: 300s
>>> 	size: 512G
>>> 	test: anon-cow-rand-hugetlb
>>> 	cpufreq_governor: performance
>>
>> Thanks for reporting. IIUC numa balancing will not scan hugetlb VMA,
>> I'm not sure how this patch affects the performance of hugetlb cow,
>> but let me try to reproduce it.
>>
>>
>>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>>> the same patch/commit), kindly add following tags
>>> | Reported-by: kernel test robot <oliver.sang@intel.com>
>>> | Closes: https://lore.kernel.org/oe-lkp/202406201010.a1344783-oliver.sang@intel.com
>>> Details are as below:
>>> -------------------------------------------------------------------------------------------------->
>>> The kernel config and materials to reproduce are available at:
>>> https://download.01.org/0day-ci/archive/20240620/202406201010.a1344783-oliver.sang@intel.com
>>> =========================================================================================
>>> compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
>>>     gcc-13/performance/x86_64-rhel-8.3/debian-12-x86_64-20240206.cgz/300s/512G/lkp-icl-2sp2/anon-cow-rand-hugetlb/vm-scalability
>>> commit:
>>>     6b0ed7b3c7 ("mm: factor out the numa mapping rebuilding into a new helper")
>>>     d2136d749d ("mm: support multi-size THP numa balancing")
>>> 6b0ed7b3c77547d2 d2136d749d76af980b3accd7270
>>> ---------------- ---------------------------
>>>            %stddev     %change         %stddev
>>>                \          |                \
>>>        12.02            -1.3       10.72 ±  4%  mpstat.cpu.all.sys%
>>>      1228757            +3.0%    1265679        proc-vmstat.pgfault
> 
> Also from other proc-vmstat stats,
> 
>       21770 ± 36%      +6.1%      23098 ± 28%  proc-vmstat.numa_hint_faults
>        6168 ±107%     +48.8%       9180 ± 18%  proc-vmstat.numa_hint_faults_local
>      154537 ± 15%     +23.5%     190883 ± 17%  proc-vmstat.numa_pte_updates
> 
> After your patch, more hint page faults occur; I think this is expected.

This is exactly what confuses me: why are there more NUMA hint faults? The 
hugetlb VMAs are skipped during scanning, so do other VMAs of the 
application use mTHP or large folios?

> Then, tasks may be moved between sockets because of that, so that some
> hugetlb page accesses become remote?

Yes, that is possible if the application uses some large folios.

>>>      7392513            -7.1%    6865649        vm-scalability.throughput
>>>        17356            +9.4%      18986        vm-scalability.time.user_time
>>>         0.32 ± 22%     -36.9%       0.20 ± 17%  sched_debug.cfs_rq:/.h_nr_running.stddev
>>>        28657 ± 86%     -90.8%       2640 ± 19%  sched_debug.cfs_rq:/.load.stddev
>>>         0.28 ± 35%     -52.1%       0.13 ± 29%  sched_debug.cfs_rq:/.nr_running.stddev
>>>       299.88 ± 27%     -39.6%     181.04 ± 23%  sched_debug.cfs_rq:/.runnable_avg.stddev
>>>       284.88 ± 32%     -44.0%     159.65 ± 27%  sched_debug.cfs_rq:/.util_avg.stddev
>>>         0.32 ± 22%     -37.2%       0.20 ± 17%  sched_debug.cpu.nr_running.stddev
>>>    1.584e+10 ±  2%      -6.9%  1.476e+10 ±  3%  perf-stat.i.branch-instructions
>>>     11673151 ±  3%      -6.3%   10935072 ±  4%  perf-stat.i.branch-misses
>>>         4.90            +3.5%       5.07        perf-stat.i.cpi
>>>       333.40            +7.5%     358.32        perf-stat.i.cycles-between-cache-misses
>>>    6.787e+10 ±  2%      -6.8%  6.324e+10 ±  3%  perf-stat.i.instructions
>>>         0.25            -6.2%       0.24        perf-stat.i.ipc
>>>         4.19            +7.5%       4.51        perf-stat.overall.cpi
>>>       323.02            +7.4%     346.94        perf-stat.overall.cycles-between-cache-misses
>>>         0.24            -7.0%       0.22        perf-stat.overall.ipc
>>>    1.549e+10 ±  2%      -6.8%  1.444e+10 ±  3%  perf-stat.ps.branch-instructions
>>>    6.634e+10 ±  2%      -6.7%  6.186e+10 ±  3%  perf-stat.ps.instructions
>>>        17.33 ± 77%     -10.6        6.72 ±169%  perf-profile.calltrace.cycles-pp.asm_exc_page_fault.do_access
>>>        17.30 ± 77%     -10.6        6.71 ±169%  perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault.do_access
>>>        17.30 ± 77%     -10.6        6.71 ±169%  perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
>>>        17.28 ± 77%     -10.6        6.70 ±169%  perf-profile.calltrace.cycles-pp.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.do_access
>>>        17.27 ± 77%     -10.6        6.70 ±169%  perf-profile.calltrace.cycles-pp.hugetlb_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
>>>        13.65 ± 76%      -8.4        5.29 ±168%  perf-profile.calltrace.cycles-pp.hugetlb_wp.hugetlb_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
>>>        13.37 ± 76%      -8.2        5.18 ±168%  perf-profile.calltrace.cycles-pp.copy_user_large_folio.hugetlb_wp.hugetlb_fault.handle_mm_fault.do_user_addr_fault
>>>        13.35 ± 76%      -8.2        5.18 ±168%  perf-profile.calltrace.cycles-pp.copy_subpage.copy_user_large_folio.hugetlb_wp.hugetlb_fault.handle_mm_fault
>>>        13.23 ± 76%      -8.1        5.13 ±168%  perf-profile.calltrace.cycles-pp.copy_mc_enhanced_fast_string.copy_subpage.copy_user_large_folio.hugetlb_wp.hugetlb_fault
>>>         3.59 ± 78%      -2.2        1.39 ±169%  perf-profile.calltrace.cycles-pp.__mutex_lock.hugetlb_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
>>>        17.35 ± 77%     -10.6        6.73 ±169%  perf-profile.children.cycles-pp.asm_exc_page_fault
>>>        17.32 ± 77%     -10.6        6.72 ±168%  perf-profile.children.cycles-pp.do_user_addr_fault
>>>        17.32 ± 77%     -10.6        6.72 ±168%  perf-profile.children.cycles-pp.exc_page_fault
>>>        17.30 ± 77%     -10.6        6.71 ±168%  perf-profile.children.cycles-pp.handle_mm_fault
>>>        17.28 ± 77%     -10.6        6.70 ±169%  perf-profile.children.cycles-pp.hugetlb_fault
>>>        13.65 ± 76%      -8.4        5.29 ±168%  perf-profile.children.cycles-pp.hugetlb_wp
>>>        13.37 ± 76%      -8.2        5.18 ±168%  perf-profile.children.cycles-pp.copy_user_large_folio
>>>        13.35 ± 76%      -8.2        5.18 ±168%  perf-profile.children.cycles-pp.copy_subpage
>>>        13.34 ± 76%      -8.2        5.17 ±168%  perf-profile.children.cycles-pp.copy_mc_enhanced_fast_string
>>>         3.59 ± 78%      -2.2        1.39 ±169%  perf-profile.children.cycles-pp.__mutex_lock
>>>        13.24 ± 76%      -8.1        5.13 ±168%  perf-profile.self.cycles-pp.copy_mc_enhanced_fast_string
>>> Disclaimer:
>>> Results have been estimated based on internal Intel analysis and are provided
>>> for informational purposes only. Any difference in system hardware or software
>>> design or configuration may affect actual performance.
>>>
> 
> --
> Best Regards,
> Huang, Ying



* Re: [linus:master] [mm] d2136d749d: vm-scalability.throughput -7.1% regression
  2024-06-20  7:38   ` Huang, Ying
  2024-06-20  8:44     ` Baolin Wang
@ 2024-06-20 11:13     ` Baolin Wang
  1 sibling, 0 replies; 5+ messages in thread
From: Baolin Wang @ 2024-06-20 11:13 UTC
  To: Huang, Ying
  Cc: kernel test robot, oe-lkp, lkp, linux-kernel, Andrew Morton,
	David Hildenbrand, John Hubbard, Kefeng Wang, Mel Gorman,
	Ryan Roberts, linux-mm, feng.tang, fengwei.yin



On 2024/6/20 15:38, Huang, Ying wrote:
> Baolin Wang <baolin.wang@linux.alibaba.com> writes:
> 
>> On 2024/6/20 10:39, kernel test robot wrote:
>>> Hello,
>>> kernel test robot noticed a -7.1% regression of
>>> vm-scalability.throughput on:
>>> commit: d2136d749d76af980b3accd72704eea4eab625bd ("mm: support
>>> multi-size THP numa balancing")
>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>> [still regression on linus/master
>>> 92e5605a199efbaee59fb19e15d6cc2103a04ec2]
>>> testcase: vm-scalability
>>> test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
>>> parameters:
>>> 	runtime: 300s
>>> 	size: 512G
>>> 	test: anon-cow-rand-hugetlb
>>> 	cpufreq_governor: performance
>>
>> Thanks for reporting. IIUC numa balancing will not scan hugetlb VMA,
>> I'm not sure how this patch affects the performance of hugetlb cow,
>> but let me try to reproduce it.
>>
>>
>>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>>> the same patch/commit), kindly add following tags
>>> | Reported-by: kernel test robot <oliver.sang@intel.com>
>>> | Closes: https://lore.kernel.org/oe-lkp/202406201010.a1344783-oliver.sang@intel.com
>>> Details are as below:
>>> -------------------------------------------------------------------------------------------------->
>>> The kernel config and materials to reproduce are available at:
>>> https://download.01.org/0day-ci/archive/20240620/202406201010.a1344783-oliver.sang@intel.com
>>> =========================================================================================
>>> compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
>>>     gcc-13/performance/x86_64-rhel-8.3/debian-12-x86_64-20240206.cgz/300s/512G/lkp-icl-2sp2/anon-cow-rand-hugetlb/vm-scalability
>>> commit:
>>>     6b0ed7b3c7 ("mm: factor out the numa mapping rebuilding into a new helper")
>>>     d2136d749d ("mm: support multi-size THP numa balancing")
>>> 6b0ed7b3c77547d2 d2136d749d76af980b3accd7270
>>> ---------------- ---------------------------
>>>            %stddev     %change         %stddev
>>>                \          |                \
>>>        12.02            -1.3       10.72 ±  4%  mpstat.cpu.all.sys%
>>>      1228757            +3.0%    1265679        proc-vmstat.pgfault
> 
> Also from other proc-vmstat stats,
> 
>       21770 ± 36%      +6.1%      23098 ± 28%  proc-vmstat.numa_hint_faults
>        6168 ±107%     +48.8%       9180 ± 18%  proc-vmstat.numa_hint_faults_local
>      154537 ± 15%     +23.5%     190883 ± 17%  proc-vmstat.numa_pte_updates
> 
> After your patch, more hint page faults occur; I think this is expected.
> 
> Then, tasks may be moved between sockets because of that, so that some
> hugetlb page accesses become remote?

After trying to reproduce this case, I also find that more hint page 
faults occur. I think that is caused by changing 
"folio_ref_count(folio) != 1" to "folio_likely_mapped_shared(folio)", 
which results in scanning more exclusive pages, so I think this is 
expected, in line with the previous discussion.

Yes, I think your analysis is correct: some hugetlb page accesses become 
remote due to task migration.
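
To illustrate why swapping the condition can let more exclusive pages be
scanned, here is a toy model in Python. It is not kernel code: ToyFolio, its
fields, and both predicates are hypothetical stand-ins, and the real
folio_likely_mapped_shared() is a heuristic that this sketch reduces to
"mapped more than once". The assumption being illustrated is that an
exclusively mapped folio can still carry an extra, transient reference.

```python
from dataclasses import dataclass


@dataclass
class ToyFolio:
    name: str
    ref_count: int  # all references, including transient ones (assumption)
    map_count: int  # number of page-table mappings (assumption)


def skipped_before(folio):
    """Toy version of the old condition: folio_ref_count(folio) != 1."""
    return folio.ref_count != 1


def skipped_after(folio):
    """Toy stand-in for the new condition: folio_likely_mapped_shared(folio)."""
    return folio.map_count > 1


folios = [
    ToyFolio("exclusive, no extra refs", ref_count=1, map_count=1),
    ToyFolio("exclusive, one transient ref", ref_count=2, map_count=1),
    ToyFolio("shared by two mappings", ref_count=2, map_count=2),
]

for f in folios:
    before = "skip" if skipped_before(f) else "scan"
    after = "skip" if skipped_after(f) else "scan"
    print(f"{f.name:30s} before: {before}  after: {after}")
```

In this model only the middle case changes: it was skipped under the old
condition and is scanned under the new one, which matches the direction of
the increase in hint faults described above.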



