[linus:master] [mm/rmap] 6af8cb80d3: vm-scalability.throughput 7.8% regression
From: kernel test robot @ 2025-04-16 7:01 UTC
To: David Hildenbrand
Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, Andy Lutomirks^H^Hski,
Borislav Betkov, Dave Hansen, Ingo Molnar, Jann Horn,
Johannes Weiner, Jonathan Corbet, Kirill A. Shutemov, Lance Yang,
Liam Howlett, Lorenzo Stoakes, Matthew Wilcow, Michal Koutn,
Muchun Song, tejun heo, Thomas Gleixner, Vlastimil Babka,
Zefan Li, linux-mm, oliver.sang
Hello,
kernel test robot noticed a 7.8% regression of vm-scalability.throughput on:
commit: 6af8cb80d3a9a6bbd521d8a7c949b4eafb7dba5d ("mm/rmap: basic MM owner tracking for large folios (!hugetlb)")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
testcase: vm-scalability
config: x86_64-rhel-9.4
compiler: gcc-12
test machine: 256 threads 2 sockets GENUINE INTEL(R) XEON(R) (Sierra Forest) with 128G memory
parameters:
runtime: 300s
size: 8T
test: anon-cow-seq
cpufreq_governor: performance
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags:
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202504152235.188dcce9-lkp@intel.com
Details are as below:
-------------------------------------------------------------------------------------------------->
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250415/202504152235.188dcce9-lkp@intel.com
=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
gcc-12/performance/x86_64-rhel-9.4/debian-12-x86_64-20240206.cgz/300s/8T/lkp-srf-2sp1/anon-cow-seq/vm-scalability
commit:
448854478a ("mm/rmap: use folio_large_nr_pages() in add/remove functions")
6af8cb80d3 ("mm/rmap: basic MM owner tracking for large folios (!hugetlb)")
448854478ab2f477 6af8cb80d3a9a6bbd521d8a7c94
---------------- ---------------------------
value ±%stddev      %change      value ±%stddev      metric
53400 +12.6% 60113 uptime.idle
1.457e+10 ± 3% +45.8% 2.125e+10 ± 3% cpuidle..time
9999188 ± 2% +41.4% 14142800 ± 2% cpuidle..usage
19401 ± 7% -21.8% 15176 ± 15% perf-c2c.DRAM.local
42430 ± 4% +51.6% 64345 ± 12% perf-c2c.HITM.local
52676 ± 6% +44.2% 75955 ± 11% perf-c2c.HITM.total
8.894e+08 -15.0% 7.557e+08 ± 7% numa-numastat.node0.local_node
8.895e+08 -15.0% 7.559e+08 ± 7% numa-numastat.node0.numa_hit
8.879e+08 -18.0% 7.284e+08 ± 8% numa-numastat.node1.local_node
8.88e+08 -18.0% 7.285e+08 ± 8% numa-numastat.node1.numa_hit
15.74 ± 4% +54.4% 24.31 ± 4% vmstat.cpu.id
214.36 -12.0% 188.68 vmstat.procs.r
9162 -6.0% 8609 ± 2% vmstat.system.cs
359131 -3.1% 347966 vmstat.system.in
15.07 ± 4% +8.8 23.85 ± 4% mpstat.cpu.all.idle%
0.67 -0.1 0.57 mpstat.cpu.all.irq%
0.17 ± 7% -0.0 0.14 ± 5% mpstat.cpu.all.soft%
75.25 -7.8 67.47 mpstat.cpu.all.sys%
4.42 -0.9 3.52 mpstat.cpu.all.usr%
30105953 -40.3% 17974636 ± 2% meminfo.Active
30105953 -40.3% 17974636 ± 2% meminfo.Active(anon)
28734383 -40.0% 17245061 ± 2% meminfo.AnonPages
88649090 +12.1% 99375037 meminfo.MemAvailable
89179467 +12.0% 99904712 meminfo.MemFree
42488096 -25.2% 31762847 meminfo.Memused
73422718 -16.1% 61622071 ± 5% meminfo.max_used_kB
14976533 ± 2% -40.8% 8859130 ± 2% numa-meminfo.node0.Active
14976533 ± 2% -40.8% 8859130 ± 2% numa-meminfo.node0.Active(anon)
14404444 ± 2% -40.1% 8625820 ± 2% numa-meminfo.node0.AnonPages
2961490 ± 40% -58.4% 1231139 ±129% numa-meminfo.node0.FilePages
43362532 ± 2% +16.4% 50494564 ± 3% numa-meminfo.node0.MemFree
22336223 ± 5% -31.9% 15204191 ± 11% numa-meminfo.node0.MemUsed
15156929 -39.9% 9115542 ± 3% numa-meminfo.node1.Active
15156929 -39.9% 9115542 ± 3% numa-meminfo.node1.Active(anon)
14334430 -39.9% 8615995 ± 3% numa-meminfo.node1.AnonPages
31426803 ± 4% -6.1% 29506154 ± 4% numa-meminfo.node1.AnonPages.max
20191977 ± 6% -18.0% 16564588 ± 8% numa-meminfo.node1.MemUsed
138520 -22.3% 107667 vm-scalability.median
3.14 ± 3% +6.4 9.57 ± 9% vm-scalability.median_stddev%
2.35 ± 8% +12.7 15.05 ± 7% vm-scalability.stddev%
36255698 -7.8% 33421568 vm-scalability.throughput
589253 ± 2% -36.1% 376589 ± 5% vm-scalability.time.involuntary_context_switches
1.772e+09 -16.5% 1.48e+09 vm-scalability.time.minor_page_faults
20491 -10.6% 18325 vm-scalability.time.percent_of_cpu_this_job_got
58690 -9.8% 52967 vm-scalability.time.system_time
3314 -19.5% 2668 vm-scalability.time.user_time
423126 -25.5% 315143 ± 2% vm-scalability.time.voluntary_context_switches
7.968e+09 -16.5% 6.652e+09 vm-scalability.workload
3752122 ± 2% -40.9% 2218075 ± 3% numa-vmstat.node0.nr_active_anon
3604567 ± 2% -40.1% 2157950 ± 2% numa-vmstat.node0.nr_anon_pages
740400 ± 40% -58.4% 307780 ±129% numa-vmstat.node0.nr_file_pages
10838075 ± 2% +16.4% 12619573 ± 3% numa-vmstat.node0.nr_free_pages
3751375 ± 2% -40.9% 2217819 ± 3% numa-vmstat.node0.nr_zone_active_anon
8.895e+08 -15.0% 7.559e+08 ± 7% numa-vmstat.node0.numa_hit
8.894e+08 -15.0% 7.557e+08 ± 7% numa-vmstat.node0.numa_local
3814298 -40.7% 2260493 ± 2% numa-vmstat.node1.nr_active_anon
3602611 -40.7% 2135607 ± 2% numa-vmstat.node1.nr_anon_pages
3813553 -40.7% 2260182 ± 2% numa-vmstat.node1.nr_zone_active_anon
8.88e+08 -18.0% 7.285e+08 ± 8% numa-vmstat.node1.numa_hit
8.879e+08 -18.0% 7.284e+08 ± 8% numa-vmstat.node1.numa_local
29828859 ± 10% -25.9% 22114899 ± 12% sched_debug.cfs_rq:/.avg_vruntime.avg
31729867 ± 10% -23.1% 24394442 ± 12% sched_debug.cfs_rq:/.avg_vruntime.max
27439158 ± 11% -30.2% 19146267 ± 13% sched_debug.cfs_rq:/.avg_vruntime.min
1805955 ± 23% -31.8% 1231121 ± 22% sched_debug.cfs_rq:/.left_deadline.stddev
1805948 ± 23% -31.8% 1231114 ± 22% sched_debug.cfs_rq:/.left_vruntime.stddev
29828862 ± 10% -25.9% 22114901 ± 12% sched_debug.cfs_rq:/.min_vruntime.avg
31729867 ± 10% -23.1% 24394442 ± 12% sched_debug.cfs_rq:/.min_vruntime.max
27439158 ± 11% -30.2% 19146267 ± 13% sched_debug.cfs_rq:/.min_vruntime.min
1805948 ± 23% -31.8% 1231116 ± 22% sched_debug.cfs_rq:/.right_vruntime.stddev
468.50 ± 5% -25.1% 351.00 ± 24% sched_debug.cfs_rq:/.util_est.avg
1538 ± 7% -27.1% 1121 ± 7% sched_debug.cfs_rq:/.util_est.max
291424 ± 4% -4.1% 279460 ± 4% sched_debug.cpu.clock_task.min
1210 ± 2% +12.2% 1357 ± 3% sched_debug.cpu.clock_task.stddev
24329 ± 7% -13.3% 21085 ± 6% sched_debug.cpu.curr->pid.max
5671 ± 7% -11.3% 5028 ± 10% sched_debug.cpu.nr_switches.avg
31944 ± 6% +25.3% 40025 ± 17% sched_debug.cpu.nr_switches.max
3279 ± 8% -18.3% 2678 ± 8% sched_debug.cpu.nr_switches.min
3073 ± 9% +29.1% 3967 ± 15% sched_debug.cpu.nr_switches.stddev
7540874 -40.6% 4480163 ± 2% proc-vmstat.nr_active_anon
7192890 -40.2% 4298978 ± 2% proc-vmstat.nr_anon_pages
2209141 +12.2% 2479543 proc-vmstat.nr_dirty_background_threshold
4423684 +12.2% 4965150 proc-vmstat.nr_dirty_threshold
22285247 +12.1% 24987745 proc-vmstat.nr_free_pages
45347 -1.1% 44832 proc-vmstat.nr_kernel_stack
34897 -8.5% 31928 proc-vmstat.nr_page_table_pages
7540867 -40.6% 4480162 ± 2% proc-vmstat.nr_zone_active_anon
144536 ± 6% +40.8% 203521 ± 9% proc-vmstat.numa_hint_faults_local
1.778e+09 -16.5% 1.484e+09 proc-vmstat.numa_hit
1475274 +57.2% 2319414 ± 2% proc-vmstat.numa_huge_pte_updates
1.777e+09 -16.5% 1.484e+09 proc-vmstat.numa_local
267417 -0.9% 264919 proc-vmstat.numa_other
1726476 ± 4% +182.5% 4877430 ± 3% proc-vmstat.numa_pages_migrated
7.943e+08 +51.1% 1.2e+09 ± 2% proc-vmstat.numa_pte_updates
1.793e+09 -16.2% 1.503e+09 proc-vmstat.pgalloc_normal
1.774e+09 -16.5% 1.481e+09 proc-vmstat.pgfault
1.793e+09 -16.2% 1.502e+09 proc-vmstat.pgfree
7187712 +18.2% 8494165 ± 2% proc-vmstat.pgmigrate_fail
1726476 ± 4% +182.5% 4877430 ± 3% proc-vmstat.pgmigrate_success
180904 -10.4% 162040 proc-vmstat.pgreuse
13401 -16.5% 11190 proc-vmstat.thp_fault_alloc
14046 +18.2% 16598 ± 2% proc-vmstat.thp_migration_fail
3199 ± 4% +193.0% 9375 ± 4% proc-vmstat.thp_migration_success
3430623 -16.5% 2864565 proc-vmstat.thp_split_pmd
8447 +16.5% 9841 proc-vmstat.unevictable_pgs_culled
3.45 +12.3% 3.87 perf-stat.i.MPKI
3.587e+10 -20.3% 2.859e+10 perf-stat.i.branch-instructions
0.06 +0.0 0.10 ± 6% perf-stat.i.branch-miss-rate%
16449735 +19.0% 19582727 perf-stat.i.branch-misses
54.86 +2.2 57.09 perf-stat.i.cache-miss-rate%
4.407e+08 -11.2% 3.915e+08 perf-stat.i.cache-misses
8.022e+08 -14.3% 6.875e+08 perf-stat.i.cache-references
9138 -4.7% 8709 ± 2% perf-stat.i.context-switches
4.31 +45.0% 6.25 ± 6% perf-stat.i.cpi
5.52e+11 -9.4% 5e+11 perf-stat.i.cpu-cycles
803.00 -9.9% 723.58 perf-stat.i.cpu-migrations
1255 +28.5% 1612 ± 6% perf-stat.i.cycles-between-cache-misses
1.278e+11 -20.4% 1.017e+11 perf-stat.i.instructions
0.24 -8.5% 0.22 perf-stat.i.ipc
45.69 -16.2% 38.30 perf-stat.i.metric.K/sec
5871870 -18.2% 4801162 perf-stat.i.minor-faults
5871868 -18.2% 4801161 perf-stat.i.page-faults
3.45 +11.8% 3.86 perf-stat.overall.MPKI
0.04 +0.0 0.07 perf-stat.overall.branch-miss-rate%
54.96 +2.0 56.96 perf-stat.overall.cache-miss-rate%
4.32 +12.5% 4.86 perf-stat.overall.cpi
0.23 -11.1% 0.21 perf-stat.overall.ipc
4825 -3.0% 4679 perf-stat.overall.path-length
3.562e+10 -19.1% 2.882e+10 perf-stat.ps.branch-instructions
15804911 +21.0% 19125930 perf-stat.ps.branch-misses
4.379e+08 -9.8% 3.95e+08 perf-stat.ps.cache-misses
7.968e+08 -13.0% 6.935e+08 perf-stat.ps.cache-references
9012 -5.3% 8538 ± 2% perf-stat.ps.context-switches
5.481e+11 -9.2% 4.975e+11 perf-stat.ps.cpu-cycles
793.13 -10.7% 708.06 perf-stat.ps.cpu-migrations
1.269e+11 -19.3% 1.024e+11 perf-stat.ps.instructions
5836046 -16.8% 4855858 perf-stat.ps.minor-faults
5836046 -16.8% 4855859 perf-stat.ps.page-faults
3.845e+13 -19.0% 3.113e+13 perf-stat.total.instructions
42.65 ±218% -99.7% 0.11 ±191% perf-sched.sch_delay.avg.ms.__cond_resched.__alloc_frozen_pages_noprof.alloc_pages_mpol.alloc_pages_noprof.__pmd_alloc
0.48 ±146% -99.8% 0.00 ±223% perf-sched.sch_delay.avg.ms.__cond_resched.__kmalloc_cache_noprof.single_open.single_open_size.do_dentry_open
2.29 ± 24% -60.6% 0.90 ± 36% perf-sched.sch_delay.avg.ms.__cond_resched.__split_huge_pmd.do_huge_pmd_wp_page.__handle_mm_fault.handle_mm_fault
0.25 ± 89% -93.4% 0.02 ± 59% perf-sched.sch_delay.avg.ms.__cond_resched.down_write_killable.exec_mmap.begin_new_exec.load_elf_binary
4.09 ± 79% -92.4% 0.31 ±177% perf-sched.sch_delay.avg.ms.__cond_resched.dput.__fput.__x64_sys_close.do_syscall_64
3.78 ± 97% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.kmem_cache_alloc_noprof.alloc_pid.copy_process.kernel_clone
2.93 ± 26% -64.2% 1.05 ± 24% perf-sched.sch_delay.avg.ms.__cond_resched.wp_page_copy.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
0.82 ± 26% -49.9% 0.41 ± 52% perf-sched.sch_delay.avg.ms.devkmsg_read.vfs_read.ksys_read.do_syscall_64
3.27 ± 23% +49.0% 4.87 ± 14% perf-sched.sch_delay.avg.ms.irqentry_exit_to_user_mode.asm_exc_page_fault.[unknown]
3.66 ± 22% -62.4% 1.38 ± 34% perf-sched.sch_delay.avg.ms.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt.[unknown]
0.07 ± 23% +89.3% 0.14 ± 62% perf-sched.sch_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
0.98 ± 10% +30.9% 1.28 ± 23% perf-sched.sch_delay.avg.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
170.74 ±217% -99.9% 0.20 ±200% perf-sched.sch_delay.max.ms.__cond_resched.__alloc_frozen_pages_noprof.alloc_pages_mpol.alloc_pages_noprof.__pmd_alloc
0.87 ±161% -99.9% 0.00 ±223% perf-sched.sch_delay.max.ms.__cond_resched.__kmalloc_cache_noprof.single_open.single_open_size.do_dentry_open
10.33 ± 20% -51.8% 4.98 ± 36% perf-sched.sch_delay.max.ms.__cond_resched.__split_huge_pmd.do_huge_pmd_wp_page.__handle_mm_fault.handle_mm_fault
11.45 ± 47% -83.9% 1.84 ±145% perf-sched.sch_delay.max.ms.__cond_resched.change_pmd_range.isra.0.change_pud_range
1.13 ± 96% -98.2% 0.02 ± 54% perf-sched.sch_delay.max.ms.__cond_resched.down_write_killable.exec_mmap.begin_new_exec.load_elf_binary
11.78 ± 66% -95.1% 0.58 ±190% perf-sched.sch_delay.max.ms.__cond_resched.dput.__fput.__x64_sys_close.do_syscall_64
12.61 ±121% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.kmem_cache_alloc_noprof.alloc_pid.copy_process.kernel_clone
73.69 ±137% -83.0% 12.50 ± 35% perf-sched.sch_delay.max.ms.pipe_read.vfs_read.ksys_read.do_syscall_64
1.00 ±137% +1001.5% 11.02 ±162% perf-sched.sch_delay.max.ms.schedule_preempt_disabled.rwsem_down_read_slowpath.down_read.folio_lock_anon_vma_read
0.09 ± 56% +319.7% 0.37 ± 65% perf-sched.sch_delay.max.ms.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write.__put_anon_vma
9.93 ± 55% -77.0% 2.29 ±104% perf-sched.sch_delay.max.ms.wait_for_partner.fifo_open.do_dentry_open.vfs_open
13.75 ± 31% +1908.6% 276.09 ±203% perf-sched.sch_delay.max.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
9.29 ± 13% +78.0% 16.54 ± 22% perf-sched.wait_and_delay.avg.ms.__cond_resched.__alloc_frozen_pages_noprof.alloc_pages_mpol.folio_alloc_mpol_noprof.vma_alloc_folio_noprof
818.50 ± 23% -100.0% 0.00 perf-sched.wait_and_delay.avg.ms.__cond_resched.kmem_cache_alloc_noprof.alloc_pid.copy_process.kernel_clone
164.97 ±107% +241.3% 563.02 ± 31% perf-sched.wait_and_delay.avg.ms.__x64_sys_pause.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown]
7.60 ± 8% -16.3% 6.36 ± 5% perf-sched.wait_and_delay.avg.ms.devkmsg_read.vfs_read.ksys_read.do_syscall_64
9.12 ± 9% +41.1% 12.86 ± 7% perf-sched.wait_and_delay.avg.ms.irqentry_exit_to_user_mode.asm_exc_page_fault.[unknown]
830.40 ± 18% -100.0% 0.00 perf-sched.wait_and_delay.avg.ms.pipe_write.vfs_write.ksys_write.do_syscall_64
15.56 ± 4% -17.7% 12.81 ± 8% perf-sched.wait_and_delay.avg.ms.schedule_hrtimeout_range.ep_poll.do_epoll_wait.__x64_sys_epoll_wait
1608 ± 19% -81.7% 293.67 ± 15% perf-sched.wait_and_delay.count.__cond_resched.__alloc_frozen_pages_noprof.alloc_pages_mpol.folio_alloc_mpol_noprof.vma_alloc_folio_noprof
2.50 ± 68% -100.0% 0.00 perf-sched.wait_and_delay.count.__cond_resched.kmem_cache_alloc_noprof.alloc_pid.copy_process.kernel_clone
27.67 ± 70% -84.3% 4.33 ±134% perf-sched.wait_and_delay.count.__cond_resched.refresh_cpu_vm_stats.vmstat_update.process_one_work.worker_thread
3154 ± 20% -73.3% 842.50 ± 17% perf-sched.wait_and_delay.count.__cond_resched.wp_page_copy.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
586.50 ± 5% +26.4% 741.50 ± 8% perf-sched.wait_and_delay.count.devkmsg_read.vfs_read.ksys_read.do_syscall_64
426.17 ± 25% -88.3% 49.67 ±141% perf-sched.wait_and_delay.count.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt.[unknown]
3.67 ± 20% -100.0% 0.00 perf-sched.wait_and_delay.count.pipe_write.vfs_write.ksys_write.do_syscall_64
946.33 ± 12% +48.7% 1407 ± 17% perf-sched.wait_and_delay.count.schedule_hrtimeout_range.do_poll.constprop.0.do_sys_poll
611.17 ± 6% +23.9% 757.17 ± 7% perf-sched.wait_and_delay.count.schedule_hrtimeout_range.ep_poll.do_epoll_wait.__x64_sys_epoll_wait
1027 ± 3% -100.0% 0.00 perf-sched.wait_and_delay.max.ms.__cond_resched.kmem_cache_alloc_noprof.alloc_pid.copy_process.kernel_clone
1553 ± 5% -100.0% 0.00 perf-sched.wait_and_delay.max.ms.pipe_write.vfs_write.ksys_write.do_syscall_64
5.87 ± 11% +126.7% 13.30 ± 25% perf-sched.wait_time.avg.ms.__cond_resched.__alloc_frozen_pages_noprof.alloc_pages_mpol.folio_alloc_mpol_noprof.vma_alloc_folio_noprof
498.88 ± 57% -100.0% 0.00 ±223% perf-sched.wait_time.avg.ms.__cond_resched.__kmalloc_cache_noprof.single_open.single_open_size.do_dentry_open
3.55 ±169% +514.3% 21.79 ± 30% perf-sched.wait_time.avg.ms.__cond_resched.__kmalloc_noprof.inotify_handle_inode_event.send_to_group.fsnotify
4.09 ± 79% -92.4% 0.31 ±179% perf-sched.wait_time.avg.ms.__cond_resched.dput.__fput.__x64_sys_close.do_syscall_64
814.72 ± 24% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.kmem_cache_alloc_noprof.alloc_pid.copy_process.kernel_clone
0.93 ± 73% -86.9% 0.12 ±122% perf-sched.wait_time.avg.ms.__cond_resched.stop_one_cpu.sched_exec.bprm_execve.part
165.07 ±106% +240.8% 562.58 ± 31% perf-sched.wait_time.avg.ms.__x64_sys_pause.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown]
6.78 ± 6% -12.2% 5.95 ± 3% perf-sched.wait_time.avg.ms.devkmsg_read.vfs_read.ksys_read.do_syscall_64
5.85 ± 5% +36.7% 8.00 ± 8% perf-sched.wait_time.avg.ms.irqentry_exit_to_user_mode.asm_exc_page_fault.[unknown]
829.13 ± 19% -99.9% 0.91 ±223% perf-sched.wait_time.avg.ms.pipe_write.vfs_write.ksys_write.do_syscall_64
15.07 ± 4% -17.5% 12.43 ± 7% perf-sched.wait_time.avg.ms.schedule_hrtimeout_range.ep_poll.do_epoll_wait.__x64_sys_epoll_wait
6.88 ± 33% -51.4% 3.34 ± 38% perf-sched.wait_time.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
80.29 ± 10% +31.1% 105.26 ± 20% perf-sched.wait_time.max.ms.__cond_resched.__alloc_frozen_pages_noprof.alloc_pages_mpol.folio_alloc_mpol_noprof.vma_alloc_folio_noprof
830.56 ± 44% -100.0% 0.00 ±223% perf-sched.wait_time.max.ms.__cond_resched.__kmalloc_cache_noprof.single_open.single_open_size.do_dentry_open
6.25 ±188% +2184.2% 142.83 ± 55% perf-sched.wait_time.max.ms.__cond_resched.__kmalloc_noprof.inotify_handle_inode_event.send_to_group.fsnotify
11.78 ± 66% -95.1% 0.58 ±190% perf-sched.wait_time.max.ms.__cond_resched.dput.__fput.__x64_sys_close.do_syscall_64
1020 ± 2% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.kmem_cache_alloc_noprof.alloc_pid.copy_process.kernel_clone
14.39 ± 57% -83.7% 2.34 ±137% perf-sched.wait_time.max.ms.__cond_resched.stop_one_cpu.sched_exec.bprm_execve.part
58.23 ± 39% +83.4% 106.83 ± 31% perf-sched.wait_time.max.ms.io_schedule.migration_entry_wait_on_locked.migration_entry_wait.do_swap_page
1553 ± 5% -99.9% 1.13 ±223% perf-sched.wait_time.max.ms.pipe_write.vfs_write.ksys_write.do_syscall_64
34.14 ± 80% +138.5% 81.42 ± 38% perf-sched.wait_time.max.ms.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write.unlink_anon_vmas
175.44 ±210% -98.7% 2.29 ±104% perf-sched.wait_time.max.ms.wait_for_partner.fifo_open.do_dentry_open.vfs_open
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Re: [linus:master] [mm/rmap] 6af8cb80d3: vm-scalability.throughput 7.8% regression
From: David Hildenbrand @ 2025-04-16 8:07 UTC
To: kernel test robot
Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, Andy Lutomirks^H^Hski,
Borislav Betkov, Dave Hansen, Ingo Molnar, Jann Horn,
Johannes Weiner, Jonathan Corbet, Kirill A. Shutemov, Lance Yang,
Liam Howlett, Lorenzo Stoakes, Matthew Wilcow, Michal Koutn,
Muchun Song, tejun heo, Thomas Gleixner, Vlastimil Babka,
Zefan Li, linux-mm
On 16.04.25 09:01, kernel test robot wrote:
>
>
> Hello,
>
> kernel test robot noticed a 7.8% regression of vm-scalability.throughput on:
>
>
> commit: 6af8cb80d3a9a6bbd521d8a7c949b4eafb7dba5d ("mm/rmap: basic MM owner tracking for large folios (!hugetlb)")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
>
> testcase: vm-scalability
> config: x86_64-rhel-9.4
> compiler: gcc-12
> test machine: 256 threads 2 sockets GENUINE INTEL(R) XEON(R) (Sierra Forest) with 128G memory
> parameters:
>
> runtime: 300s
> size: 8T
> test: anon-cow-seq
> cpufreq_governor: performance
>
This should be the scenario with THP enabled. At first, I thought the
problem would be contention on the per-folio spinlock, but what makes me
scratch my head is the following:
13401 -16.5% 11190 proc-vmstat.thp_fault_alloc
... 3430623 -16.5% 2864565 proc-vmstat.thp_split_pmd
If we allocate fewer THPs, the benchmark will obviously perform worse.
We allocated 2211 fewer THPs and had 566058 fewer THP PMD->PTE remappings.
566058 / 2211 ≈ 256, which matches the number of threads (the child
processes vm-scalability forks).
So it was in fact the benchmark that effectively used 16.5% fewer THPs.
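Spelled out, the arithmetic from the table is:

  thp_fault_alloc:    13401 -   11190 =   2211   (-16.5%)
  thp_split_pmd:    3430623 - 2864565 = 566058   (-16.5%)
  566058 / 2211 ≈ 256 = number of forked child processes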
I don't see how this patch would affect the allocation of THPs in any
way (and I don't think it does).
Regarding possible contention on the spinlock, I was already expecting a
slight hit once we have that many threads over multiple sockets. From my
cover letter:
"
Similarly, running these benchmarks with 2 MiB THPs enabled on the
AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1%
stdev, which is nice.
So far, I did not get my hands on a similarly large system with multiple
sockets.
"
And further:
"
If it ever becomes a problem we could either investigate improving the
locking, or simply stopping the MM tracking once there are "too many
mappings" and simply assume that the folio is "mapped shared" until it
was freed.
[...] Adding that logic to stop adds more code to the hot path, so I
avoided that for now.
"
So while I am planning on looking into optimizing the locking at some
point, it has low priority for me because (a) it adds more complexity,
(b) it has the potential to affect the hot path (the not-shared case),
and (c) this benchmark at that scale is not a compelling argument.
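To make the locking concern concrete, here is a minimal illustrative sketch
(userspace C, hypothetical names; this is *not* the mm/rmap code) of the
general idea: a per-folio lock protects a small, fixed set of tracked owner
MMs, and tracking conservatively falls back to "mapped shared" once it can
no longer be exact. Every (un)map of a large folio has to take that
per-folio lock, which is where contention with 256 threads across two
sockets could show up.

/*
 * Illustrative sketch only -- hypothetical names, userspace C, not the
 * mm/rmap code. A large folio tracks up to two "owner" MMs under a
 * per-folio lock; once a third MM maps it, tracking gives up and the
 * folio is conservatively treated as "mapped shared".
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

#define MM_ID_NONE 0u

struct folio_mm_track {
	pthread_spinlock_t lock;   /* stands in for the per-folio spinlock */
	uint32_t mm_id[2];         /* up to two tracked owner MMs */
	int mm_mapcount[2];        /* per-owner mapping counts */
	bool shared;               /* conservative "mapped shared" fallback */
};

static void track_init(struct folio_mm_track *t)
{
	pthread_spin_init(&t->lock, PTHREAD_PROCESS_PRIVATE);
	t->mm_id[0] = t->mm_id[1] = MM_ID_NONE;
	t->mm_mapcount[0] = t->mm_mapcount[1] = 0;
	t->shared = false;
}

/* Called for each mapping that @mm_id adds; this is the hot path. */
static void track_map(struct folio_mm_track *t, uint32_t mm_id)
{
	int i, free_slot = -1;

	pthread_spin_lock(&t->lock);
	if (t->shared)
		goto out;
	for (i = 0; i < 2; i++) {
		if (t->mm_id[i] == mm_id) {
			t->mm_mapcount[i]++;
			goto out;
		}
		if (t->mm_id[i] == MM_ID_NONE)
			free_slot = i;
	}
	if (free_slot >= 0) {
		t->mm_id[free_slot] = mm_id;
		t->mm_mapcount[free_slot] = 1;
	} else {
		/* A third MM: stop tracking, assume "mapped shared". */
		t->shared = true;
	}
out:
	pthread_spin_unlock(&t->lock);
}

/* "Mapped exclusively by @mm_id" only if it owns all mappings. */
static bool track_mapped_exclusively(struct folio_mm_track *t,
				     uint32_t mm_id, int total_mapcount)
{
	bool ret = false;
	int i;

	pthread_spin_lock(&t->lock);
	if (!t->shared)
		for (i = 0; i < 2; i++)
			if (t->mm_id[i] == mm_id)
				ret = (t->mm_mapcount[i] == total_mapcount);
	pthread_spin_unlock(&t->lock);
	return ret;
}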
--
Cheers,
David / dhildenb
Re: [linus:master] [mm/rmap] 6af8cb80d3: vm-scalability.throughput 7.8% regression
From: David Hildenbrand @ 2025-04-16 9:16 UTC
To: kernel test robot
Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, Andy Lutomirks^H^Hski,
Borislav Betkov, Dave Hansen, Ingo Molnar, Jann Horn,
Johannes Weiner, Jonathan Corbet, Kirill A. Shutemov, Lance Yang,
Liam Howlett, Lorenzo Stoakes, Matthew Wilcow, Michal Koutn,
Muchun Song, tejun heo, Thomas Gleixner, Vlastimil Babka,
Zefan Li, linux-mm
On 16.04.25 10:07, David Hildenbrand wrote:
> On 16.04.25 09:01, kernel test robot wrote:
>>
>>
>> Hello,
>>
>> kernel test robot noticed a 7.8% regression of vm-scalability.throughput on:
>>
>>
>> commit: 6af8cb80d3a9a6bbd521d8a7c949b4eafb7dba5d ("mm/rmap: basic MM owner tracking for large folios (!hugetlb)")
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>
>>
>> testcase: vm-scalability
>> config: x86_64-rhel-9.4
>> compiler: gcc-12
>> test machine: 256 threads 2 sockets GENUINE INTEL(R) XEON(R) (Sierra Forest) with 128G memory
>> parameters:
>>
>> runtime: 300s
>> size: 8T
>> test: anon-cow-seq
>> cpufreq_governor: performance
>>
>
> This should be the scenario with THP enabled. At first, I thought the
> problem would be contention on the per-folio spinlock, but what makes me
> scratch my head is the following:
>
> 13401 -16.5% 11190 proc-vmstat.thp_fault_alloc
> ... 3430623 -16.5% 2864565 proc-vmstat.thp_split_pmd
>
>
> If we allocate fewer THPs, the benchmark will obviously perform worse.
>
> We allocated 2211 fewer THPs and had 566058 fewer THP PMD->PTE remappings.
>
> 566058 / 2211 ≈ 256, which matches the number of threads (the child
> processes vm-scalability forks).
>
> So it was in fact the benchmark that effectively used 16.5% fewer THPs.
>
> I don't see how this patch would affect the allocation of THPs in any
> way (and I don't think it does).
Thinking about this some more: assuming both runs perform the same number
of test executions, we would expect the number of allocated THPs not to
change (unless we really have fragmentation that results in fewer THPs
getting allocated).
If we run into the 300s timeout and the test is aborted early, we could
end up with a difference in executions and, therefore, in THP allocations.
I recall that we usually try to have the same number of benchmark
executions and not run into the timeout (otherwise some of these stats,
like THP allocations, are completely unreliable).
Maybe
      7.968e+09           -16.5%      6.652e+09        vm-scalability.workload
indicates that we ended up with fewer executions? At least the
"repro-script" seems to indicate that we always execute a fixed number
of executions, but maybe the repro-script is aborted by the benchmark
framework.
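For what it's worth, the relative drop lines up across the counters in
the report:

  vm-scalability.workload:   7.968e+09 -> 6.652e+09   (-16.5%)
  thp_fault_alloc:               13401 -> 11190       (-16.5%)
  minor_page_faults:         1.772e+09 -> 1.480e+09   (-16.5%)

which is what we would expect if the same per-execution behavior simply
ran 16.5% fewer times.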
--
Cheers,
David / dhildenb