[linus:master] [mm/rmap] 6af8cb80d3: vm-scalability.throughput 7.8% regression
From: kernel test robot @ 2025-04-16 7:01 UTC
To: David Hildenbrand
Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, Andy Lutomirks^H^Hski,
Borislav Betkov, Dave Hansen, Ingo Molnar, Jann Horn,
Johannes Weiner, Jonathan Corbet, Kirill A. Shutemov, Lance Yang,
Liam Howlett, Lorenzo Stoakes, Matthew Wilcow, Michal Koutn,
Muchun Song, tejun heo, Thomas Gleixner, Vlastimil Babka,
Zefan Li, linux-mm, oliver.sang
Hello,
kernel test robot noticed a 7.8% regression of vm-scalability.throughput on:
commit: 6af8cb80d3a9a6bbd521d8a7c949b4eafb7dba5d ("mm/rmap: basic MM owner tracking for large folios (!hugetlb)")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
testcase: vm-scalability
config: x86_64-rhel-9.4
compiler: gcc-12
test machine: 256 threads 2 sockets GENUINE INTEL(R) XEON(R) (Sierra Forest) with 128G memory
parameters:
runtime: 300s
size: 8T
test: anon-cow-seq
cpufreq_governor: performance
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags:
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202504152235.188dcce9-lkp@intel.com
Details are as below:
-------------------------------------------------------------------------------------------------->
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250415/202504152235.188dcce9-lkp@intel.com
=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase:
gcc-12/performance/x86_64-rhel-9.4/debian-12-x86_64-20240206.cgz/300s/8T/lkp-srf-2sp1/anon-cow-seq/vm-scalability
commit:
448854478a ("mm/rmap: use folio_large_nr_pages() in add/remove functions")
6af8cb80d3 ("mm/rmap: basic MM owner tracking for large folios (!hugetlb)")
448854478ab2f477 6af8cb80d3a9a6bbd521d8a7c94
---------------- ---------------------------
value ±%stddev      %change      value ±%stddev      metric
53400 +12.6% 60113 uptime.idle
1.457e+10 ± 3% +45.8% 2.125e+10 ± 3% cpuidle..time
9999188 ± 2% +41.4% 14142800 ± 2% cpuidle..usage
19401 ± 7% -21.8% 15176 ± 15% perf-c2c.DRAM.local
42430 ± 4% +51.6% 64345 ± 12% perf-c2c.HITM.local
52676 ± 6% +44.2% 75955 ± 11% perf-c2c.HITM.total
8.894e+08 -15.0% 7.557e+08 ± 7% numa-numastat.node0.local_node
8.895e+08 -15.0% 7.559e+08 ± 7% numa-numastat.node0.numa_hit
8.879e+08 -18.0% 7.284e+08 ± 8% numa-numastat.node1.local_node
8.88e+08 -18.0% 7.285e+08 ± 8% numa-numastat.node1.numa_hit
15.74 ± 4% +54.4% 24.31 ± 4% vmstat.cpu.id
214.36 -12.0% 188.68 vmstat.procs.r
9162 -6.0% 8609 ± 2% vmstat.system.cs
359131 -3.1% 347966 vmstat.system.in
15.07 ± 4% +8.8 23.85 ± 4% mpstat.cpu.all.idle%
0.67 -0.1 0.57 mpstat.cpu.all.irq%
0.17 ± 7% -0.0 0.14 ± 5% mpstat.cpu.all.soft%
75.25 -7.8 67.47 mpstat.cpu.all.sys%
4.42 -0.9 3.52 mpstat.cpu.all.usr%
30105953 -40.3% 17974636 ± 2% meminfo.Active
30105953 -40.3% 17974636 ± 2% meminfo.Active(anon)
28734383 -40.0% 17245061 ± 2% meminfo.AnonPages
88649090 +12.1% 99375037 meminfo.MemAvailable
89179467 +12.0% 99904712 meminfo.MemFree
42488096 -25.2% 31762847 meminfo.Memused
73422718 -16.1% 61622071 ± 5% meminfo.max_used_kB
14976533 ± 2% -40.8% 8859130 ± 2% numa-meminfo.node0.Active
14976533 ± 2% -40.8% 8859130 ± 2% numa-meminfo.node0.Active(anon)
14404444 ± 2% -40.1% 8625820 ± 2% numa-meminfo.node0.AnonPages
2961490 ± 40% -58.4% 1231139 ±129% numa-meminfo.node0.FilePages
43362532 ± 2% +16.4% 50494564 ± 3% numa-meminfo.node0.MemFree
22336223 ± 5% -31.9% 15204191 ± 11% numa-meminfo.node0.MemUsed
15156929 -39.9% 9115542 ± 3% numa-meminfo.node1.Active
15156929 -39.9% 9115542 ± 3% numa-meminfo.node1.Active(anon)
14334430 -39.9% 8615995 ± 3% numa-meminfo.node1.AnonPages
31426803 ± 4% -6.1% 29506154 ± 4% numa-meminfo.node1.AnonPages.max
20191977 ± 6% -18.0% 16564588 ± 8% numa-meminfo.node1.MemUsed
138520 -22.3% 107667 vm-scalability.median
3.14 ± 3% +6.4 9.57 ± 9% vm-scalability.median_stddev%
2.35 ± 8% +12.7 15.05 ± 7% vm-scalability.stddev%
36255698 -7.8% 33421568 vm-scalability.throughput
589253 ± 2% -36.1% 376589 ± 5% vm-scalability.time.involuntary_context_switches
1.772e+09 -16.5% 1.48e+09 vm-scalability.time.minor_page_faults
20491 -10.6% 18325 vm-scalability.time.percent_of_cpu_this_job_got
58690 -9.8% 52967 vm-scalability.time.system_time
3314 -19.5% 2668 vm-scalability.time.user_time
423126 -25.5% 315143 ± 2% vm-scalability.time.voluntary_context_switches
7.968e+09 -16.5% 6.652e+09 vm-scalability.workload
3752122 ± 2% -40.9% 2218075 ± 3% numa-vmstat.node0.nr_active_anon
3604567 ± 2% -40.1% 2157950 ± 2% numa-vmstat.node0.nr_anon_pages
740400 ± 40% -58.4% 307780 ±129% numa-vmstat.node0.nr_file_pages
10838075 ± 2% +16.4% 12619573 ± 3% numa-vmstat.node0.nr_free_pages
3751375 ± 2% -40.9% 2217819 ± 3% numa-vmstat.node0.nr_zone_active_anon
8.895e+08 -15.0% 7.559e+08 ± 7% numa-vmstat.node0.numa_hit
8.894e+08 -15.0% 7.557e+08 ± 7% numa-vmstat.node0.numa_local
3814298 -40.7% 2260493 ± 2% numa-vmstat.node1.nr_active_anon
3602611 -40.7% 2135607 ± 2% numa-vmstat.node1.nr_anon_pages
3813553 -40.7% 2260182 ± 2% numa-vmstat.node1.nr_zone_active_anon
8.88e+08 -18.0% 7.285e+08 ± 8% numa-vmstat.node1.numa_hit
8.879e+08 -18.0% 7.284e+08 ± 8% numa-vmstat.node1.numa_local
29828859 ± 10% -25.9% 22114899 ± 12% sched_debug.cfs_rq:/.avg_vruntime.avg
31729867 ± 10% -23.1% 24394442 ± 12% sched_debug.cfs_rq:/.avg_vruntime.max
27439158 ± 11% -30.2% 19146267 ± 13% sched_debug.cfs_rq:/.avg_vruntime.min
1805955 ± 23% -31.8% 1231121 ± 22% sched_debug.cfs_rq:/.left_deadline.stddev
1805948 ± 23% -31.8% 1231114 ± 22% sched_debug.cfs_rq:/.left_vruntime.stddev
29828862 ± 10% -25.9% 22114901 ± 12% sched_debug.cfs_rq:/.min_vruntime.avg
31729867 ± 10% -23.1% 24394442 ± 12% sched_debug.cfs_rq:/.min_vruntime.max
27439158 ± 11% -30.2% 19146267 ± 13% sched_debug.cfs_rq:/.min_vruntime.min
1805948 ± 23% -31.8% 1231116 ± 22% sched_debug.cfs_rq:/.right_vruntime.stddev
468.50 ± 5% -25.1% 351.00 ± 24% sched_debug.cfs_rq:/.util_est.avg
1538 ± 7% -27.1% 1121 ± 7% sched_debug.cfs_rq:/.util_est.max
291424 ± 4% -4.1% 279460 ± 4% sched_debug.cpu.clock_task.min
1210 ± 2% +12.2% 1357 ± 3% sched_debug.cpu.clock_task.stddev
24329 ± 7% -13.3% 21085 ± 6% sched_debug.cpu.curr->pid.max
5671 ± 7% -11.3% 5028 ± 10% sched_debug.cpu.nr_switches.avg
31944 ± 6% +25.3% 40025 ± 17% sched_debug.cpu.nr_switches.max
3279 ± 8% -18.3% 2678 ± 8% sched_debug.cpu.nr_switches.min
3073 ± 9% +29.1% 3967 ± 15% sched_debug.cpu.nr_switches.stddev
7540874 -40.6% 4480163 ± 2% proc-vmstat.nr_active_anon
7192890 -40.2% 4298978 ± 2% proc-vmstat.nr_anon_pages
2209141 +12.2% 2479543 proc-vmstat.nr_dirty_background_threshold
4423684 +12.2% 4965150 proc-vmstat.nr_dirty_threshold
22285247 +12.1% 24987745 proc-vmstat.nr_free_pages
45347 -1.1% 44832 proc-vmstat.nr_kernel_stack
34897 -8.5% 31928 proc-vmstat.nr_page_table_pages
7540867 -40.6% 4480162 ± 2% proc-vmstat.nr_zone_active_anon
144536 ± 6% +40.8% 203521 ± 9% proc-vmstat.numa_hint_faults_local
1.778e+09 -16.5% 1.484e+09 proc-vmstat.numa_hit
1475274 +57.2% 2319414 ± 2% proc-vmstat.numa_huge_pte_updates
1.777e+09 -16.5% 1.484e+09 proc-vmstat.numa_local
267417 -0.9% 264919 proc-vmstat.numa_other
1726476 ± 4% +182.5% 4877430 ± 3% proc-vmstat.numa_pages_migrated
7.943e+08 +51.1% 1.2e+09 ± 2% proc-vmstat.numa_pte_updates
1.793e+09 -16.2% 1.503e+09 proc-vmstat.pgalloc_normal
1.774e+09 -16.5% 1.481e+09 proc-vmstat.pgfault
1.793e+09 -16.2% 1.502e+09 proc-vmstat.pgfree
7187712 +18.2% 8494165 ± 2% proc-vmstat.pgmigrate_fail
1726476 ± 4% +182.5% 4877430 ± 3% proc-vmstat.pgmigrate_success
180904 -10.4% 162040 proc-vmstat.pgreuse
13401 -16.5% 11190 proc-vmstat.thp_fault_alloc
14046 +18.2% 16598 ± 2% proc-vmstat.thp_migration_fail
3199 ± 4% +193.0% 9375 ± 4% proc-vmstat.thp_migration_success
3430623 -16.5% 2864565 proc-vmstat.thp_split_pmd
8447 +16.5% 9841 proc-vmstat.unevictable_pgs_culled
3.45 +12.3% 3.87 perf-stat.i.MPKI
3.587e+10 -20.3% 2.859e+10 perf-stat.i.branch-instructions
0.06 +0.0 0.10 ± 6% perf-stat.i.branch-miss-rate%
16449735 +19.0% 19582727 perf-stat.i.branch-misses
54.86 +2.2 57.09 perf-stat.i.cache-miss-rate%
4.407e+08 -11.2% 3.915e+08 perf-stat.i.cache-misses
8.022e+08 -14.3% 6.875e+08 perf-stat.i.cache-references
9138 -4.7% 8709 ± 2% perf-stat.i.context-switches
4.31 +45.0% 6.25 ± 6% perf-stat.i.cpi
5.52e+11 -9.4% 5e+11 perf-stat.i.cpu-cycles
803.00 -9.9% 723.58 perf-stat.i.cpu-migrations
1255 +28.5% 1612 ± 6% perf-stat.i.cycles-between-cache-misses
1.278e+11 -20.4% 1.017e+11 perf-stat.i.instructions
0.24 -8.5% 0.22 perf-stat.i.ipc
45.69 -16.2% 38.30 perf-stat.i.metric.K/sec
5871870 -18.2% 4801162 perf-stat.i.minor-faults
5871868 -18.2% 4801161 perf-stat.i.page-faults
3.45 +11.8% 3.86 perf-stat.overall.MPKI
0.04 +0.0 0.07 perf-stat.overall.branch-miss-rate%
54.96 +2.0 56.96 perf-stat.overall.cache-miss-rate%
4.32 +12.5% 4.86 perf-stat.overall.cpi
0.23 -11.1% 0.21 perf-stat.overall.ipc
4825 -3.0% 4679 perf-stat.overall.path-length
3.562e+10 -19.1% 2.882e+10 perf-stat.ps.branch-instructions
15804911 +21.0% 19125930 perf-stat.ps.branch-misses
4.379e+08 -9.8% 3.95e+08 perf-stat.ps.cache-misses
7.968e+08 -13.0% 6.935e+08 perf-stat.ps.cache-references
9012 -5.3% 8538 ± 2% perf-stat.ps.context-switches
5.481e+11 -9.2% 4.975e+11 perf-stat.ps.cpu-cycles
793.13 -10.7% 708.06 perf-stat.ps.cpu-migrations
1.269e+11 -19.3% 1.024e+11 perf-stat.ps.instructions
5836046 -16.8% 4855858 perf-stat.ps.minor-faults
5836046 -16.8% 4855859 perf-stat.ps.page-faults
3.845e+13 -19.0% 3.113e+13 perf-stat.total.instructions
42.65 ±218% -99.7% 0.11 ±191% perf-sched.sch_delay.avg.ms.__cond_resched.__alloc_frozen_pages_noprof.alloc_pages_mpol.alloc_pages_noprof.__pmd_alloc
0.48 ±146% -99.8% 0.00 ±223% perf-sched.sch_delay.avg.ms.__cond_resched.__kmalloc_cache_noprof.single_open.single_open_size.do_dentry_open
2.29 ± 24% -60.6% 0.90 ± 36% perf-sched.sch_delay.avg.ms.__cond_resched.__split_huge_pmd.do_huge_pmd_wp_page.__handle_mm_fault.handle_mm_fault
0.25 ± 89% -93.4% 0.02 ± 59% perf-sched.sch_delay.avg.ms.__cond_resched.down_write_killable.exec_mmap.begin_new_exec.load_elf_binary
4.09 ± 79% -92.4% 0.31 ±177% perf-sched.sch_delay.avg.ms.__cond_resched.dput.__fput.__x64_sys_close.do_syscall_64
3.78 ± 97% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.kmem_cache_alloc_noprof.alloc_pid.copy_process.kernel_clone
2.93 ± 26% -64.2% 1.05 ± 24% perf-sched.sch_delay.avg.ms.__cond_resched.wp_page_copy.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
0.82 ± 26% -49.9% 0.41 ± 52% perf-sched.sch_delay.avg.ms.devkmsg_read.vfs_read.ksys_read.do_syscall_64
3.27 ± 23% +49.0% 4.87 ± 14% perf-sched.sch_delay.avg.ms.irqentry_exit_to_user_mode.asm_exc_page_fault.[unknown]
3.66 ± 22% -62.4% 1.38 ± 34% perf-sched.sch_delay.avg.ms.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt.[unknown]
0.07 ± 23% +89.3% 0.14 ± 62% perf-sched.sch_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
0.98 ± 10% +30.9% 1.28 ± 23% perf-sched.sch_delay.avg.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
170.74 ±217% -99.9% 0.20 ±200% perf-sched.sch_delay.max.ms.__cond_resched.__alloc_frozen_pages_noprof.alloc_pages_mpol.alloc_pages_noprof.__pmd_alloc
0.87 ±161% -99.9% 0.00 ±223% perf-sched.sch_delay.max.ms.__cond_resched.__kmalloc_cache_noprof.single_open.single_open_size.do_dentry_open
10.33 ± 20% -51.8% 4.98 ± 36% perf-sched.sch_delay.max.ms.__cond_resched.__split_huge_pmd.do_huge_pmd_wp_page.__handle_mm_fault.handle_mm_fault
11.45 ± 47% -83.9% 1.84 ±145% perf-sched.sch_delay.max.ms.__cond_resched.change_pmd_range.isra.0.change_pud_range
1.13 ± 96% -98.2% 0.02 ± 54% perf-sched.sch_delay.max.ms.__cond_resched.down_write_killable.exec_mmap.begin_new_exec.load_elf_binary
11.78 ± 66% -95.1% 0.58 ±190% perf-sched.sch_delay.max.ms.__cond_resched.dput.__fput.__x64_sys_close.do_syscall_64
12.61 ±121% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.kmem_cache_alloc_noprof.alloc_pid.copy_process.kernel_clone
73.69 ±137% -83.0% 12.50 ± 35% perf-sched.sch_delay.max.ms.pipe_read.vfs_read.ksys_read.do_syscall_64
1.00 ±137% +1001.5% 11.02 ±162% perf-sched.sch_delay.max.ms.schedule_preempt_disabled.rwsem_down_read_slowpath.down_read.folio_lock_anon_vma_read
0.09 ± 56% +319.7% 0.37 ± 65% perf-sched.sch_delay.max.ms.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write.__put_anon_vma
9.93 ± 55% -77.0% 2.29 ±104% perf-sched.sch_delay.max.ms.wait_for_partner.fifo_open.do_dentry_open.vfs_open
13.75 ± 31% +1908.6% 276.09 ±203% perf-sched.sch_delay.max.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
9.29 ± 13% +78.0% 16.54 ± 22% perf-sched.wait_and_delay.avg.ms.__cond_resched.__alloc_frozen_pages_noprof.alloc_pages_mpol.folio_alloc_mpol_noprof.vma_alloc_folio_noprof
818.50 ± 23% -100.0% 0.00 perf-sched.wait_and_delay.avg.ms.__cond_resched.kmem_cache_alloc_noprof.alloc_pid.copy_process.kernel_clone
164.97 ±107% +241.3% 563.02 ± 31% perf-sched.wait_and_delay.avg.ms.__x64_sys_pause.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown]
7.60 ± 8% -16.3% 6.36 ± 5% perf-sched.wait_and_delay.avg.ms.devkmsg_read.vfs_read.ksys_read.do_syscall_64
9.12 ± 9% +41.1% 12.86 ± 7% perf-sched.wait_and_delay.avg.ms.irqentry_exit_to_user_mode.asm_exc_page_fault.[unknown]
830.40 ± 18% -100.0% 0.00 perf-sched.wait_and_delay.avg.ms.pipe_write.vfs_write.ksys_write.do_syscall_64
15.56 ± 4% -17.7% 12.81 ± 8% perf-sched.wait_and_delay.avg.ms.schedule_hrtimeout_range.ep_poll.do_epoll_wait.__x64_sys_epoll_wait
1608 ± 19% -81.7% 293.67 ± 15% perf-sched.wait_and_delay.count.__cond_resched.__alloc_frozen_pages_noprof.alloc_pages_mpol.folio_alloc_mpol_noprof.vma_alloc_folio_noprof
2.50 ± 68% -100.0% 0.00 perf-sched.wait_and_delay.count.__cond_resched.kmem_cache_alloc_noprof.alloc_pid.copy_process.kernel_clone
27.67 ± 70% -84.3% 4.33 ±134% perf-sched.wait_and_delay.count.__cond_resched.refresh_cpu_vm_stats.vmstat_update.process_one_work.worker_thread
3154 ± 20% -73.3% 842.50 ± 17% perf-sched.wait_and_delay.count.__cond_resched.wp_page_copy.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
586.50 ± 5% +26.4% 741.50 ± 8% perf-sched.wait_and_delay.count.devkmsg_read.vfs_read.ksys_read.do_syscall_64
426.17 ± 25% -88.3% 49.67 ±141% perf-sched.wait_and_delay.count.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt.[unknown]
3.67 ± 20% -100.0% 0.00 perf-sched.wait_and_delay.count.pipe_write.vfs_write.ksys_write.do_syscall_64
946.33 ± 12% +48.7% 1407 ± 17% perf-sched.wait_and_delay.count.schedule_hrtimeout_range.do_poll.constprop.0.do_sys_poll
611.17 ± 6% +23.9% 757.17 ± 7% perf-sched.wait_and_delay.count.schedule_hrtimeout_range.ep_poll.do_epoll_wait.__x64_sys_epoll_wait
1027 ± 3% -100.0% 0.00 perf-sched.wait_and_delay.max.ms.__cond_resched.kmem_cache_alloc_noprof.alloc_pid.copy_process.kernel_clone
1553 ± 5% -100.0% 0.00 perf-sched.wait_and_delay.max.ms.pipe_write.vfs_write.ksys_write.do_syscall_64
5.87 ± 11% +126.7% 13.30 ± 25% perf-sched.wait_time.avg.ms.__cond_resched.__alloc_frozen_pages_noprof.alloc_pages_mpol.folio_alloc_mpol_noprof.vma_alloc_folio_noprof
498.88 ± 57% -100.0% 0.00 ±223% perf-sched.wait_time.avg.ms.__cond_resched.__kmalloc_cache_noprof.single_open.single_open_size.do_dentry_open
3.55 ±169% +514.3% 21.79 ± 30% perf-sched.wait_time.avg.ms.__cond_resched.__kmalloc_noprof.inotify_handle_inode_event.send_to_group.fsnotify
4.09 ± 79% -92.4% 0.31 ±179% perf-sched.wait_time.avg.ms.__cond_resched.dput.__fput.__x64_sys_close.do_syscall_64
814.72 ± 24% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.kmem_cache_alloc_noprof.alloc_pid.copy_process.kernel_clone
0.93 ± 73% -86.9% 0.12 ±122% perf-sched.wait_time.avg.ms.__cond_resched.stop_one_cpu.sched_exec.bprm_execve.part
165.07 ±106% +240.8% 562.58 ± 31% perf-sched.wait_time.avg.ms.__x64_sys_pause.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown]
6.78 ± 6% -12.2% 5.95 ± 3% perf-sched.wait_time.avg.ms.devkmsg_read.vfs_read.ksys_read.do_syscall_64
5.85 ± 5% +36.7% 8.00 ± 8% perf-sched.wait_time.avg.ms.irqentry_exit_to_user_mode.asm_exc_page_fault.[unknown]
829.13 ± 19% -99.9% 0.91 ±223% perf-sched.wait_time.avg.ms.pipe_write.vfs_write.ksys_write.do_syscall_64
15.07 ± 4% -17.5% 12.43 ± 7% perf-sched.wait_time.avg.ms.schedule_hrtimeout_range.ep_poll.do_epoll_wait.__x64_sys_epoll_wait
6.88 ± 33% -51.4% 3.34 ± 38% perf-sched.wait_time.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
80.29 ± 10% +31.1% 105.26 ± 20% perf-sched.wait_time.max.ms.__cond_resched.__alloc_frozen_pages_noprof.alloc_pages_mpol.folio_alloc_mpol_noprof.vma_alloc_folio_noprof
830.56 ± 44% -100.0% 0.00 ±223% perf-sched.wait_time.max.ms.__cond_resched.__kmalloc_cache_noprof.single_open.single_open_size.do_dentry_open
6.25 ±188% +2184.2% 142.83 ± 55% perf-sched.wait_time.max.ms.__cond_resched.__kmalloc_noprof.inotify_handle_inode_event.send_to_group.fsnotify
11.78 ± 66% -95.1% 0.58 ±190% perf-sched.wait_time.max.ms.__cond_resched.dput.__fput.__x64_sys_close.do_syscall_64
1020 ± 2% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.kmem_cache_alloc_noprof.alloc_pid.copy_process.kernel_clone
14.39 ± 57% -83.7% 2.34 ±137% perf-sched.wait_time.max.ms.__cond_resched.stop_one_cpu.sched_exec.bprm_execve.part
58.23 ± 39% +83.4% 106.83 ± 31% perf-sched.wait_time.max.ms.io_schedule.migration_entry_wait_on_locked.migration_entry_wait.do_swap_page
1553 ± 5% -99.9% 1.13 ±223% perf-sched.wait_time.max.ms.pipe_write.vfs_write.ksys_write.do_syscall_64
34.14 ± 80% +138.5% 81.42 ± 38% perf-sched.wait_time.max.ms.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write.unlink_anon_vmas
175.44 ±210% -98.7% 2.29 ±104% perf-sched.wait_time.max.ms.wait_for_partner.fifo_open.do_dentry_open.vfs_open
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Re: [linus:master] [mm/rmap] 6af8cb80d3: vm-scalability.throughput 7.8% regression
From: David Hildenbrand @ 2025-04-16 8:07 UTC
To: kernel test robot
Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, Andy Lutomirks^H^Hski,
Borislav Betkov, Dave Hansen, Ingo Molnar, Jann Horn,
Johannes Weiner, Jonathan Corbet, Kirill A. Shutemov, Lance Yang,
Liam Howlett, Lorenzo Stoakes, Matthew Wilcow, Michal Koutn,
Muchun Song, tejun heo, Thomas Gleixner, Vlastimil Babka,
Zefan Li, linux-mm
On 16.04.25 09:01, kernel test robot wrote:
>
>
> Hello,
>
> kernel test robot noticed a 7.8% regression of vm-scalability.throughput on:
>
>
> commit: 6af8cb80d3a9a6bbd521d8a7c949b4eafb7dba5d ("mm/rmap: basic MM owner tracking for large folios (!hugetlb)")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
>
> testcase: vm-scalability
> config: x86_64-rhel-9.4
> compiler: gcc-12
> test machine: 256 threads 2 sockets GENUINE INTEL(R) XEON(R) (Sierra Forest) with 128G memory
> parameters:
>
> runtime: 300s
> size: 8T
> test: anon-cow-seq
> cpufreq_governor: performance
>
This should be the scenario with THP enabled. At first, I thought the
problem would be contention on the per-folio spinlock, but what makes me
scratch my head is the following:
13401 -16.5% 11190 proc-vmstat.thp_fault_alloc
... 3430623 -16.5% 2864565 proc-vmstat.thp_split_pmd
If we allocate fewer THPs, the benchmark will obviously perform worse.
We allocated 2211 fewer THPs and had 566058 fewer THP PMD->PTE remappings.
566058 / 2211 ≈ 256, which matches the number of threads (the child
processes vm-scalability forks).
So it was in fact the benchmark that effectively used 16.5% fewer THPs.
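Spelled out, the arithmetic from the table is:

  thp_fault_alloc:    13401 -   11190 =   2211   (-16.5%)
  thp_split_pmd:    3430623 - 2864565 = 566058   (-16.5%)
  566058 / 2211 ≈ 256 = number of forked child processes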
I don't see how this patch would affect the allocation of THPs in any
way (and I don't think it does).
Regarding possible contention on the spinlock, I was already expecting a
slight hit once we have that many threads over multiple sockets. From my
cover letter:
"
Similarly, running these benchmarks with 2 MiB THPs enabled on the
AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1%
stdev, which is nice.
So far, I did not get my hands on a similarly large system with multiple
sockets.
"
And further:
"
If it ever becomes a problem we could either investigate improving the
locking, or simply stopping the MM tracking once there are "too many
mappings" and simply assume that the folio is "mapped shared" until it
was freed.
[...] Adding that logic to stop adds more code to the hot path, so I
avoided that for now.
"
So while I am planning on looking into optimizing the locking at some
point, it has low priority for me because (a) it adds more complexity,
(b) it has the potential to affect the hot path (the not-shared case),
and (c) this benchmark at that scale is not a compelling argument.
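To make the locking concern concrete, here is a minimal illustrative sketch
(userspace C, hypothetical names; this is *not* the mm/rmap code) of the
general idea: a per-folio lock protects a small, fixed set of tracked owner
MMs, and tracking conservatively falls back to "mapped shared" once it can
no longer be exact. Every (un)map of a large folio has to take that
per-folio lock, which is where contention with 256 threads across two
sockets could show up.

/*
 * Illustrative sketch only -- hypothetical names, userspace C, not the
 * mm/rmap code. A large folio tracks up to two "owner" MMs under a
 * per-folio lock; once a third MM maps it, tracking gives up and the
 * folio is conservatively treated as "mapped shared".
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

#define MM_ID_NONE 0u

struct folio_mm_track {
	pthread_spinlock_t lock;   /* stands in for the per-folio spinlock */
	uint32_t mm_id[2];         /* up to two tracked owner MMs */
	int mm_mapcount[2];        /* per-owner mapping counts */
	bool shared;               /* conservative "mapped shared" fallback */
};

static void track_init(struct folio_mm_track *t)
{
	pthread_spin_init(&t->lock, PTHREAD_PROCESS_PRIVATE);
	t->mm_id[0] = t->mm_id[1] = MM_ID_NONE;
	t->mm_mapcount[0] = t->mm_mapcount[1] = 0;
	t->shared = false;
}

/* Called for each mapping that @mm_id adds; this is the hot path. */
static void track_map(struct folio_mm_track *t, uint32_t mm_id)
{
	int i, free_slot = -1;

	pthread_spin_lock(&t->lock);
	if (t->shared)
		goto out;
	for (i = 0; i < 2; i++) {
		if (t->mm_id[i] == mm_id) {
			t->mm_mapcount[i]++;
			goto out;
		}
		if (t->mm_id[i] == MM_ID_NONE)
			free_slot = i;
	}
	if (free_slot >= 0) {
		t->mm_id[free_slot] = mm_id;
		t->mm_mapcount[free_slot] = 1;
	} else {
		/* A third MM: stop tracking, assume "mapped shared". */
		t->shared = true;
	}
out:
	pthread_spin_unlock(&t->lock);
}

/* "Mapped exclusively by @mm_id" only if it owns all mappings. */
static bool track_mapped_exclusively(struct folio_mm_track *t,
				     uint32_t mm_id, int total_mapcount)
{
	bool ret = false;
	int i;

	pthread_spin_lock(&t->lock);
	if (!t->shared)
		for (i = 0; i < 2; i++)
			if (t->mm_id[i] == mm_id)
				ret = (t->mm_mapcount[i] == total_mapcount);
	pthread_spin_unlock(&t->lock);
	return ret;
}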
--
Cheers,
David / dhildenb
Re: [linus:master] [mm/rmap] 6af8cb80d3: vm-scalability.throughput 7.8% regression
From: David Hildenbrand @ 2025-04-16 9:16 UTC
To: kernel test robot
Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, Andy Lutomirks^H^Hski,
Borislav Betkov, Dave Hansen, Ingo Molnar, Jann Horn,
Johannes Weiner, Jonathan Corbet, Kirill A. Shutemov, Lance Yang,
Liam Howlett, Lorenzo Stoakes, Matthew Wilcow, Michal Koutn,
Muchun Song, tejun heo, Thomas Gleixner, Vlastimil Babka,
Zefan Li, linux-mm
On 16.04.25 10:07, David Hildenbrand wrote:
> On 16.04.25 09:01, kernel test robot wrote:
>>
>>
>> Hello,
>>
>> kernel test robot noticed a 7.8% regression of vm-scalability.throughput on:
>>
>>
>> commit: 6af8cb80d3a9a6bbd521d8a7c949b4eafb7dba5d ("mm/rmap: basic MM owner tracking for large folios (!hugetlb)")
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>
>>
>> testcase: vm-scalability
>> config: x86_64-rhel-9.4
>> compiler: gcc-12
>> test machine: 256 threads 2 sockets GENUINE INTEL(R) XEON(R) (Sierra Forest) with 128G memory
>> parameters:
>>
>> runtime: 300s
>> size: 8T
>> test: anon-cow-seq
>> cpufreq_governor: performance
>>
>
> This should be the scenario with THP enabled. At first, I thought the
> problem would be contention on the per-folio spinlock, but what makes me
> scratch my head is the following:
>
> 13401 -16.5% 11190 proc-vmstat.thp_fault_alloc
> ... 3430623 -16.5% 2864565 proc-vmstat.thp_split_pmd
>
>
> If we allocate fewer THPs, the benchmark will obviously perform worse.
>
> We allocated 2211 fewer THPs and had 566058 fewer THP PMD->PTE remappings.
>
> 566058 / 2211 ≈ 256, which matches the number of threads (the child
> processes vm-scalability forks).
>
> So it was in fact the benchmark that effectively used 16.5% fewer THPs.
>
> I don't see how this patch would affect the allocation of THPs in any
> way (and I don't think it does).
Thinking about this some more: assuming both runs perform the same number
of test executions, we would expect the number of allocated THPs not to
change (unless we really have fragmentation that results in fewer THPs
getting allocated).
If we run into the 300s timeout and the test is aborted early, we could
end up with a difference in executions and, therefore, in THP allocations.
I recall that we usually try to have the same number of benchmark
executions and not run into the timeout (otherwise some of these stats,
like THP allocations, are completely unreliable).
Maybe
      7.968e+09           -16.5%      6.652e+09        vm-scalability.workload
indicates that we ended up with fewer executions? At least the
"repro-script" seems to indicate that we always execute a fixed number
of executions, but maybe the repro-script is aborted by the benchmark
framework.
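For what it's worth, the relative drop lines up across the counters in
the report:

  vm-scalability.workload:   7.968e+09 -> 6.652e+09   (-16.5%)
  thp_fault_alloc:               13401 -> 11190       (-16.5%)
  minor_page_faults:         1.772e+09 -> 1.480e+09   (-16.5%)

which is what we would expect if the same per-execution behavior simply
ran 16.5% fewer times.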
--
Cheers,
David / dhildenb