* [linux-next:master] [mm] f66e2727dd: stress-ng.rawsock.ops_per_sec 46.9% regression
@ 2025-11-25 14:46 kernel test robot
2025-11-26 9:49 ` David Hildenbrand (Red Hat)
0 siblings, 1 reply; 4+ messages in thread
From: kernel test robot @ 2025-11-25 14:46 UTC (permalink / raw)
To: Kefeng Wang
Cc: oe-lkp, lkp, Andrew Morton, David Hildenbrand, Sidhartha Kumar,
Lorenzo Stoakes, Baolin Wang, Barry Song, Dev Jain, Lance Yang,
Liam Howlett, Ryan Roberts, Zi Yan, linux-mm, oliver.sang
Hello,
kernel test robot noticed a 46.9% regression of stress-ng.rawsock.ops_per_sec on:
commit: f66e2727ddfcbbe3dbb459e809824f721a914464 ("mm: huge_memory: use folio_can_map_prot_numa() for pmd folio")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
[still regression on linux-next/master d724c6f85e80a23ed46b7ebc6e38b527c09d64f5]
testcase: stress-ng
config: x86_64-rhel-9.4
compiler: gcc-14
test machine: 192 threads 2 sockets Intel(R) Xeon(R) 6740E CPU @ 2.4GHz (Sierra Forest) with 256G memory
parameters:
nr_threads: 100%
testtime: 60s
test: rawsock
cpufreq_governor: performance
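The job above roughly corresponds to the following stress-ng invocation (an
assumption inferred from the parameters listed; the rawsock stressor needs
CAP_NET_RAW, and the LKP harness adds further bookkeeping options):

  stress-ng --rawsock 0 --timeout 60s --metrics-brief

where "--rawsock 0" starts one stressor instance per online CPU
(nr_threads: 100%).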
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202511252236.33b899b3-lkp@intel.com
Details are as below:
-------------------------------------------------------------------------------------------------->
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20251125/202511252236.33b899b3-lkp@intel.com
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
gcc-14/performance/x86_64-rhel-9.4/100%/debian-13-x86_64-20250902.cgz/lkp-srf-2sp3/rawsock/stress-ng/60s
commit:
ca43034cdb ("mm: mprotect: convert to folio_can_map_prot_numa()")
f66e2727dd ("mm: huge_memory: use folio_can_map_prot_numa() for pmd folio")
ca43034cdb224131 f66e2727ddfcbbe3dbb459e8098
---------------- ---------------------------
%stddev %change %stddev
\ | \
1569273 ± 5% -10.8% 1399094 ± 6% numa-numastat.node1.numa_hit
5.156e+08 ± 9% -27.6% 3.735e+08 ± 5% cpuidle..time
376420 ± 5% -56.5% 163676 ± 5% cpuidle..usage
256283 ± 29% -64.8% 90309 ± 60% numa-meminfo.node1.Mapped
774069 ± 15% -58.8% 319020 ± 47% numa-meminfo.node1.Shmem
89117 ± 17% -62.6% 33337 ± 50% perf-c2c.DRAM.remote
78134 ± 16% -62.4% 29353 ± 50% perf-c2c.HITM.remote
78.45 ± 33% -100.0% 0.01 ±141% numa-vmstat.node0.nr_isolated_anon
65056 ± 29% -64.4% 23147 ± 59% numa-vmstat.node1.nr_mapped
197053 ± 15% -59.0% 80737 ± 47% numa-vmstat.node1.nr_shmem
1568725 ± 5% -10.3% 1407667 ± 6% numa-vmstat.node1.numa_hit
1608874 ± 2% -30.7% 1115409 ± 6% meminfo.Active
1608858 ± 2% -30.7% 1115393 ± 6% meminfo.Active(anon)
4549412 -10.4% 4076550 meminfo.Cached
368664 ± 8% -51.2% 179972 ± 36% meminfo.Mapped
870107 ± 4% -54.4% 397179 ± 19% meminfo.Shmem
0.70 ± 4% -0.7 0.00 mpstat.cpu.all.iowait%
0.24 +0.0 0.28 ± 3% mpstat.cpu.all.irq%
72.55 +13.7 86.20 mpstat.cpu.all.soft%
22.29 -11.8 10.51 ± 2% mpstat.cpu.all.sys%
0.88 ± 3% -0.5 0.39 ± 3% mpstat.cpu.all.usr%
46.33 ± 14% -78.8% 9.83 ± 21% mpstat.max_utilization.seconds
12.47 ± 5% +41.6% 17.65 ± 5% perf-sched.sch_delay.avg.ms.[unknown].[unknown].[unknown].[unknown].[unknown]
12.47 ± 5% +41.6% 17.65 ± 5% perf-sched.total_sch_delay.average.ms
59.96 ± 4% +40.2% 84.05 ± 5% perf-sched.total_wait_and_delay.average.ms
47.50 ± 3% +39.8% 66.40 ± 5% perf-sched.total_wait_time.average.ms
59.96 ± 4% +40.2% 84.05 ± 5% perf-sched.wait_and_delay.avg.ms.[unknown].[unknown].[unknown].[unknown].[unknown]
47.50 ± 3% +39.8% 66.40 ± 5% perf-sched.wait_time.avg.ms.[unknown].[unknown].[unknown].[unknown].[unknown]
4.18 ± 9% -1.2 2.97 ± 3% turbostat.C1%
4.16 ± 10% -28.8% 2.96 ± 3% turbostat.CPU%c1
0.15 -20.0% 0.12 turbostat.IPC
30643618 ± 6% -33.8% 20273146 ± 3% turbostat.IRQ
9235121 ± 11% -53.5% 4290976 ± 8% turbostat.NMI
20.28 -5.2% 19.23 turbostat.RAMWatt
2.80 ± 2% -84.0% 0.45 ± 12% stress-ng.rawsock.MB_recv'd_per_sec
1.263e+09 ± 2% -46.9% 6.705e+08 ± 4% stress-ng.rawsock.ops
21068817 ± 2% -46.9% 11184756 ± 4% stress-ng.rawsock.ops_per_sec
615583 -27.5% 446065 ± 9% stress-ng.time.involuntary_context_switches
168521 -52.0% 80837 stress-ng.time.minor_page_faults
4458 -53.7% 2062 ± 3% stress-ng.time.percent_of_cpu_this_job_got
2605 -54.0% 1199 ± 3% stress-ng.time.system_time
73.90 ± 2% -46.3% 39.67 ± 4% stress-ng.time.user_time
106021 ± 2% -81.2% 19900 ± 7% stress-ng.time.voluntary_context_switches
404635 ± 3% -31.0% 279044 ± 7% proc-vmstat.nr_active_anon
185837 -2.9% 180494 proc-vmstat.nr_anon_pages
1139738 -10.6% 1019325 proc-vmstat.nr_file_pages
153.81 ± 11% -100.0% 0.01 ±152% proc-vmstat.nr_isolated_anon
92592 ± 8% -51.0% 45333 ± 34% proc-vmstat.nr_mapped
219912 ± 5% -54.8% 99481 ± 19% proc-vmstat.nr_shmem
38659 -2.0% 37872 proc-vmstat.nr_slab_reclaimable
148828 +4.3% 155195 proc-vmstat.nr_slab_unreclaimable
404635 ± 3% -31.0% 279044 ± 7% proc-vmstat.nr_zone_active_anon
21570 ± 18% -61.8% 8245 ± 47% proc-vmstat.numa_hint_faults
16919 ± 21% -56.8% 7311 ± 51% proc-vmstat.numa_hint_faults_local
963.83 ± 7% -100.0% 0.00 proc-vmstat.numa_huge_pte_updates
111261 ± 3% -99.2% 909.67 ± 20% proc-vmstat.numa_pages_migrated
526245 ± 7% -97.7% 12122 ± 45% proc-vmstat.numa_pte_updates
535799 -22.9% 413207 ± 2% proc-vmstat.pgfault
111261 ± 3% -99.2% 909.67 ± 20% proc-vmstat.pgmigrate_success
21679 ± 7% -9.8% 19545 ± 2% proc-vmstat.pgreuse
1799808 -57.1% 772136 ± 9% sched_debug.cfs_rq:/.avg_vruntime.avg
2946020 ± 8% +251.1% 10342560 ± 10% sched_debug.cfs_rq:/.avg_vruntime.max
1006872 ± 8% -87.4% 126618 ± 18% sched_debug.cfs_rq:/.avg_vruntime.min
356117 ± 4% +262.9% 1292236 ± 17% sched_debug.cfs_rq:/.avg_vruntime.stddev
64.34 ± 33% +59.2% 102.44 ± 8% sched_debug.cfs_rq:/.load_avg.avg
1799808 -57.1% 772137 ± 9% sched_debug.cfs_rq:/.min_vruntime.avg
2946020 ± 8% +251.1% 10342560 ± 10% sched_debug.cfs_rq:/.min_vruntime.max
1006872 ± 8% -87.4% 126618 ± 18% sched_debug.cfs_rq:/.min_vruntime.min
356117 ± 4% +262.9% 1292236 ± 17% sched_debug.cfs_rq:/.min_vruntime.stddev
1008 ± 8% +15.7% 1166 sched_debug.cfs_rq:/.runnable_avg.avg
3848 ± 74% -67.7% 1243 ± 13% sched_debug.cfs_rq:/.util_avg.max
8.67 ±213% +2513.5% 226.50 ± 32% sched_debug.cfs_rq:/.util_avg.min
431.22 ± 40% -66.2% 145.58 ± 13% sched_debug.cfs_rq:/.util_avg.stddev
808.82 ± 11% -52.0% 387.95 ± 6% sched_debug.cfs_rq:/.util_est.avg
1337827 ± 3% -26.5% 983709 ± 2% sched_debug.cpu.avg_idle.avg
197988 ± 29% -97.1% 5680 ± 16% sched_debug.cpu.avg_idle.min
483058 ± 8% +27.5% 616103 ± 4% sched_debug.cpu.avg_idle.stddev
1721 ± 3% +243.6% 5914 ± 9% sched_debug.cpu.clock_task.stddev
1029615 -42.2% 594917 ± 2% sched_debug.cpu.max_idle_balance_cost.avg
1448914 ± 2% -11.7% 1279232 ± 6% sched_debug.cpu.max_idle_balance_cost.max
931867 -46.3% 500000 sched_debug.cpu.max_idle_balance_cost.min
99448 ± 18% +54.7% 153845 ± 9% sched_debug.cpu.max_idle_balance_cost.stddev
4446 -17.7% 3660 ± 5% sched_debug.cpu.nr_switches.avg
59305 ± 8% +23.6% 73291 ± 17% sched_debug.cpu.nr_switches.max
2177 ± 3% -47.3% 1147 ± 6% sched_debug.cpu.nr_switches.min
5185 ± 8% +26.1% 6537 ± 13% sched_debug.cpu.nr_switches.stddev
-99.50 -70.2% -29.67 sched_debug.cpu.nr_uninterruptible.min
30.41 ± 6% -54.1% 13.96 ± 30% sched_debug.cpu.nr_uninterruptible.stddev
0.18 ± 24% +59.1% 0.29 ± 4% sched_debug.dl_rq:.dl_nr_running.avg
2.23 -30.8% 1.54 ± 2% perf-stat.i.MPKI
1.973e+10 -11.8% 1.74e+10 perf-stat.i.branch-instructions
0.59 -0.2 0.41 ± 3% perf-stat.i.branch-miss-rate%
1.171e+08 -38.6% 71869487 ± 3% perf-stat.i.branch-misses
11.83 ± 2% -5.6 6.24 ± 2% perf-stat.i.cache-miss-rate%
2.013e+08 ± 2% -43.3% 1.14e+08 ± 3% perf-stat.i.cache-misses
1.706e+09 +10.7% 1.889e+09 perf-stat.i.cache-references
17575 -32.4% 11879 ± 9% perf-stat.i.context-switches
6.69 +22.6% 8.21 perf-stat.i.cpi
2421 ± 3% -76.6% 567.43 ± 6% perf-stat.i.cpu-migrations
3014 ± 2% +82.0% 5485 ± 4% perf-stat.i.cycles-between-cache-misses
9.022e+10 -18.1% 7.391e+10 perf-stat.i.instructions
0.15 -17.9% 0.12 perf-stat.i.ipc
7721 ± 4% -34.5% 5055 ± 5% perf-stat.i.minor-faults
7724 ± 4% -34.5% 5055 ± 5% perf-stat.i.page-faults
2.22 -30.4% 1.55 ± 2% perf-stat.overall.MPKI
0.59 -0.2 0.41 ± 3% perf-stat.overall.branch-miss-rate%
11.72 ± 2% -5.7 6.05 ± 3% perf-stat.overall.cache-miss-rate%
6.69 +22.9% 8.21 perf-stat.overall.cpi
3010 ± 2% +76.8% 5321 ± 3% perf-stat.overall.cycles-between-cache-misses
0.15 -18.6% 0.12 perf-stat.overall.ipc
1.911e+10 -11.6% 1.689e+10 perf-stat.ps.branch-instructions
1.129e+08 -38.0% 69941891 ± 3% perf-stat.ps.branch-misses
1.939e+08 -42.8% 1.109e+08 ± 3% perf-stat.ps.cache-misses
1.655e+09 +10.7% 1.833e+09 perf-stat.ps.cache-references
16423 -31.2% 11302 ± 10% perf-stat.ps.context-switches
2129 -77.4% 481.20 ± 4% perf-stat.ps.cpu-migrations
8.731e+10 -17.8% 7.177e+10 perf-stat.ps.instructions
6585 ± 4% -34.4% 4320 ± 5% perf-stat.ps.minor-faults
6587 ± 4% -34.4% 4320 ± 5% perf-stat.ps.page-faults
5.348e+12 -19.0% 4.332e+12 perf-stat.total.instructions
21.08 ± 3% -13.2 7.92 ± 9% perf-profile.calltrace.cycles-pp.raw_rcv.raw_v4_input.ip_protocol_deliver_rcu.ip_local_deliver_finish.ip_local_deliver
20.80 ± 3% -13.1 7.75 ± 10% perf-profile.calltrace.cycles-pp.sock_queue_rcv_skb_reason.raw_rcv.raw_v4_input.ip_protocol_deliver_rcu.ip_local_deliver_finish
18.41 ± 3% -11.8 6.61 ± 10% perf-profile.calltrace.cycles-pp.__sock_queue_rcv_skb.sock_queue_rcv_skb_reason.raw_rcv.raw_v4_input.ip_protocol_deliver_rcu
13.19 ± 5% -9.1 4.08 ± 11% perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__sock_queue_rcv_skb.sock_queue_rcv_skb_reason.raw_rcv.raw_v4_input
10.51 ± 6% -7.5 2.98 ± 12% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.__sock_queue_rcv_skb.sock_queue_rcv_skb_reason.raw_rcv
7.69 ± 12% -7.5 0.20 ±223% perf-profile.calltrace.cycles-pp.__put_partials.kmem_cache_free.raw_recvmsg.inet_recvmsg.sock_recvmsg
8.67 ± 10% -7.4 1.30 ± 47% perf-profile.calltrace.cycles-pp.kmem_cache_free.raw_recvmsg.inet_recvmsg.sock_recvmsg.__sys_recvfrom
7.46 ± 12% -7.3 0.15 ±223% perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__put_partials.kmem_cache_free.raw_recvmsg.inet_recvmsg
7.30 ± 12% -7.2 0.13 ±223% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.__put_partials.kmem_cache_free.raw_recvmsg
8.98 ± 4% -6.8 2.19 ± 13% perf-profile.calltrace.cycles-pp.skb_clone.raw_v4_input.ip_protocol_deliver_rcu.ip_local_deliver_finish.ip_local_deliver
5.22 ± 8% -4.7 0.56 ± 11% perf-profile.calltrace.cycles-pp.kmem_cache_alloc_noprof.skb_clone.raw_v4_input.ip_protocol_deliver_rcu.ip_local_deliver_finish
3.48 ± 2% -2.0 1.46 ± 15% perf-profile.calltrace.cycles-pp.__copy_skb_header.skb_clone.raw_v4_input.ip_protocol_deliver_rcu.ip_local_deliver_finish
2.30 -1.3 1.03 ± 9% perf-profile.calltrace.cycles-pp.sock_def_readable.__sock_queue_rcv_skb.sock_queue_rcv_skb_reason.raw_rcv.raw_v4_input
2.36 -1.2 1.12 ± 8% perf-profile.calltrace.cycles-pp.sk_filter_trim_cap.sock_queue_rcv_skb_reason.raw_rcv.raw_v4_input.ip_protocol_deliver_rcu
29.52 ± 3% -21.5 7.99 ± 18% perf-profile.children.cycles-pp._raw_spin_lock_irqsave
25.26 ± 3% -19.8 5.47 ± 18% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
21.08 ± 3% -13.2 7.92 ± 9% perf-profile.children.cycles-pp.raw_rcv
20.81 ± 3% -13.1 7.75 ± 10% perf-profile.children.cycles-pp.sock_queue_rcv_skb_reason
18.42 ± 3% -11.8 6.62 ± 10% perf-profile.children.cycles-pp.__sock_queue_rcv_skb
9.07 ± 10% -8.4 0.66 ± 51% perf-profile.children.cycles-pp.__put_partials
10.74 ± 8% -7.9 2.88 ± 39% perf-profile.children.cycles-pp.kmem_cache_free
8.99 ± 4% -6.8 2.19 ± 13% perf-profile.children.cycles-pp.skb_clone
5.25 ± 8% -4.6 0.61 ± 10% perf-profile.children.cycles-pp.kmem_cache_alloc_noprof
5.09 ± 8% -4.5 0.58 ± 11% perf-profile.children.cycles-pp.___slab_alloc
4.58 ± 9% -4.4 0.23 ± 22% perf-profile.children.cycles-pp.get_partial_node
3.48 ± 2% -2.0 1.46 ± 15% perf-profile.children.cycles-pp.__copy_skb_header
2.30 -1.3 1.03 ± 9% perf-profile.children.cycles-pp.sock_def_readable
2.36 -1.2 1.12 ± 8% perf-profile.children.cycles-pp.sk_filter_trim_cap
0.11 ± 4% -0.0 0.06 ± 11% perf-profile.children.cycles-pp.ipv4_pktinfo_prepare
0.12 ± 4% -0.0 0.07 ± 8% perf-profile.children.cycles-pp.dst_release
0.13 +0.0 0.16 ± 5% perf-profile.children.cycles-pp.update_process_times
0.15 ± 2% +0.0 0.19 ± 5% perf-profile.children.cycles-pp.tick_nohz_handler
0.15 ± 2% +0.0 0.19 ± 3% perf-profile.children.cycles-pp.__hrtimer_run_queues
0.18 ± 3% +0.0 0.23 ± 6% perf-profile.children.cycles-pp.skb_defer_free_flush
0.21 ± 2% +0.1 0.27 ± 4% perf-profile.children.cycles-pp.hrtimer_interrupt
0.22 +0.1 0.27 ± 4% perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
0.13 ± 8% +0.1 0.19 ± 9% perf-profile.children.cycles-pp.kmalloc_reserve
0.19 ± 5% +0.1 0.27 ± 18% perf-profile.children.cycles-pp.__ip_select_ident
0.00 +0.1 0.10 ± 51% perf-profile.children.cycles-pp.sk_skb_reason_drop
0.00 +0.1 0.11 ± 67% perf-profile.children.cycles-pp.inet_sock_destruct
0.00 +0.1 0.11 ± 66% perf-profile.children.cycles-pp.__sk_destruct
0.23 ± 4% +0.2 0.38 ± 3% perf-profile.children.cycles-pp.run_ksoftirqd
0.23 ± 4% +0.2 0.38 ± 3% perf-profile.children.cycles-pp.smpboot_thread_fn
0.23 ± 3% +0.2 0.43 ± 7% perf-profile.children.cycles-pp.kthread
0.23 ± 3% +0.2 0.43 ± 7% perf-profile.children.cycles-pp.ret_from_fork
0.23 ± 3% +0.2 0.43 ± 7% perf-profile.children.cycles-pp.ret_from_fork_asm
0.24 ± 3% +0.2 0.44 ± 18% perf-profile.children.cycles-pp.dst_destroy
0.00 +0.2 0.21 ± 77% perf-profile.children.cycles-pp.__mmput
0.00 +0.2 0.21 ± 77% perf-profile.children.cycles-pp.exit_mmap
0.00 +0.2 0.22 ± 76% perf-profile.children.cycles-pp.exit_mm
0.00 +0.2 0.22 ± 80% perf-profile.children.cycles-pp.__irq_exit_rcu
0.00 +0.2 0.22 ± 76% perf-profile.children.cycles-pp.__x64_sys_exit_group
0.00 +0.2 0.22 ± 75% perf-profile.children.cycles-pp.do_exit
0.00 +0.2 0.22 ± 75% perf-profile.children.cycles-pp.do_group_exit
0.00 +0.2 0.23 ± 73% perf-profile.children.cycles-pp.x64_sys_call
0.00 +0.2 0.24 ± 79% perf-profile.children.cycles-pp.acpi_idle_do_entry
0.00 +0.2 0.24 ± 79% perf-profile.children.cycles-pp.acpi_idle_enter
0.00 +0.2 0.24 ± 79% perf-profile.children.cycles-pp.acpi_safe_halt
0.00 +0.2 0.24 ± 79% perf-profile.children.cycles-pp.cpuidle_enter
0.00 +0.2 0.24 ± 79% perf-profile.children.cycles-pp.cpuidle_enter_state
0.00 +0.2 0.24 ± 79% perf-profile.children.cycles-pp.pv_native_safe_halt
0.00 +0.2 0.24 ± 78% perf-profile.children.cycles-pp.cpuidle_idle_call
0.00 +0.2 0.24 ± 78% perf-profile.children.cycles-pp.start_secondary
0.00 +0.2 0.24 ± 77% perf-profile.children.cycles-pp.common_startup_64
0.00 +0.2 0.24 ± 77% perf-profile.children.cycles-pp.cpu_startup_entry
0.00 +0.2 0.24 ± 77% perf-profile.children.cycles-pp.do_idle
0.23 ± 2% +0.3 0.49 ± 37% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
0.25 +0.3 0.52 ± 36% perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
0.24 ± 4% +0.4 0.60 ± 29% perf-profile.children.cycles-pp.rcu_do_batch
0.24 ± 4% +0.4 0.60 ± 29% perf-profile.children.cycles-pp.rcu_core
0.86 ± 3% +0.7 1.52 ± 41% perf-profile.children.cycles-pp.__slab_free
25.25 ± 3% -19.8 5.46 ± 18% perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
3.47 ± 2% -2.0 1.45 ± 15% perf-profile.self.cycles-pp.__copy_skb_header
4.26 -1.7 2.53 ± 18% perf-profile.self.cycles-pp._raw_spin_lock_irqsave
2.87 -1.4 1.48 ± 8% perf-profile.self.cycles-pp.__sock_queue_rcv_skb
2.29 -1.3 1.03 ± 9% perf-profile.self.cycles-pp.sock_def_readable
2.34 -1.2 1.12 ± 8% perf-profile.self.cycles-pp.sk_filter_trim_cap
0.46 ± 2% -0.3 0.16 ± 16% perf-profile.self.cycles-pp.get_partial_node
0.50 ± 2% -0.2 0.34 ± 4% perf-profile.self.cycles-pp.___slab_alloc
0.31 -0.1 0.16 ± 11% perf-profile.self.cycles-pp.kmem_cache_alloc_noprof
0.28 -0.1 0.16 ± 8% perf-profile.self.cycles-pp.skb_clone
0.10 ± 4% -0.0 0.06 ± 9% perf-profile.self.cycles-pp.dst_release
0.06 +0.0 0.08 ± 9% perf-profile.self.cycles-pp.kmem_cache_alloc_node_noprof
0.12 ± 3% +0.0 0.15 ± 7% perf-profile.self.cycles-pp.skb_defer_free_flush
0.00 +0.1 0.05 ± 7% perf-profile.self.cycles-pp.ktime_get
44.51 +26.8 71.33 ± 8% perf-profile.self.cycles-pp.raw_v4_input
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [linux-next:master] [mm] f66e2727dd: stress-ng.rawsock.ops_per_sec 46.9% regression
2025-11-25 14:46 [linux-next:master] [mm] f66e2727dd: stress-ng.rawsock.ops_per_sec 46.9% regression kernel test robot
@ 2025-11-26 9:49 ` David Hildenbrand (Red Hat)
2025-12-01 20:56 ` Sid Kumar
0 siblings, 1 reply; 4+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-26 9:49 UTC (permalink / raw)
To: kernel test robot, Kefeng Wang
Cc: oe-lkp, lkp, Andrew Morton, Sidhartha Kumar, Lorenzo Stoakes,
Baolin Wang, Barry Song, Dev Jain, Lance Yang, Liam Howlett,
Ryan Roberts, Zi Yan, linux-mm
On 11/25/25 15:46, kernel test robot wrote:
>
>
> Hello,
>
> kernel test robot noticed a 46.9% regression of stress-ng.rawsock.ops_per_sec on:
>
>
> commit: f66e2727ddfcbbe3dbb459e809824f721a914464 ("mm: huge_memory: use folio_can_map_prot_numa() for pmd folio")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
Unexpected, but maybe simply a symptom of doing the right thing?
"which skips unsuitable folio, i.e. zone device, shared folios (KSM,
CoW), non-movable dma pinned, dirty file folio and folios that already
have the expected node affinity."
I suspect skipping shared folios or dirty file folios might make the
difference. The benchmark results would be misleading in that case: we
shouldn't have been migrating these pages in the first place.
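To make the list in the commit message concrete, the filtering being
described looks roughly like the sketch below (illustrative only -- the
helper name, the target_nid parameter and the exact ordering are my
assumptions; the real folio_can_map_prot_numa() in linux-next may differ):

static bool can_map_prot_numa_sketch(struct folio *folio,
				     struct vm_area_struct *vma,
				     int target_nid)
{
	/* Device memory cannot be migrated by NUMA hinting. */
	if (folio_is_zone_device(folio))
		return false;

	/* KSM folios are shared across processes. */
	if (folio_test_ksm(folio))
		return false;

	/* Shared copy-on-write folios. */
	if (is_cow_mapping(vma->vm_flags) &&
	    folio_maybe_mapped_shared(folio))
		return false;

	/* Folios that may be pinned for DMA cannot be migrated. */
	if (folio_maybe_dma_pinned(folio))
		return false;

	/* Dirty file-backed folios would need writeback before migration. */
	if (!folio_test_anon(folio) && folio_test_dirty(folio))
		return false;

	/* Already on the node a hinting fault would migrate it to. */
	if (folio_nid(folio) == target_nid)
		return false;

	return true;
}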
--
Cheers
David
* Re: [linux-next:master] [mm] f66e2727dd: stress-ng.rawsock.ops_per_sec 46.9% regression
2025-11-26 9:49 ` David Hildenbrand (Red Hat)
@ 2025-12-01 20:56 ` Sid Kumar
2025-12-01 21:13 ` David Hildenbrand (Red Hat)
0 siblings, 1 reply; 4+ messages in thread
From: Sid Kumar @ 2025-12-01 20:56 UTC (permalink / raw)
To: David Hildenbrand (Red Hat), kernel test robot, Kefeng Wang
Cc: oe-lkp, lkp, Andrew Morton, Lorenzo Stoakes, Baolin Wang,
Barry Song, Dev Jain, Lance Yang, Liam Howlett, Ryan Roberts,
Zi Yan, linux-mm
On 11/26/25 3:49 AM, David Hildenbrand (Red Hat) wrote:
> On 11/25/25 15:46, kernel test robot wrote:
>>
>>
>> Hello,
>>
>> kernel test robot noticed a 46.9% regression of
>> stress-ng.rawsock.ops_per_sec on:
>>
>>
>> commit: f66e2727ddfcbbe3dbb459e809824f721a914464 ("mm: huge_memory:
>> use folio_can_map_prot_numa() for pmd folio")
>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>
> Unexpected, but maybe simply a symptom of doing the right thing?
>
> "which skips unsuitable folio, i.e. zone device, shared folios (KSM,
> CoW), non-movable dma pinned, dirty file folio and folios that already
> have the expected node affinity."
>
> I suspect skipping shared folios or dirty file folios might make the
> difference. The benchmark results would be misleading in that case: we
> shouldn't have been migrating these pages in the first place.
Reproducing the benchmark and adding prints to identify which condition
makes folio_can_map_prot_numa() return false shows that:
	/* Also skip shared copy-on-write folios */
	if (is_cow_mapping(vma->vm_flags) &&
	    folio_maybe_mapped_shared(folio)) {
		printk("false at is_Cow_mapping\n");
		return false;
	}
virtme-ng% dmesg | grep is_Cow_mapping | wc -l
25302
is the check that now triggers and leads to the regression.
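For context, is_cow_mapping() boils down to checking for a private,
possibly-writable mapping -- roughly the helper from include/linux/mm.h
(quoted from memory, so double-check the tree):

static inline bool is_cow_mapping(vm_flags_t flags)
{
	/* Private (non-VM_SHARED) mapping that may be made writable. */
	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

So any MAP_PRIVATE mapping whose folios are still mapped by more than one
process -- presumably anonymous memory shared between the forked stress-ng
workers here -- is now skipped by the prot_numa update.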
Thanks,
Sid
* Re: [linux-next:master] [mm] f66e2727dd: stress-ng.rawsock.ops_per_sec 46.9% regression
2025-12-01 20:56 ` Sid Kumar
@ 2025-12-01 21:13 ` David Hildenbrand (Red Hat)
0 siblings, 0 replies; 4+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-01 21:13 UTC (permalink / raw)
To: Sid Kumar, kernel test robot, Kefeng Wang
Cc: oe-lkp, lkp, Andrew Morton, Lorenzo Stoakes, Baolin Wang,
Barry Song, Dev Jain, Lance Yang, Liam Howlett, Ryan Roberts,
Zi Yan, linux-mm
On 12/1/25 21:56, Sid Kumar wrote:
>
> On 11/26/25 3:49 AM, David Hildenbrand (Red Hat) wrote:
>> On 11/25/25 15:46, kernel test robot wrote:
>>>
>>>
>>> Hello,
>>>
>>> kernel test robot noticed a 46.9% regression of
>>> stress-ng.rawsock.ops_per_sec on:
>>>
>>>
>>> commit: f66e2727ddfcbbe3dbb459e809824f721a914464 ("mm: huge_memory:
>>> use folio_can_map_prot_numa() for pmd folio")
>>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>>
>> Unexpected, but maybe simply a symptom of doing the right thing?
>>
>> "which skips unsuitable folio, i.e. zone device, shared folios (KSM,
>> CoW), non-movable dma pinned, dirty file folio and folios that already
>> have the expected node affinity."
>>
>> I suspect skipping shared folios or dirty file folios might make the
>> difference. The benchmark results would be misleading in that case: we
>> shouldn't have been migrating these pages in the first place.
>
>
> Reproducing the benchmark and adding prints to identify which condition
> makes folio_can_map_prot_numa() return false shows that:
>
> 	/* Also skip shared copy-on-write folios */
> 	if (is_cow_mapping(vma->vm_flags) &&
> 	    folio_maybe_mapped_shared(folio)) {
> 		printk("false at is_Cow_mapping\n");
> 		return false;
> 	}
>
> virtme-ng% dmesg | grep is_Cow_mapping | wc -l
> 25302
>
> is the check that now triggers and leads to the regression.
Okay, as I thought, it's rather a case of "doing the right thing". At
least we're now doing the same thing we do during PTE faults :)
This check dates back to:
commit 859d4adc3415a64ccb8b0c50dc4e3a888dcb5805
Author: Henry Willard <henry.willard@oracle.com>
Date: Wed Jan 31 16:21:07 2018 -0800
mm: numa: do not trap faults on shared data section pages.
Workloads consisting of a large number of processes running the same
program with a very large shared data segment may experience performance
problems when numa balancing attempts to migrate the shared cow pages.
This manifests itself with many processes or tasks in
TASK_UNINTERRUPTIBLE state waiting for the shared pages to be migrated.
The program listed below simulates the conditions with these results
when run with 288 processes on a 144 core/8 socket machine.
  Average throughput       Average throughput       Average throughput
  with numa_balancing=0    with numa_balancing=1    with numa_balancing=1
                           without the patch        with the patch
  ---------------------    ---------------------    ---------------------
      2118782                  2021534                  2107979
Complex production environments show less variability and fewer poorly
performing outliers accompanied with a smaller number of processes
waiting on NUMA page migration with this patch applied. In some cases,
%iowait drops from 16%-26% to 0.
I think the reproducer would actually not care about anonymous folios, but
I'm not sure whether that would make a difference for the benchmark here.
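If that reading is right, one untested way to express it would be to keep
the skip only for file-backed folios -- a sketch only, reusing the names
from the snippet Sid quoted:

	/* Keep skipping shared-CoW folios only when they are file-backed. */
	if (!folio_test_anon(folio) &&
	    is_cow_mapping(vma->vm_flags) &&
	    folio_maybe_mapped_shared(folio))
		return false;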
--
Cheers
David