* [linux-next:master] [mm] f66e2727dd: stress-ng.rawsock.ops_per_sec 46.9% regression
@ 2025-11-25 14:46 kernel test robot
2025-11-26 9:49 ` David Hildenbrand (Red Hat)
0 siblings, 1 reply; 4+ messages in thread
From: kernel test robot @ 2025-11-25 14:46 UTC (permalink / raw)
To: Kefeng Wang
Cc: oe-lkp, lkp, Andrew Morton, David Hildenbrand, Sidhartha Kumar,
Lorenzo Stoakes, Baolin Wang, Barry Song, Dev Jain, Lance Yang,
Liam Howlett, Ryan Roberts, Zi Yan, linux-mm, oliver.sang
Hello,
kernel test robot noticed a 46.9% regression of stress-ng.rawsock.ops_per_sec on:
commit: f66e2727ddfcbbe3dbb459e809824f721a914464 ("mm: huge_memory: use folio_can_map_prot_numa() for pmd folio")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
[still regression on linux-next/master d724c6f85e80a23ed46b7ebc6e38b527c09d64f5]
testcase: stress-ng
config: x86_64-rhel-9.4
compiler: gcc-14
test machine: 192 threads 2 sockets Intel(R) Xeon(R) 6740E CPU @ 2.4GHz (Sierra Forest) with 256G memory
parameters:
nr_threads: 100%
testtime: 60s
test: rawsock
cpufreq_governor: performance
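The job above roughly corresponds to the following stress-ng invocation (an
assumption inferred from the parameters listed; the rawsock stressor needs
CAP_NET_RAW, and the LKP harness adds further bookkeeping options):

  stress-ng --rawsock 0 --timeout 60s --metrics-brief

where "--rawsock 0" starts one stressor instance per online CPU
(nr_threads: 100%).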
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202511252236.33b899b3-lkp@intel.com
Details are as below:
-------------------------------------------------------------------------------------------------->
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20251125/202511252236.33b899b3-lkp@intel.com
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
gcc-14/performance/x86_64-rhel-9.4/100%/debian-13-x86_64-20250902.cgz/lkp-srf-2sp3/rawsock/stress-ng/60s
commit:
ca43034cdb ("mm: mprotect: convert to folio_can_map_prot_numa()")
f66e2727dd ("mm: huge_memory: use folio_can_map_prot_numa() for pmd folio")
ca43034cdb224131 f66e2727ddfcbbe3dbb459e8098
---------------- ---------------------------
%stddev %change %stddev
\ | \
1569273 ± 5% -10.8% 1399094 ± 6% numa-numastat.node1.numa_hit
5.156e+08 ± 9% -27.6% 3.735e+08 ± 5% cpuidle..time
376420 ± 5% -56.5% 163676 ± 5% cpuidle..usage
256283 ± 29% -64.8% 90309 ± 60% numa-meminfo.node1.Mapped
774069 ± 15% -58.8% 319020 ± 47% numa-meminfo.node1.Shmem
89117 ± 17% -62.6% 33337 ± 50% perf-c2c.DRAM.remote
78134 ± 16% -62.4% 29353 ± 50% perf-c2c.HITM.remote
78.45 ± 33% -100.0% 0.01 ±141% numa-vmstat.node0.nr_isolated_anon
65056 ± 29% -64.4% 23147 ± 59% numa-vmstat.node1.nr_mapped
197053 ± 15% -59.0% 80737 ± 47% numa-vmstat.node1.nr_shmem
1568725 ± 5% -10.3% 1407667 ± 6% numa-vmstat.node1.numa_hit
1608874 ± 2% -30.7% 1115409 ± 6% meminfo.Active
1608858 ± 2% -30.7% 1115393 ± 6% meminfo.Active(anon)
4549412 -10.4% 4076550 meminfo.Cached
368664 ± 8% -51.2% 179972 ± 36% meminfo.Mapped
870107 ± 4% -54.4% 397179 ± 19% meminfo.Shmem
0.70 ± 4% -0.7 0.00 mpstat.cpu.all.iowait%
0.24 +0.0 0.28 ± 3% mpstat.cpu.all.irq%
72.55 +13.7 86.20 mpstat.cpu.all.soft%
22.29 -11.8 10.51 ± 2% mpstat.cpu.all.sys%
0.88 ± 3% -0.5 0.39 ± 3% mpstat.cpu.all.usr%
46.33 ± 14% -78.8% 9.83 ± 21% mpstat.max_utilization.seconds
12.47 ± 5% +41.6% 17.65 ± 5% perf-sched.sch_delay.avg.ms.[unknown].[unknown].[unknown].[unknown].[unknown]
12.47 ± 5% +41.6% 17.65 ± 5% perf-sched.total_sch_delay.average.ms
59.96 ± 4% +40.2% 84.05 ± 5% perf-sched.total_wait_and_delay.average.ms
47.50 ± 3% +39.8% 66.40 ± 5% perf-sched.total_wait_time.average.ms
59.96 ± 4% +40.2% 84.05 ± 5% perf-sched.wait_and_delay.avg.ms.[unknown].[unknown].[unknown].[unknown].[unknown]
47.50 ± 3% +39.8% 66.40 ± 5% perf-sched.wait_time.avg.ms.[unknown].[unknown].[unknown].[unknown].[unknown]
4.18 ± 9% -1.2 2.97 ± 3% turbostat.C1%
4.16 ± 10% -28.8% 2.96 ± 3% turbostat.CPU%c1
0.15 -20.0% 0.12 turbostat.IPC
30643618 ± 6% -33.8% 20273146 ± 3% turbostat.IRQ
9235121 ± 11% -53.5% 4290976 ± 8% turbostat.NMI
20.28 -5.2% 19.23 turbostat.RAMWatt
2.80 ± 2% -84.0% 0.45 ± 12% stress-ng.rawsock.MB_recv'd_per_sec
1.263e+09 ± 2% -46.9% 6.705e+08 ± 4% stress-ng.rawsock.ops
21068817 ± 2% -46.9% 11184756 ± 4% stress-ng.rawsock.ops_per_sec
615583 -27.5% 446065 ± 9% stress-ng.time.involuntary_context_switches
168521 -52.0% 80837 stress-ng.time.minor_page_faults
4458 -53.7% 2062 ± 3% stress-ng.time.percent_of_cpu_this_job_got
2605 -54.0% 1199 ± 3% stress-ng.time.system_time
73.90 ± 2% -46.3% 39.67 ± 4% stress-ng.time.user_time
106021 ± 2% -81.2% 19900 ± 7% stress-ng.time.voluntary_context_switches
404635 ± 3% -31.0% 279044 ± 7% proc-vmstat.nr_active_anon
185837 -2.9% 180494 proc-vmstat.nr_anon_pages
1139738 -10.6% 1019325 proc-vmstat.nr_file_pages
153.81 ± 11% -100.0% 0.01 ±152% proc-vmstat.nr_isolated_anon
92592 ± 8% -51.0% 45333 ± 34% proc-vmstat.nr_mapped
219912 ± 5% -54.8% 99481 ± 19% proc-vmstat.nr_shmem
38659 -2.0% 37872 proc-vmstat.nr_slab_reclaimable
148828 +4.3% 155195 proc-vmstat.nr_slab_unreclaimable
404635 ± 3% -31.0% 279044 ± 7% proc-vmstat.nr_zone_active_anon
21570 ± 18% -61.8% 8245 ± 47% proc-vmstat.numa_hint_faults
16919 ± 21% -56.8% 7311 ± 51% proc-vmstat.numa_hint_faults_local
963.83 ± 7% -100.0% 0.00 proc-vmstat.numa_huge_pte_updates
111261 ± 3% -99.2% 909.67 ± 20% proc-vmstat.numa_pages_migrated
526245 ± 7% -97.7% 12122 ± 45% proc-vmstat.numa_pte_updates
535799 -22.9% 413207 ± 2% proc-vmstat.pgfault
111261 ± 3% -99.2% 909.67 ± 20% proc-vmstat.pgmigrate_success
21679 ± 7% -9.8% 19545 ± 2% proc-vmstat.pgreuse
1799808 -57.1% 772136 ± 9% sched_debug.cfs_rq:/.avg_vruntime.avg
2946020 ± 8% +251.1% 10342560 ± 10% sched_debug.cfs_rq:/.avg_vruntime.max
1006872 ± 8% -87.4% 126618 ± 18% sched_debug.cfs_rq:/.avg_vruntime.min
356117 ± 4% +262.9% 1292236 ± 17% sched_debug.cfs_rq:/.avg_vruntime.stddev
64.34 ± 33% +59.2% 102.44 ± 8% sched_debug.cfs_rq:/.load_avg.avg
1799808 -57.1% 772137 ± 9% sched_debug.cfs_rq:/.min_vruntime.avg
2946020 ± 8% +251.1% 10342560 ± 10% sched_debug.cfs_rq:/.min_vruntime.max
1006872 ± 8% -87.4% 126618 ± 18% sched_debug.cfs_rq:/.min_vruntime.min
356117 ± 4% +262.9% 1292236 ± 17% sched_debug.cfs_rq:/.min_vruntime.stddev
1008 ± 8% +15.7% 1166 sched_debug.cfs_rq:/.runnable_avg.avg
3848 ± 74% -67.7% 1243 ± 13% sched_debug.cfs_rq:/.util_avg.max
8.67 ±213% +2513.5% 226.50 ± 32% sched_debug.cfs_rq:/.util_avg.min
431.22 ± 40% -66.2% 145.58 ± 13% sched_debug.cfs_rq:/.util_avg.stddev
808.82 ± 11% -52.0% 387.95 ± 6% sched_debug.cfs_rq:/.util_est.avg
1337827 ± 3% -26.5% 983709 ± 2% sched_debug.cpu.avg_idle.avg
197988 ± 29% -97.1% 5680 ± 16% sched_debug.cpu.avg_idle.min
483058 ± 8% +27.5% 616103 ± 4% sched_debug.cpu.avg_idle.stddev
1721 ± 3% +243.6% 5914 ± 9% sched_debug.cpu.clock_task.stddev
1029615 -42.2% 594917 ± 2% sched_debug.cpu.max_idle_balance_cost.avg
1448914 ± 2% -11.7% 1279232 ± 6% sched_debug.cpu.max_idle_balance_cost.max
931867 -46.3% 500000 sched_debug.cpu.max_idle_balance_cost.min
99448 ± 18% +54.7% 153845 ± 9% sched_debug.cpu.max_idle_balance_cost.stddev
4446 -17.7% 3660 ± 5% sched_debug.cpu.nr_switches.avg
59305 ± 8% +23.6% 73291 ± 17% sched_debug.cpu.nr_switches.max
2177 ± 3% -47.3% 1147 ± 6% sched_debug.cpu.nr_switches.min
5185 ± 8% +26.1% 6537 ± 13% sched_debug.cpu.nr_switches.stddev
-99.50 -70.2% -29.67 sched_debug.cpu.nr_uninterruptible.min
30.41 ± 6% -54.1% 13.96 ± 30% sched_debug.cpu.nr_uninterruptible.stddev
0.18 ± 24% +59.1% 0.29 ± 4% sched_debug.dl_rq:.dl_nr_running.avg
2.23 -30.8% 1.54 ± 2% perf-stat.i.MPKI
1.973e+10 -11.8% 1.74e+10 perf-stat.i.branch-instructions
0.59 -0.2 0.41 ± 3% perf-stat.i.branch-miss-rate%
1.171e+08 -38.6% 71869487 ± 3% perf-stat.i.branch-misses
11.83 ± 2% -5.6 6.24 ± 2% perf-stat.i.cache-miss-rate%
2.013e+08 ± 2% -43.3% 1.14e+08 ± 3% perf-stat.i.cache-misses
1.706e+09 +10.7% 1.889e+09 perf-stat.i.cache-references
17575 -32.4% 11879 ± 9% perf-stat.i.context-switches
6.69 +22.6% 8.21 perf-stat.i.cpi
2421 ± 3% -76.6% 567.43 ± 6% perf-stat.i.cpu-migrations
3014 ± 2% +82.0% 5485 ± 4% perf-stat.i.cycles-between-cache-misses
9.022e+10 -18.1% 7.391e+10 perf-stat.i.instructions
0.15 -17.9% 0.12 perf-stat.i.ipc
7721 ± 4% -34.5% 5055 ± 5% perf-stat.i.minor-faults
7724 ± 4% -34.5% 5055 ± 5% perf-stat.i.page-faults
2.22 -30.4% 1.55 ± 2% perf-stat.overall.MPKI
0.59 -0.2 0.41 ± 3% perf-stat.overall.branch-miss-rate%
11.72 ± 2% -5.7 6.05 ± 3% perf-stat.overall.cache-miss-rate%
6.69 +22.9% 8.21 perf-stat.overall.cpi
3010 ± 2% +76.8% 5321 ± 3% perf-stat.overall.cycles-between-cache-misses
0.15 -18.6% 0.12 perf-stat.overall.ipc
1.911e+10 -11.6% 1.689e+10 perf-stat.ps.branch-instructions
1.129e+08 -38.0% 69941891 ± 3% perf-stat.ps.branch-misses
1.939e+08 -42.8% 1.109e+08 ± 3% perf-stat.ps.cache-misses
1.655e+09 +10.7% 1.833e+09 perf-stat.ps.cache-references
16423 -31.2% 11302 ± 10% perf-stat.ps.context-switches
2129 -77.4% 481.20 ± 4% perf-stat.ps.cpu-migrations
8.731e+10 -17.8% 7.177e+10 perf-stat.ps.instructions
6585 ± 4% -34.4% 4320 ± 5% perf-stat.ps.minor-faults
6587 ± 4% -34.4% 4320 ± 5% perf-stat.ps.page-faults
5.348e+12 -19.0% 4.332e+12 perf-stat.total.instructions
21.08 ± 3% -13.2 7.92 ± 9% perf-profile.calltrace.cycles-pp.raw_rcv.raw_v4_input.ip_protocol_deliver_rcu.ip_local_deliver_finish.ip_local_deliver
20.80 ± 3% -13.1 7.75 ± 10% perf-profile.calltrace.cycles-pp.sock_queue_rcv_skb_reason.raw_rcv.raw_v4_input.ip_protocol_deliver_rcu.ip_local_deliver_finish
18.41 ± 3% -11.8 6.61 ± 10% perf-profile.calltrace.cycles-pp.__sock_queue_rcv_skb.sock_queue_rcv_skb_reason.raw_rcv.raw_v4_input.ip_protocol_deliver_rcu
13.19 ± 5% -9.1 4.08 ± 11% perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__sock_queue_rcv_skb.sock_queue_rcv_skb_reason.raw_rcv.raw_v4_input
10.51 ± 6% -7.5 2.98 ± 12% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.__sock_queue_rcv_skb.sock_queue_rcv_skb_reason.raw_rcv
7.69 ± 12% -7.5 0.20 ±223% perf-profile.calltrace.cycles-pp.__put_partials.kmem_cache_free.raw_recvmsg.inet_recvmsg.sock_recvmsg
8.67 ± 10% -7.4 1.30 ± 47% perf-profile.calltrace.cycles-pp.kmem_cache_free.raw_recvmsg.inet_recvmsg.sock_recvmsg.__sys_recvfrom
7.46 ± 12% -7.3 0.15 ±223% perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__put_partials.kmem_cache_free.raw_recvmsg.inet_recvmsg
7.30 ± 12% -7.2 0.13 ±223% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.__put_partials.kmem_cache_free.raw_recvmsg
8.98 ± 4% -6.8 2.19 ± 13% perf-profile.calltrace.cycles-pp.skb_clone.raw_v4_input.ip_protocol_deliver_rcu.ip_local_deliver_finish.ip_local_deliver
5.22 ± 8% -4.7 0.56 ± 11% perf-profile.calltrace.cycles-pp.kmem_cache_alloc_noprof.skb_clone.raw_v4_input.ip_protocol_deliver_rcu.ip_local_deliver_finish
3.48 ± 2% -2.0 1.46 ± 15% perf-profile.calltrace.cycles-pp.__copy_skb_header.skb_clone.raw_v4_input.ip_protocol_deliver_rcu.ip_local_deliver_finish
2.30 -1.3 1.03 ± 9% perf-profile.calltrace.cycles-pp.sock_def_readable.__sock_queue_rcv_skb.sock_queue_rcv_skb_reason.raw_rcv.raw_v4_input
2.36 -1.2 1.12 ± 8% perf-profile.calltrace.cycles-pp.sk_filter_trim_cap.sock_queue_rcv_skb_reason.raw_rcv.raw_v4_input.ip_protocol_deliver_rcu
29.52 ± 3% -21.5 7.99 ± 18% perf-profile.children.cycles-pp._raw_spin_lock_irqsave
25.26 ± 3% -19.8 5.47 ± 18% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
21.08 ± 3% -13.2 7.92 ± 9% perf-profile.children.cycles-pp.raw_rcv
20.81 ± 3% -13.1 7.75 ± 10% perf-profile.children.cycles-pp.sock_queue_rcv_skb_reason
18.42 ± 3% -11.8 6.62 ± 10% perf-profile.children.cycles-pp.__sock_queue_rcv_skb
9.07 ± 10% -8.4 0.66 ± 51% perf-profile.children.cycles-pp.__put_partials
10.74 ± 8% -7.9 2.88 ± 39% perf-profile.children.cycles-pp.kmem_cache_free
8.99 ± 4% -6.8 2.19 ± 13% perf-profile.children.cycles-pp.skb_clone
5.25 ± 8% -4.6 0.61 ± 10% perf-profile.children.cycles-pp.kmem_cache_alloc_noprof
5.09 ± 8% -4.5 0.58 ± 11% perf-profile.children.cycles-pp.___slab_alloc
4.58 ± 9% -4.4 0.23 ± 22% perf-profile.children.cycles-pp.get_partial_node
3.48 ± 2% -2.0 1.46 ± 15% perf-profile.children.cycles-pp.__copy_skb_header
2.30 -1.3 1.03 ± 9% perf-profile.children.cycles-pp.sock_def_readable
2.36 -1.2 1.12 ± 8% perf-profile.children.cycles-pp.sk_filter_trim_cap
0.11 ± 4% -0.0 0.06 ± 11% perf-profile.children.cycles-pp.ipv4_pktinfo_prepare
0.12 ± 4% -0.0 0.07 ± 8% perf-profile.children.cycles-pp.dst_release
0.13 +0.0 0.16 ± 5% perf-profile.children.cycles-pp.update_process_times
0.15 ± 2% +0.0 0.19 ± 5% perf-profile.children.cycles-pp.tick_nohz_handler
0.15 ± 2% +0.0 0.19 ± 3% perf-profile.children.cycles-pp.__hrtimer_run_queues
0.18 ± 3% +0.0 0.23 ± 6% perf-profile.children.cycles-pp.skb_defer_free_flush
0.21 ± 2% +0.1 0.27 ± 4% perf-profile.children.cycles-pp.hrtimer_interrupt
0.22 +0.1 0.27 ± 4% perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
0.13 ± 8% +0.1 0.19 ± 9% perf-profile.children.cycles-pp.kmalloc_reserve
0.19 ± 5% +0.1 0.27 ± 18% perf-profile.children.cycles-pp.__ip_select_ident
0.00 +0.1 0.10 ± 51% perf-profile.children.cycles-pp.sk_skb_reason_drop
0.00 +0.1 0.11 ± 67% perf-profile.children.cycles-pp.inet_sock_destruct
0.00 +0.1 0.11 ± 66% perf-profile.children.cycles-pp.__sk_destruct
0.23 ± 4% +0.2 0.38 ± 3% perf-profile.children.cycles-pp.run_ksoftirqd
0.23 ± 4% +0.2 0.38 ± 3% perf-profile.children.cycles-pp.smpboot_thread_fn
0.23 ± 3% +0.2 0.43 ± 7% perf-profile.children.cycles-pp.kthread
0.23 ± 3% +0.2 0.43 ± 7% perf-profile.children.cycles-pp.ret_from_fork
0.23 ± 3% +0.2 0.43 ± 7% perf-profile.children.cycles-pp.ret_from_fork_asm
0.24 ± 3% +0.2 0.44 ± 18% perf-profile.children.cycles-pp.dst_destroy
0.00 +0.2 0.21 ± 77% perf-profile.children.cycles-pp.__mmput
0.00 +0.2 0.21 ± 77% perf-profile.children.cycles-pp.exit_mmap
0.00 +0.2 0.22 ± 76% perf-profile.children.cycles-pp.exit_mm
0.00 +0.2 0.22 ± 80% perf-profile.children.cycles-pp.__irq_exit_rcu
0.00 +0.2 0.22 ± 76% perf-profile.children.cycles-pp.__x64_sys_exit_group
0.00 +0.2 0.22 ± 75% perf-profile.children.cycles-pp.do_exit
0.00 +0.2 0.22 ± 75% perf-profile.children.cycles-pp.do_group_exit
0.00 +0.2 0.23 ± 73% perf-profile.children.cycles-pp.x64_sys_call
0.00 +0.2 0.24 ± 79% perf-profile.children.cycles-pp.acpi_idle_do_entry
0.00 +0.2 0.24 ± 79% perf-profile.children.cycles-pp.acpi_idle_enter
0.00 +0.2 0.24 ± 79% perf-profile.children.cycles-pp.acpi_safe_halt
0.00 +0.2 0.24 ± 79% perf-profile.children.cycles-pp.cpuidle_enter
0.00 +0.2 0.24 ± 79% perf-profile.children.cycles-pp.cpuidle_enter_state
0.00 +0.2 0.24 ± 79% perf-profile.children.cycles-pp.pv_native_safe_halt
0.00 +0.2 0.24 ± 78% perf-profile.children.cycles-pp.cpuidle_idle_call
0.00 +0.2 0.24 ± 78% perf-profile.children.cycles-pp.start_secondary
0.00 +0.2 0.24 ± 77% perf-profile.children.cycles-pp.common_startup_64
0.00 +0.2 0.24 ± 77% perf-profile.children.cycles-pp.cpu_startup_entry
0.00 +0.2 0.24 ± 77% perf-profile.children.cycles-pp.do_idle
0.23 ± 2% +0.3 0.49 ± 37% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
0.25 +0.3 0.52 ± 36% perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
0.24 ± 4% +0.4 0.60 ± 29% perf-profile.children.cycles-pp.rcu_do_batch
0.24 ± 4% +0.4 0.60 ± 29% perf-profile.children.cycles-pp.rcu_core
0.86 ± 3% +0.7 1.52 ± 41% perf-profile.children.cycles-pp.__slab_free
25.25 ± 3% -19.8 5.46 ± 18% perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
3.47 ± 2% -2.0 1.45 ± 15% perf-profile.self.cycles-pp.__copy_skb_header
4.26 -1.7 2.53 ± 18% perf-profile.self.cycles-pp._raw_spin_lock_irqsave
2.87 -1.4 1.48 ± 8% perf-profile.self.cycles-pp.__sock_queue_rcv_skb
2.29 -1.3 1.03 ± 9% perf-profile.self.cycles-pp.sock_def_readable
2.34 -1.2 1.12 ± 8% perf-profile.self.cycles-pp.sk_filter_trim_cap
0.46 ± 2% -0.3 0.16 ± 16% perf-profile.self.cycles-pp.get_partial_node
0.50 ± 2% -0.2 0.34 ± 4% perf-profile.self.cycles-pp.___slab_alloc
0.31 -0.1 0.16 ± 11% perf-profile.self.cycles-pp.kmem_cache_alloc_noprof
0.28 -0.1 0.16 ± 8% perf-profile.self.cycles-pp.skb_clone
0.10 ± 4% -0.0 0.06 ± 9% perf-profile.self.cycles-pp.dst_release
0.06 +0.0 0.08 ± 9% perf-profile.self.cycles-pp.kmem_cache_alloc_node_noprof
0.12 ± 3% +0.0 0.15 ± 7% perf-profile.self.cycles-pp.skb_defer_free_flush
0.00 +0.1 0.05 ± 7% perf-profile.self.cycles-pp.ktime_get
44.51 +26.8 71.33 ± 8% perf-profile.self.cycles-pp.raw_v4_input
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [linux-next:master] [mm] f66e2727dd: stress-ng.rawsock.ops_per_sec 46.9% regression
2025-11-25 14:46 [linux-next:master] [mm] f66e2727dd: stress-ng.rawsock.ops_per_sec 46.9% regression kernel test robot
@ 2025-11-26 9:49 ` David Hildenbrand (Red Hat)
2025-12-01 20:56 ` Sid Kumar
0 siblings, 1 reply; 4+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-26 9:49 UTC (permalink / raw)
To: kernel test robot, Kefeng Wang
Cc: oe-lkp, lkp, Andrew Morton, Sidhartha Kumar, Lorenzo Stoakes,
Baolin Wang, Barry Song, Dev Jain, Lance Yang, Liam Howlett,
Ryan Roberts, Zi Yan, linux-mm
On 11/25/25 15:46, kernel test robot wrote:
>
>
> Hello,
>
> kernel test robot noticed a 46.9% regression of stress-ng.rawsock.ops_per_sec on:
>
>
> commit: f66e2727ddfcbbe3dbb459e809824f721a914464 ("mm: huge_memory: use folio_can_map_prot_numa() for pmd folio")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
Unexpected, but maybe simply a symptom of doing the right thing?
"which skips unsuitable folio, i.e. zone device, shared folios (KSM,
CoW), non-movable dma pinned, dirty file folio and folios that already
have the expected node affinity."
I suspect skipping shared folios or dirty file folios might make the
difference. The benchmark results would be misleading in that case: we
shouldn't have been migrating these pages in the first place.
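To make the list in the commit message concrete, the filtering being
described looks roughly like the sketch below (illustrative only -- the
helper name, the target_nid parameter and the exact ordering are my
assumptions; the real folio_can_map_prot_numa() in linux-next may differ):

static bool can_map_prot_numa_sketch(struct folio *folio,
				     struct vm_area_struct *vma,
				     int target_nid)
{
	/* Device memory cannot be migrated by NUMA hinting. */
	if (folio_is_zone_device(folio))
		return false;

	/* KSM folios are shared across processes. */
	if (folio_test_ksm(folio))
		return false;

	/* Shared copy-on-write folios. */
	if (is_cow_mapping(vma->vm_flags) &&
	    folio_maybe_mapped_shared(folio))
		return false;

	/* Folios that may be pinned for DMA cannot be migrated. */
	if (folio_maybe_dma_pinned(folio))
		return false;

	/* Dirty file-backed folios would need writeback before migration. */
	if (!folio_test_anon(folio) && folio_test_dirty(folio))
		return false;

	/* Already on the node a hinting fault would migrate it to. */
	if (folio_nid(folio) == target_nid)
		return false;

	return true;
}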
--
Cheers
David
* Re: [linux-next:master] [mm] f66e2727dd: stress-ng.rawsock.ops_per_sec 46.9% regression
2025-11-26 9:49 ` David Hildenbrand (Red Hat)
@ 2025-12-01 20:56 ` Sid Kumar
2025-12-01 21:13 ` David Hildenbrand (Red Hat)
0 siblings, 1 reply; 4+ messages in thread
From: Sid Kumar @ 2025-12-01 20:56 UTC (permalink / raw)
To: David Hildenbrand (Red Hat), kernel test robot, Kefeng Wang
Cc: oe-lkp, lkp, Andrew Morton, Lorenzo Stoakes, Baolin Wang,
Barry Song, Dev Jain, Lance Yang, Liam Howlett, Ryan Roberts,
Zi Yan, linux-mm
On 11/26/25 3:49 AM, David Hildenbrand (Red Hat) wrote:
> On 11/25/25 15:46, kernel test robot wrote:
>>
>>
>> Hello,
>>
>> kernel test robot noticed a 46.9% regression of
>> stress-ng.rawsock.ops_per_sec on:
>>
>>
>> commit: f66e2727ddfcbbe3dbb459e809824f721a914464 ("mm: huge_memory:
>> use folio_can_map_prot_numa() for pmd folio")
>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>
> Unexpected, but maybe simply a symptom of doing the right thing?
>
> "which skips unsuitable folio, i.e. zone device, shared folios (KSM,
> CoW), non-movable dma pinned, dirty file folio and folios that already
> have the expected node affinity."
>
> I suspect skipping shared folios or dirty file folios might make the
> difference. The benchmark results would be misleading in that case: we
> shouldn't have been migrating these pages in the first place.
Reproducing the benchmark and adding prints to identify which condition
makes folio_can_map_prot_numa() return false shows that:
	/* Also skip shared copy-on-write folios */
	if (is_cow_mapping(vma->vm_flags) &&
	    folio_maybe_mapped_shared(folio)) {
		printk("false at is_Cow_mapping\n");
		return false;
	}
virtme-ng% dmesg | grep is_Cow_mapping | wc -l
25302
is the check that now triggers and leads to the regression.
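For context, is_cow_mapping() boils down to checking for a private,
possibly-writable mapping -- roughly the helper from include/linux/mm.h
(quoted from memory, so double-check the tree):

static inline bool is_cow_mapping(vm_flags_t flags)
{
	/* Private (non-VM_SHARED) mapping that may be made writable. */
	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

So any MAP_PRIVATE mapping whose folios are still mapped by more than one
process -- presumably anonymous memory shared between the forked stress-ng
workers here -- is now skipped by the prot_numa update.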
Thanks,
Sid
* Re: [linux-next:master] [mm] f66e2727dd: stress-ng.rawsock.ops_per_sec 46.9% regression
2025-12-01 20:56 ` Sid Kumar
@ 2025-12-01 21:13 ` David Hildenbrand (Red Hat)
0 siblings, 0 replies; 4+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-01 21:13 UTC (permalink / raw)
To: Sid Kumar, kernel test robot, Kefeng Wang
Cc: oe-lkp, lkp, Andrew Morton, Lorenzo Stoakes, Baolin Wang,
Barry Song, Dev Jain, Lance Yang, Liam Howlett, Ryan Roberts,
Zi Yan, linux-mm
On 12/1/25 21:56, Sid Kumar wrote:
>
> On 11/26/25 3:49 AM, David Hildenbrand (Red Hat) wrote:
>> On 11/25/25 15:46, kernel test robot wrote:
>>>
>>>
>>> Hello,
>>>
>>> kernel test robot noticed a 46.9% regression of
>>> stress-ng.rawsock.ops_per_sec on:
>>>
>>>
>>> commit: f66e2727ddfcbbe3dbb459e809824f721a914464 ("mm: huge_memory:
>>> use folio_can_map_prot_numa() for pmd folio")
>>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>>
>> Unexpected, but maybe simply a symptom of doing the right thing?
>>
>> "which skips unsuitable folio, i.e. zone device, shared folios (KSM,
>> CoW), non-movable dma pinned, dirty file folio and folios that already
>> have the expected node affinity."
>>
>> I suspect skipping shared folios or dirty file folios might make the
>> difference. The benchmark results would be misleading in that case: we
>> shouldn't have been migrating these pages in the first place.
>
>
> Reproducing the benchmark and adding prints to identify which condition
> makes folio_can_map_prot_numa() return false shows that:
>
> 	/* Also skip shared copy-on-write folios */
> 	if (is_cow_mapping(vma->vm_flags) &&
> 	    folio_maybe_mapped_shared(folio)) {
> 		printk("false at is_Cow_mapping\n");
> 		return false;
> 	}
>
> virtme-ng% dmesg | grep is_Cow_mapping | wc -l
> 25302
>
> is the check that now triggers and leads to the regression.
Okay, as I thought, it's rather a case of "doing the right thing". At
least we're now doing the same thing we do during PTE faults :)
This check dates back to:
commit 859d4adc3415a64ccb8b0c50dc4e3a888dcb5805
Author: Henry Willard <henry.willard@oracle.com>
Date: Wed Jan 31 16:21:07 2018 -0800
mm: numa: do not trap faults on shared data section pages.
Workloads consisting of a large number of processes running the same
program with a very large shared data segment may experience performance
problems when numa balancing attempts to migrate the shared cow pages.
This manifests itself with many processes or tasks in
TASK_UNINTERRUPTIBLE state waiting for the shared pages to be migrated.
The program listed below simulates the conditions with these results
when run with 288 processes on a 144 core/8 socket machine.
  Average throughput       Average throughput       Average throughput
  with numa_balancing=0    with numa_balancing=1    with numa_balancing=1
                           without the patch        with the patch
  ---------------------    ---------------------    ---------------------
      2118782                  2021534                  2107979
Complex production environments show less variability and fewer poorly
performing outliers accompanied with a smaller number of processes
waiting on NUMA page migration with this patch applied. In some cases,
%iowait drops from 16%-26% to 0.
I think the reproducer would actually not care about anonymous folios, but
I'm not sure whether that would make a difference for the benchmark here.
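If that reading is right, one untested way to express it would be to keep
the skip only for file-backed folios -- a sketch only, reusing the names
from the snippet Sid quoted:

	/* Keep skipping shared-CoW folios only when they are file-backed. */
	if (!folio_test_anon(folio) &&
	    is_cow_mapping(vma->vm_flags) &&
	    folio_maybe_mapped_shared(folio))
		return false;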
--
Cheers
David