* [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression
@ 2023-09-08 5:26 kernel test robot
2023-09-08 14:43 ` Chuck Lever III
From: kernel test robot @ 2023-09-08 5:26 UTC (permalink / raw)
To: Chuck Lever
Cc: oe-lkp, lkp, linux-kernel, Christian Brauner, linux-mm,
ying.huang, feng.tang, fengwei.yin, oliver.sang
Hello,
kernel test robot noticed a -19.0% regression of aim9.disk_src.ops_per_sec on:
commit: a2e459555c5f9da3e619b7e47a63f98574dc75f1 ("shmem: stable directory offsets")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
testcase: aim9
test machine: 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 112G memory
parameters:
testtime: 300s
test: disk_src
cpufreq_governor: performance
In addition to that, the commit also has significant impact on the following tests:
+------------------+-------------------------------------------------------------------------------------------------+
| testcase: change | aim9: aim9.disk_src.ops_per_sec -14.6% regression |
| test machine | 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 112G memory |
| test parameters | cpufreq_governor=performance |
| | test=all |
| | testtime=5s |
+------------------+-------------------------------------------------------------------------------------------------+
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags:
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202309081306.3ecb3734-oliver.sang@intel.com
Details are as below:
-------------------------------------------------------------------------------------------------->
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20230908/202309081306.3ecb3734-oliver.sang@intel.com
=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
gcc-12/performance/x86_64-rhel-8.3/debian-11.1-x86_64-20220510.cgz/lkp-ivb-2ep1/disk_src/aim9/300s
commit:
23a31d8764 ("shmem: Refactor shmem_symlink()")
a2e459555c ("shmem: stable directory offsets")
23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
---------------- ---------------------------
%stddev %change %stddev
\ | \
0.26 ± 9% +0.1 0.36 ± 2% mpstat.cpu.all.soft%
0.61 -0.1 0.52 mpstat.cpu.all.usr%
0.16 ± 10% -18.9% 0.13 ± 12% perf-sched.sch_delay.max.ms.pipe_read.vfs_read.ksys_read.do_syscall_64
0.04 ± 7% +1802.4% 0.78 ±115% perf-sched.sch_delay.max.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
202424 -19.0% 163868 aim9.disk_src.ops_per_sec
94.83 -4.2% 90.83 aim9.time.percent_of_cpu_this_job_got
73.62 -17.6% 60.69 aim9.time.user_time
23541 +6.5% 25074 proc-vmstat.nr_slab_reclaimable
1437319 ± 24% +377.6% 6864201 proc-vmstat.numa_hit
1387016 ± 25% +391.4% 6815486 proc-vmstat.numa_local
4864362 ± 34% +453.6% 26931180 proc-vmstat.pgalloc_normal
4835960 ± 34% +455.4% 26856610 proc-vmstat.pgfree
538959 ± 24% -23.2% 414090 sched_debug.cfs_rq:/.load.max
130191 ± 14% -13.3% 112846 ± 6% sched_debug.cfs_rq:/.load.stddev
116849 ± 27% -51.2% 56995 ± 20% sched_debug.cfs_rq:/.min_vruntime.max
1223 ±191% -897.4% -9754 sched_debug.cfs_rq:/.spread0.avg
107969 ± 29% -65.3% 37448 ± 39% sched_debug.cfs_rq:/.spread0.max
55209 ± 14% -21.8% 43154 ± 14% sched_debug.cpu.nr_switches.max
11.21 +23.7% 13.87 perf-stat.i.MPKI
7.223e+08 -4.4% 6.907e+08 perf-stat.i.branch-instructions
2.67 +0.2 2.88 perf-stat.i.branch-miss-rate%
19988363 +2.8% 20539702 perf-stat.i.branch-misses
17.36 -2.8 14.59 perf-stat.i.cache-miss-rate%
40733859 +19.5% 48659982 perf-stat.i.cache-references
1.76 +3.5% 1.82 perf-stat.i.cpi
55.21 +5.4% 58.21 ± 2% perf-stat.i.cpu-migrations
1.01e+09 -3.8% 9.719e+08 perf-stat.i.dTLB-loads
0.26 ± 4% -0.0 0.23 ± 3% perf-stat.i.dTLB-store-miss-rate%
2166022 ± 4% -6.9% 2015917 ± 3% perf-stat.i.dTLB-store-misses
8.503e+08 +5.5% 8.968e+08 perf-stat.i.dTLB-stores
69.22 ± 4% +6.4 75.60 perf-stat.i.iTLB-load-miss-rate%
316455 ± 12% -31.6% 216531 ± 3% perf-stat.i.iTLB-loads
3.722e+09 -3.1% 3.608e+09 perf-stat.i.instructions
0.57 -3.3% 0.55 perf-stat.i.ipc
865.04 -10.4% 775.02 ± 3% perf-stat.i.metric.K/sec
47.51 -2.1 45.37 perf-stat.i.node-load-miss-rate%
106705 ± 3% +14.8% 122490 ± 5% perf-stat.i.node-loads
107169 ± 4% +29.0% 138208 ± 7% perf-stat.i.node-stores
10.94 +23.3% 13.49 perf-stat.overall.MPKI
2.77 +0.2 2.97 perf-stat.overall.branch-miss-rate%
17.28 -2.7 14.56 perf-stat.overall.cache-miss-rate%
1.73 +3.4% 1.79 perf-stat.overall.cpi
0.25 ± 4% -0.0 0.22 ± 3% perf-stat.overall.dTLB-store-miss-rate%
69.20 ± 4% +6.4 75.60 perf-stat.overall.iTLB-load-miss-rate%
0.58 -3.2% 0.56 perf-stat.overall.ipc
45.25 -2.2 43.10 perf-stat.overall.node-load-miss-rate%
7.199e+08 -4.4% 6.883e+08 perf-stat.ps.branch-instructions
19919808 +2.8% 20469001 perf-stat.ps.branch-misses
40597326 +19.5% 48497201 perf-stat.ps.cache-references
55.06 +5.4% 58.03 ± 2% perf-stat.ps.cpu-migrations
1.007e+09 -3.8% 9.686e+08 perf-stat.ps.dTLB-loads
2158768 ± 4% -6.9% 2009174 ± 3% perf-stat.ps.dTLB-store-misses
8.475e+08 +5.5% 8.937e+08 perf-stat.ps.dTLB-stores
315394 ± 12% -31.6% 215816 ± 3% perf-stat.ps.iTLB-loads
3.71e+09 -3.1% 3.595e+09 perf-stat.ps.instructions
106351 ± 3% +14.8% 122083 ± 5% perf-stat.ps.node-loads
106728 ± 4% +29.1% 137740 ± 7% perf-stat.ps.node-stores
1.117e+12 -3.0% 1.084e+12 perf-stat.total.instructions
0.00 +0.8 0.75 ± 12% perf-profile.calltrace.cycles-pp.__call_rcu_common.xas_store.__xa_erase.xa_erase.simple_offset_remove
0.00 +0.8 0.78 ± 34% perf-profile.calltrace.cycles-pp.___slab_alloc.kmem_cache_alloc_lru.xas_alloc.xas_create.xas_store
0.00 +0.8 0.83 ± 29% perf-profile.calltrace.cycles-pp.allocate_slab.___slab_alloc.kmem_cache_alloc_lru.xas_alloc.xas_expand
0.00 +0.9 0.92 ± 26% perf-profile.calltrace.cycles-pp.___slab_alloc.kmem_cache_alloc_lru.xas_alloc.xas_expand.xas_create
0.00 +1.0 0.99 ± 27% perf-profile.calltrace.cycles-pp.shuffle_freelist.allocate_slab.___slab_alloc.kmem_cache_alloc_lru.xas_alloc
0.00 +1.0 1.04 ± 28% perf-profile.calltrace.cycles-pp.kmem_cache_alloc_lru.xas_alloc.xas_create.xas_store.__xa_alloc
0.00 +1.1 1.11 ± 26% perf-profile.calltrace.cycles-pp.xas_alloc.xas_create.xas_store.__xa_alloc.__xa_alloc_cyclic
1.51 ± 24% +1.2 2.73 ± 10% perf-profile.calltrace.cycles-pp.vfs_unlink.do_unlinkat.__x64_sys_unlink.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.00 +1.2 1.24 ± 20% perf-profile.calltrace.cycles-pp.kmem_cache_alloc_lru.xas_alloc.xas_expand.xas_create.xas_store
0.00 +1.3 1.27 ± 10% perf-profile.calltrace.cycles-pp.xas_store.__xa_erase.xa_erase.simple_offset_remove.shmem_unlink
0.00 +1.3 1.30 ± 10% perf-profile.calltrace.cycles-pp.__xa_erase.xa_erase.simple_offset_remove.shmem_unlink.vfs_unlink
0.00 +1.3 1.33 ± 19% perf-profile.calltrace.cycles-pp.xas_alloc.xas_expand.xas_create.xas_store.__xa_alloc
0.00 +1.4 1.36 ± 10% perf-profile.calltrace.cycles-pp.xa_erase.simple_offset_remove.shmem_unlink.vfs_unlink.do_unlinkat
0.00 +1.4 1.37 ± 10% perf-profile.calltrace.cycles-pp.simple_offset_remove.shmem_unlink.vfs_unlink.do_unlinkat.__x64_sys_unlink
0.00 +1.5 1.51 ± 17% perf-profile.calltrace.cycles-pp.xas_expand.xas_create.xas_store.__xa_alloc.__xa_alloc_cyclic
0.00 +1.6 1.62 ± 12% perf-profile.calltrace.cycles-pp.shmem_unlink.vfs_unlink.do_unlinkat.__x64_sys_unlink.do_syscall_64
0.00 +2.8 2.80 ± 13% perf-profile.calltrace.cycles-pp.xas_create.xas_store.__xa_alloc.__xa_alloc_cyclic.simple_offset_add
0.00 +2.9 2.94 ± 13% perf-profile.calltrace.cycles-pp.xas_store.__xa_alloc.__xa_alloc_cyclic.simple_offset_add.shmem_mknod
5.38 ± 24% +3.1 8.51 ± 11% perf-profile.calltrace.cycles-pp.lookup_open.open_last_lookups.path_openat.do_filp_open.do_sys_openat2
6.08 ± 24% +3.2 9.24 ± 12% perf-profile.calltrace.cycles-pp.open_last_lookups.path_openat.do_filp_open.do_sys_openat2.__x64_sys_creat
0.00 +3.2 3.20 ± 13% perf-profile.calltrace.cycles-pp.__xa_alloc.__xa_alloc_cyclic.simple_offset_add.shmem_mknod.lookup_open
0.00 +3.2 3.24 ± 13% perf-profile.calltrace.cycles-pp.__xa_alloc_cyclic.simple_offset_add.shmem_mknod.lookup_open.open_last_lookups
0.00 +3.4 3.36 ± 14% perf-profile.calltrace.cycles-pp.simple_offset_add.shmem_mknod.lookup_open.open_last_lookups.path_openat
2.78 ± 25% +3.4 6.17 ± 12% perf-profile.calltrace.cycles-pp.shmem_mknod.lookup_open.open_last_lookups.path_openat.do_filp_open
0.16 ± 30% -0.1 0.08 ± 20% perf-profile.children.cycles-pp.map_id_up
0.02 ±146% +0.1 0.08 ± 13% perf-profile.children.cycles-pp.shmem_is_huge
0.02 ±141% +0.1 0.09 ± 16% perf-profile.children.cycles-pp.__list_del_entry_valid
0.00 +0.1 0.08 ± 11% perf-profile.children.cycles-pp.free_unref_page
0.00 +0.1 0.08 ± 13% perf-profile.children.cycles-pp.shmem_destroy_inode
0.04 ±101% +0.1 0.14 ± 25% perf-profile.children.cycles-pp.rcu_nocb_try_bypass
0.00 +0.1 0.12 ± 27% perf-profile.children.cycles-pp.xas_find_marked
0.02 ±144% +0.1 0.16 ± 14% perf-profile.children.cycles-pp.__unfreeze_partials
0.03 ±106% +0.2 0.19 ± 26% perf-profile.children.cycles-pp.xas_descend
0.01 ±223% +0.2 0.17 ± 15% perf-profile.children.cycles-pp.get_page_from_freelist
0.11 ± 22% +0.2 0.29 ± 16% perf-profile.children.cycles-pp.rcu_segcblist_enqueue
0.02 ±146% +0.2 0.24 ± 13% perf-profile.children.cycles-pp.__alloc_pages
0.36 ± 79% +0.6 0.98 ± 15% perf-profile.children.cycles-pp.__slab_free
0.50 ± 26% +0.7 1.23 ± 14% perf-profile.children.cycles-pp.__call_rcu_common
0.00 +0.8 0.82 ± 13% perf-profile.children.cycles-pp.radix_tree_node_rcu_free
0.00 +1.1 1.14 ± 17% perf-profile.children.cycles-pp.radix_tree_node_ctor
0.16 ± 86% +1.2 1.38 ± 16% perf-profile.children.cycles-pp.setup_object
1.52 ± 25% +1.2 2.75 ± 10% perf-profile.children.cycles-pp.vfs_unlink
0.36 ± 22% +1.3 1.63 ± 12% perf-profile.children.cycles-pp.shmem_unlink
0.00 +1.3 1.30 ± 10% perf-profile.children.cycles-pp.__xa_erase
0.20 ± 79% +1.3 1.53 ± 15% perf-profile.children.cycles-pp.shuffle_freelist
0.00 +1.4 1.36 ± 10% perf-profile.children.cycles-pp.xa_erase
0.00 +1.4 1.38 ± 10% perf-profile.children.cycles-pp.simple_offset_remove
0.00 +1.5 1.51 ± 17% perf-profile.children.cycles-pp.xas_expand
0.26 ± 78% +1.6 1.87 ± 13% perf-profile.children.cycles-pp.allocate_slab
0.40 ± 49% +1.7 2.10 ± 13% perf-profile.children.cycles-pp.___slab_alloc
1.30 ± 85% +2.1 3.42 ± 12% perf-profile.children.cycles-pp.rcu_do_batch
1.56 ± 27% +2.4 3.93 ± 11% perf-profile.children.cycles-pp.kmem_cache_alloc_lru
0.00 +2.4 2.44 ± 12% perf-profile.children.cycles-pp.xas_alloc
2.66 ± 13% +2.5 5.14 ± 5% perf-profile.children.cycles-pp.__irq_exit_rcu
11.16 ± 10% +2.7 13.88 ± 8% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
11.77 ± 10% +2.7 14.49 ± 8% perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
0.00 +2.8 2.82 ± 13% perf-profile.children.cycles-pp.xas_create
5.40 ± 24% +3.1 8.52 ± 11% perf-profile.children.cycles-pp.lookup_open
6.12 ± 24% +3.1 9.27 ± 12% perf-profile.children.cycles-pp.open_last_lookups
0.00 +3.2 3.22 ± 13% perf-profile.children.cycles-pp.__xa_alloc
0.00 +3.2 3.24 ± 13% perf-profile.children.cycles-pp.__xa_alloc_cyclic
0.00 +3.4 3.36 ± 14% perf-profile.children.cycles-pp.simple_offset_add
2.78 ± 25% +3.4 6.18 ± 12% perf-profile.children.cycles-pp.shmem_mknod
0.00 +4.2 4.24 ± 12% perf-profile.children.cycles-pp.xas_store
0.14 ± 27% -0.1 0.08 ± 21% perf-profile.self.cycles-pp.map_id_up
0.00 +0.1 0.06 ± 24% perf-profile.self.cycles-pp.shmem_destroy_inode
0.00 +0.1 0.07 ± 8% perf-profile.self.cycles-pp.__xa_alloc
0.02 ±146% +0.1 0.11 ± 28% perf-profile.self.cycles-pp.rcu_nocb_try_bypass
0.01 ±223% +0.1 0.10 ± 28% perf-profile.self.cycles-pp.shuffle_freelist
0.00 +0.1 0.11 ± 40% perf-profile.self.cycles-pp.xas_create
0.00 +0.1 0.12 ± 27% perf-profile.self.cycles-pp.xas_find_marked
0.00 +0.1 0.14 ± 18% perf-profile.self.cycles-pp.xas_alloc
0.03 ±103% +0.1 0.17 ± 29% perf-profile.self.cycles-pp.xas_descend
0.00 +0.2 0.16 ± 23% perf-profile.self.cycles-pp.xas_expand
0.10 ± 22% +0.2 0.27 ± 16% perf-profile.self.cycles-pp.rcu_segcblist_enqueue
0.00 +0.4 0.36 ± 16% perf-profile.self.cycles-pp.xas_store
0.32 ± 30% +0.4 0.71 ± 12% perf-profile.self.cycles-pp.__call_rcu_common
0.18 ± 27% +0.5 0.65 ± 8% perf-profile.self.cycles-pp.kmem_cache_alloc_lru
0.36 ± 79% +0.6 0.96 ± 15% perf-profile.self.cycles-pp.__slab_free
0.00 +0.8 0.80 ± 14% perf-profile.self.cycles-pp.radix_tree_node_rcu_free
0.00 +1.0 1.01 ± 16% perf-profile.self.cycles-pp.radix_tree_node_ctor
***************************************************************************************************
lkp-ivb-2ep1: 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 112G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
gcc-12/performance/x86_64-rhel-8.3/debian-11.1-x86_64-20220510.cgz/lkp-ivb-2ep1/all/aim9/5s
commit:
23a31d8764 ("shmem: Refactor shmem_symlink()")
a2e459555c ("shmem: stable directory offsets")
23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
---------------- ---------------------------
%stddev %change %stddev
\ | \
9781285 +2.0% 9975309 proc-vmstat.pgalloc_normal
4481052 -1.6% 4408359 proc-vmstat.pgfault
9749965 +2.0% 9942285 proc-vmstat.pgfree
14556 -1.6% 14324 perf-stat.i.minor-faults
14556 -1.6% 14324 perf-stat.i.page-faults
14505 -1.6% 14272 perf-stat.ps.minor-faults
14505 -1.6% 14272 perf-stat.ps.page-faults
849714 -3.6% 819341 aim9.brk_test.ops_per_sec
478138 +3.1% 492806 aim9.dgram_pipe.ops_per_sec
199087 -14.6% 170071 aim9.disk_src.ops_per_sec
286595 -9.7% 258794 aim9.link_test.ops_per_sec
303603 -2.8% 295009 aim9.page_test.ops_per_sec
3692190 -1.7% 3629732 aim9.time.minor_page_faults
0.00 +1.0 0.95 ± 25% perf-profile.calltrace.cycles-pp.xas_create.xas_store.__xa_alloc.__xa_alloc_cyclic.simple_offset_add
0.00 +1.0 1.01 ± 23% perf-profile.calltrace.cycles-pp.xas_store.__xa_alloc.__xa_alloc_cyclic.simple_offset_add.shmem_mknod
1.54 ± 22% +1.1 2.61 ± 22% perf-profile.calltrace.cycles-pp.shmem_mknod.lookup_open.open_last_lookups.path_openat.do_filp_open
0.00 +1.2 1.15 ± 21% perf-profile.calltrace.cycles-pp.__xa_alloc.__xa_alloc_cyclic.simple_offset_add.shmem_mknod.lookup_open
0.00 +1.2 1.18 ± 21% perf-profile.calltrace.cycles-pp.__xa_alloc_cyclic.simple_offset_add.shmem_mknod.lookup_open.open_last_lookups
0.00 +1.2 1.22 ± 21% perf-profile.calltrace.cycles-pp.simple_offset_add.shmem_mknod.lookup_open.open_last_lookups.path_openat
0.28 ± 21% +0.2 0.45 ± 24% perf-profile.children.cycles-pp.__call_rcu_common
0.00 +0.3 0.26 ± 43% perf-profile.children.cycles-pp.radix_tree_node_rcu_free
0.14 ± 46% +0.3 0.45 ± 20% perf-profile.children.cycles-pp.setup_object
0.00 +0.3 0.33 ± 24% perf-profile.children.cycles-pp.radix_tree_node_ctor
0.16 ± 49% +0.4 0.52 ± 24% perf-profile.children.cycles-pp.shuffle_freelist
0.23 ± 43% +0.4 0.63 ± 23% perf-profile.children.cycles-pp.allocate_slab
0.30 ± 35% +0.4 0.74 ± 24% perf-profile.children.cycles-pp.___slab_alloc
0.17 ± 25% +0.5 0.66 ± 23% perf-profile.children.cycles-pp.shmem_unlink
0.00 +0.5 0.49 ± 24% perf-profile.children.cycles-pp.__xa_erase
0.00 +0.5 0.52 ± 24% perf-profile.children.cycles-pp.xa_erase
0.00 +0.5 0.52 ± 64% perf-profile.children.cycles-pp.xas_expand
0.00 +0.5 0.53 ± 24% perf-profile.children.cycles-pp.simple_offset_remove
0.87 ± 26% +0.7 1.56 ± 23% perf-profile.children.cycles-pp.kmem_cache_alloc_lru
2.44 ± 12% +0.8 3.25 ± 13% perf-profile.children.cycles-pp.__irq_exit_rcu
0.00 +0.8 0.82 ± 24% perf-profile.children.cycles-pp.xas_alloc
0.01 ±230% +1.0 0.99 ± 23% perf-profile.children.cycles-pp.xas_create
1.55 ± 22% +1.1 2.63 ± 22% perf-profile.children.cycles-pp.shmem_mknod
0.00 +1.2 1.16 ± 21% perf-profile.children.cycles-pp.__xa_alloc
0.00 +1.2 1.18 ± 21% perf-profile.children.cycles-pp.__xa_alloc_cyclic
0.00 +1.2 1.22 ± 21% perf-profile.children.cycles-pp.simple_offset_add
0.18 ± 28% +1.5 1.65 ± 21% perf-profile.children.cycles-pp.xas_store
0.11 ± 31% +0.1 0.25 ± 27% perf-profile.self.cycles-pp.xas_store
0.11 ± 31% +0.2 0.28 ± 24% perf-profile.self.cycles-pp.kmem_cache_alloc_lru
0.00 +0.3 0.26 ± 44% perf-profile.self.cycles-pp.radix_tree_node_rcu_free
0.00 +0.3 0.29 ± 23% perf-profile.self.cycles-pp.radix_tree_node_ctor
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression
2023-09-08 5:26 [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression kernel test robot
@ 2023-09-08 14:43 ` Chuck Lever III
2023-09-12 1:25 ` Oliver Sang
From: Chuck Lever III @ 2023-09-08 14:43 UTC (permalink / raw)
To: kernel test robot
Cc: oe-lkp, kernel test robot, Linux Kernel Mailing List,
Christian Brauner, linux-mm, ying.huang, feng.tang, fengwei.yin
> On Sep 8, 2023, at 1:26 AM, kernel test robot <oliver.sang@intel.com> wrote:
>
>
>
> Hello,
>
> kernel test robot noticed a -19.0% regression of aim9.disk_src.ops_per_sec on:
>
>
> commit: a2e459555c5f9da3e619b7e47a63f98574dc75f1 ("shmem: stable directory offsets")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> testcase: aim9
> test machine: 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 112G memory
> parameters:
>
> testtime: 300s
> test: disk_src
> cpufreq_governor: performance
>
>
> In addition to that, the commit also has significant impact on the following tests:
>
> +------------------+-------------------------------------------------------------------------------------------------+
> | testcase: change | aim9: aim9.disk_src.ops_per_sec -14.6% regression |
> | test machine | 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 112G memory |
> | test parameters | cpufreq_governor=performance |
> | | test=all |
> | | testtime=5s |
> +------------------+-------------------------------------------------------------------------------------------------+
>
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202309081306.3ecb3734-oliver.sang@intel.com
Hi, several weeks ago we requested that these tests be run
again by the robot because they can't be run in environments
I have available to me (the tests do not run on Fedora, and
I don't have any big iron).
We wanted the tests rerun before the patch was committed.
There was a deafening silence. So I assumed the work I did
then to address the regression was successful, and the
patches are now in upstream Linux.
This new report is disappointing.
But, I'm still in a position where I can't run this test,
and the results don't really indicate where the problem
is. So I can't possibly address this issue.
Any suggestions, advice, or help would be appreciated.
> Details are as below:
> -------------------------------------------------------------------------------------------------->
>
>
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20230908/202309081306.3ecb3734-oliver.sang@intel.com
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
> gcc-12/performance/x86_64-rhel-8.3/debian-11.1-x86_64-20220510.cgz/lkp-ivb-2ep1/disk_src/aim9/300s
>
> commit:
> 23a31d8764 ("shmem: Refactor shmem_symlink()")
> a2e459555c ("shmem: stable directory offsets")
>
> 23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
> ---------------- ---------------------------
> %stddev %change %stddev
> \ | \
> 0.26 ± 9% +0.1 0.36 ± 2% mpstat.cpu.all.soft%
> 0.61 -0.1 0.52 mpstat.cpu.all.usr%
> 0.16 ± 10% -18.9% 0.13 ± 12% perf-sched.sch_delay.max.ms.pipe_read.vfs_read.ksys_read.do_syscall_64
> 0.04 ± 7% +1802.4% 0.78 ±115% perf-sched.sch_delay.max.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
> 202424 -19.0% 163868 aim9.disk_src.ops_per_sec
> 94.83 -4.2% 90.83 aim9.time.percent_of_cpu_this_job_got
> 73.62 -17.6% 60.69 aim9.time.user_time
> 23541 +6.5% 25074 proc-vmstat.nr_slab_reclaimable
> 1437319 ± 24% +377.6% 6864201 proc-vmstat.numa_hit
> 1387016 ± 25% +391.4% 6815486 proc-vmstat.numa_local
> 4864362 ± 34% +453.6% 26931180 proc-vmstat.pgalloc_normal
> 4835960 ± 34% +455.4% 26856610 proc-vmstat.pgfree
> 538959 ± 24% -23.2% 414090 sched_debug.cfs_rq:/.load.max
> 130191 ± 14% -13.3% 112846 ± 6% sched_debug.cfs_rq:/.load.stddev
> 116849 ± 27% -51.2% 56995 ± 20% sched_debug.cfs_rq:/.min_vruntime.max
> 1223 ±191% -897.4% -9754 sched_debug.cfs_rq:/.spread0.avg
> 107969 ± 29% -65.3% 37448 ± 39% sched_debug.cfs_rq:/.spread0.max
> 55209 ± 14% -21.8% 43154 ± 14% sched_debug.cpu.nr_switches.max
> 11.21 +23.7% 13.87 perf-stat.i.MPKI
> 7.223e+08 -4.4% 6.907e+08 perf-stat.i.branch-instructions
> 2.67 +0.2 2.88 perf-stat.i.branch-miss-rate%
> 19988363 +2.8% 20539702 perf-stat.i.branch-misses
> 17.36 -2.8 14.59 perf-stat.i.cache-miss-rate%
> 40733859 +19.5% 48659982 perf-stat.i.cache-references
> 1.76 +3.5% 1.82 perf-stat.i.cpi
> 55.21 +5.4% 58.21 ± 2% perf-stat.i.cpu-migrations
> 1.01e+09 -3.8% 9.719e+08 perf-stat.i.dTLB-loads
> 0.26 ± 4% -0.0 0.23 ± 3% perf-stat.i.dTLB-store-miss-rate%
> 2166022 ± 4% -6.9% 2015917 ± 3% perf-stat.i.dTLB-store-misses
> 8.503e+08 +5.5% 8.968e+08 perf-stat.i.dTLB-stores
> 69.22 ± 4% +6.4 75.60 perf-stat.i.iTLB-load-miss-rate%
> 316455 ± 12% -31.6% 216531 ± 3% perf-stat.i.iTLB-loads
> 3.722e+09 -3.1% 3.608e+09 perf-stat.i.instructions
> 0.57 -3.3% 0.55 perf-stat.i.ipc
> 865.04 -10.4% 775.02 ± 3% perf-stat.i.metric.K/sec
> 47.51 -2.1 45.37 perf-stat.i.node-load-miss-rate%
> 106705 ± 3% +14.8% 122490 ± 5% perf-stat.i.node-loads
> 107169 ± 4% +29.0% 138208 ± 7% perf-stat.i.node-stores
> 10.94 +23.3% 13.49 perf-stat.overall.MPKI
> 2.77 +0.2 2.97 perf-stat.overall.branch-miss-rate%
> 17.28 -2.7 14.56 perf-stat.overall.cache-miss-rate%
> 1.73 +3.4% 1.79 perf-stat.overall.cpi
> 0.25 ± 4% -0.0 0.22 ± 3% perf-stat.overall.dTLB-store-miss-rate%
> 69.20 ± 4% +6.4 75.60 perf-stat.overall.iTLB-load-miss-rate%
> 0.58 -3.2% 0.56 perf-stat.overall.ipc
> 45.25 -2.2 43.10 perf-stat.overall.node-load-miss-rate%
> 7.199e+08 -4.4% 6.883e+08 perf-stat.ps.branch-instructions
> 19919808 +2.8% 20469001 perf-stat.ps.branch-misses
> 40597326 +19.5% 48497201 perf-stat.ps.cache-references
> 55.06 +5.4% 58.03 ± 2% perf-stat.ps.cpu-migrations
> 1.007e+09 -3.8% 9.686e+08 perf-stat.ps.dTLB-loads
> 2158768 ± 4% -6.9% 2009174 ± 3% perf-stat.ps.dTLB-store-misses
> 8.475e+08 +5.5% 8.937e+08 perf-stat.ps.dTLB-stores
> 315394 ± 12% -31.6% 215816 ± 3% perf-stat.ps.iTLB-loads
> 3.71e+09 -3.1% 3.595e+09 perf-stat.ps.instructions
> 106351 ± 3% +14.8% 122083 ± 5% perf-stat.ps.node-loads
> 106728 ± 4% +29.1% 137740 ± 7% perf-stat.ps.node-stores
> 1.117e+12 -3.0% 1.084e+12 perf-stat.total.instructions
> 0.00 +0.8 0.75 ± 12% perf-profile.calltrace.cycles-pp.__call_rcu_common.xas_store.__xa_erase.xa_erase.simple_offset_remove
> 0.00 +0.8 0.78 ± 34% perf-profile.calltrace.cycles-pp.___slab_alloc.kmem_cache_alloc_lru.xas_alloc.xas_create.xas_store
> 0.00 +0.8 0.83 ± 29% perf-profile.calltrace.cycles-pp.allocate_slab.___slab_alloc.kmem_cache_alloc_lru.xas_alloc.xas_expand
> 0.00 +0.9 0.92 ± 26% perf-profile.calltrace.cycles-pp.___slab_alloc.kmem_cache_alloc_lru.xas_alloc.xas_expand.xas_create
> 0.00 +1.0 0.99 ± 27% perf-profile.calltrace.cycles-pp.shuffle_freelist.allocate_slab.___slab_alloc.kmem_cache_alloc_lru.xas_alloc
> 0.00 +1.0 1.04 ± 28% perf-profile.calltrace.cycles-pp.kmem_cache_alloc_lru.xas_alloc.xas_create.xas_store.__xa_alloc
> 0.00 +1.1 1.11 ± 26% perf-profile.calltrace.cycles-pp.xas_alloc.xas_create.xas_store.__xa_alloc.__xa_alloc_cyclic
> 1.51 ± 24% +1.2 2.73 ± 10% perf-profile.calltrace.cycles-pp.vfs_unlink.do_unlinkat.__x64_sys_unlink.do_syscall_64.entry_SYSCALL_64_after_hwframe
> 0.00 +1.2 1.24 ± 20% perf-profile.calltrace.cycles-pp.kmem_cache_alloc_lru.xas_alloc.xas_expand.xas_create.xas_store
> 0.00 +1.3 1.27 ± 10% perf-profile.calltrace.cycles-pp.xas_store.__xa_erase.xa_erase.simple_offset_remove.shmem_unlink
> 0.00 +1.3 1.30 ± 10% perf-profile.calltrace.cycles-pp.__xa_erase.xa_erase.simple_offset_remove.shmem_unlink.vfs_unlink
> 0.00 +1.3 1.33 ± 19% perf-profile.calltrace.cycles-pp.xas_alloc.xas_expand.xas_create.xas_store.__xa_alloc
> 0.00 +1.4 1.36 ± 10% perf-profile.calltrace.cycles-pp.xa_erase.simple_offset_remove.shmem_unlink.vfs_unlink.do_unlinkat
> 0.00 +1.4 1.37 ± 10% perf-profile.calltrace.cycles-pp.simple_offset_remove.shmem_unlink.vfs_unlink.do_unlinkat.__x64_sys_unlink
> 0.00 +1.5 1.51 ± 17% perf-profile.calltrace.cycles-pp.xas_expand.xas_create.xas_store.__xa_alloc.__xa_alloc_cyclic
> 0.00 +1.6 1.62 ± 12% perf-profile.calltrace.cycles-pp.shmem_unlink.vfs_unlink.do_unlinkat.__x64_sys_unlink.do_syscall_64
> 0.00 +2.8 2.80 ± 13% perf-profile.calltrace.cycles-pp.xas_create.xas_store.__xa_alloc.__xa_alloc_cyclic.simple_offset_add
> 0.00 +2.9 2.94 ± 13% perf-profile.calltrace.cycles-pp.xas_store.__xa_alloc.__xa_alloc_cyclic.simple_offset_add.shmem_mknod
> 5.38 ± 24% +3.1 8.51 ± 11% perf-profile.calltrace.cycles-pp.lookup_open.open_last_lookups.path_openat.do_filp_open.do_sys_openat2
> 6.08 ± 24% +3.2 9.24 ± 12% perf-profile.calltrace.cycles-pp.open_last_lookups.path_openat.do_filp_open.do_sys_openat2.__x64_sys_creat
> 0.00 +3.2 3.20 ± 13% perf-profile.calltrace.cycles-pp.__xa_alloc.__xa_alloc_cyclic.simple_offset_add.shmem_mknod.lookup_open
> 0.00 +3.2 3.24 ± 13% perf-profile.calltrace.cycles-pp.__xa_alloc_cyclic.simple_offset_add.shmem_mknod.lookup_open.open_last_lookups
> 0.00 +3.4 3.36 ± 14% perf-profile.calltrace.cycles-pp.simple_offset_add.shmem_mknod.lookup_open.open_last_lookups.path_openat
> 2.78 ± 25% +3.4 6.17 ± 12% perf-profile.calltrace.cycles-pp.shmem_mknod.lookup_open.open_last_lookups.path_openat.do_filp_open
> 0.16 ± 30% -0.1 0.08 ± 20% perf-profile.children.cycles-pp.map_id_up
> 0.02 ±146% +0.1 0.08 ± 13% perf-profile.children.cycles-pp.shmem_is_huge
> 0.02 ±141% +0.1 0.09 ± 16% perf-profile.children.cycles-pp.__list_del_entry_valid
> 0.00 +0.1 0.08 ± 11% perf-profile.children.cycles-pp.free_unref_page
> 0.00 +0.1 0.08 ± 13% perf-profile.children.cycles-pp.shmem_destroy_inode
> 0.04 ±101% +0.1 0.14 ± 25% perf-profile.children.cycles-pp.rcu_nocb_try_bypass
> 0.00 +0.1 0.12 ± 27% perf-profile.children.cycles-pp.xas_find_marked
> 0.02 ±144% +0.1 0.16 ± 14% perf-profile.children.cycles-pp.__unfreeze_partials
> 0.03 ±106% +0.2 0.19 ± 26% perf-profile.children.cycles-pp.xas_descend
> 0.01 ±223% +0.2 0.17 ± 15% perf-profile.children.cycles-pp.get_page_from_freelist
> 0.11 ± 22% +0.2 0.29 ± 16% perf-profile.children.cycles-pp.rcu_segcblist_enqueue
> 0.02 ±146% +0.2 0.24 ± 13% perf-profile.children.cycles-pp.__alloc_pages
> 0.36 ± 79% +0.6 0.98 ± 15% perf-profile.children.cycles-pp.__slab_free
> 0.50 ± 26% +0.7 1.23 ± 14% perf-profile.children.cycles-pp.__call_rcu_common
> 0.00 +0.8 0.82 ± 13% perf-profile.children.cycles-pp.radix_tree_node_rcu_free
> 0.00 +1.1 1.14 ± 17% perf-profile.children.cycles-pp.radix_tree_node_ctor
> 0.16 ± 86% +1.2 1.38 ± 16% perf-profile.children.cycles-pp.setup_object
> 1.52 ± 25% +1.2 2.75 ± 10% perf-profile.children.cycles-pp.vfs_unlink
> 0.36 ± 22% +1.3 1.63 ± 12% perf-profile.children.cycles-pp.shmem_unlink
> 0.00 +1.3 1.30 ± 10% perf-profile.children.cycles-pp.__xa_erase
> 0.20 ± 79% +1.3 1.53 ± 15% perf-profile.children.cycles-pp.shuffle_freelist
> 0.00 +1.4 1.36 ± 10% perf-profile.children.cycles-pp.xa_erase
> 0.00 +1.4 1.38 ± 10% perf-profile.children.cycles-pp.simple_offset_remove
> 0.00 +1.5 1.51 ± 17% perf-profile.children.cycles-pp.xas_expand
> 0.26 ± 78% +1.6 1.87 ± 13% perf-profile.children.cycles-pp.allocate_slab
> 0.40 ± 49% +1.7 2.10 ± 13% perf-profile.children.cycles-pp.___slab_alloc
> 1.30 ± 85% +2.1 3.42 ± 12% perf-profile.children.cycles-pp.rcu_do_batch
> 1.56 ± 27% +2.4 3.93 ± 11% perf-profile.children.cycles-pp.kmem_cache_alloc_lru
> 0.00 +2.4 2.44 ± 12% perf-profile.children.cycles-pp.xas_alloc
> 2.66 ± 13% +2.5 5.14 ± 5% perf-profile.children.cycles-pp.__irq_exit_rcu
> 11.16 ± 10% +2.7 13.88 ± 8% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
> 11.77 ± 10% +2.7 14.49 ± 8% perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
> 0.00 +2.8 2.82 ± 13% perf-profile.children.cycles-pp.xas_create
> 5.40 ± 24% +3.1 8.52 ± 11% perf-profile.children.cycles-pp.lookup_open
> 6.12 ± 24% +3.1 9.27 ± 12% perf-profile.children.cycles-pp.open_last_lookups
> 0.00 +3.2 3.22 ± 13% perf-profile.children.cycles-pp.__xa_alloc
> 0.00 +3.2 3.24 ± 13% perf-profile.children.cycles-pp.__xa_alloc_cyclic
> 0.00 +3.4 3.36 ± 14% perf-profile.children.cycles-pp.simple_offset_add
> 2.78 ± 25% +3.4 6.18 ± 12% perf-profile.children.cycles-pp.shmem_mknod
> 0.00 +4.2 4.24 ± 12% perf-profile.children.cycles-pp.xas_store
> 0.14 ± 27% -0.1 0.08 ± 21% perf-profile.self.cycles-pp.map_id_up
> 0.00 +0.1 0.06 ± 24% perf-profile.self.cycles-pp.shmem_destroy_inode
> 0.00 +0.1 0.07 ± 8% perf-profile.self.cycles-pp.__xa_alloc
> 0.02 ±146% +0.1 0.11 ± 28% perf-profile.self.cycles-pp.rcu_nocb_try_bypass
> 0.01 ±223% +0.1 0.10 ± 28% perf-profile.self.cycles-pp.shuffle_freelist
> 0.00 +0.1 0.11 ± 40% perf-profile.self.cycles-pp.xas_create
> 0.00 +0.1 0.12 ± 27% perf-profile.self.cycles-pp.xas_find_marked
> 0.00 +0.1 0.14 ± 18% perf-profile.self.cycles-pp.xas_alloc
> 0.03 ±103% +0.1 0.17 ± 29% perf-profile.self.cycles-pp.xas_descend
> 0.00 +0.2 0.16 ± 23% perf-profile.self.cycles-pp.xas_expand
> 0.10 ± 22% +0.2 0.27 ± 16% perf-profile.self.cycles-pp.rcu_segcblist_enqueue
> 0.00 +0.4 0.36 ± 16% perf-profile.self.cycles-pp.xas_store
> 0.32 ± 30% +0.4 0.71 ± 12% perf-profile.self.cycles-pp.__call_rcu_common
> 0.18 ± 27% +0.5 0.65 ± 8% perf-profile.self.cycles-pp.kmem_cache_alloc_lru
> 0.36 ± 79% +0.6 0.96 ± 15% perf-profile.self.cycles-pp.__slab_free
> 0.00 +0.8 0.80 ± 14% perf-profile.self.cycles-pp.radix_tree_node_rcu_free
> 0.00 +1.0 1.01 ± 16% perf-profile.self.cycles-pp.radix_tree_node_ctor
>
>
> ***************************************************************************************************
> lkp-ivb-2ep1: 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 112G memory
> =========================================================================================
> compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
> gcc-12/performance/x86_64-rhel-8.3/debian-11.1-x86_64-20220510.cgz/lkp-ivb-2ep1/all/aim9/5s
>
> commit:
> 23a31d8764 ("shmem: Refactor shmem_symlink()")
> a2e459555c ("shmem: stable directory offsets")
>
> 23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
> ---------------- ---------------------------
> %stddev %change %stddev
> \ | \
> 9781285 +2.0% 9975309 proc-vmstat.pgalloc_normal
> 4481052 -1.6% 4408359 proc-vmstat.pgfault
> 9749965 +2.0% 9942285 proc-vmstat.pgfree
> 14556 -1.6% 14324 perf-stat.i.minor-faults
> 14556 -1.6% 14324 perf-stat.i.page-faults
> 14505 -1.6% 14272 perf-stat.ps.minor-faults
> 14505 -1.6% 14272 perf-stat.ps.page-faults
> 849714 -3.6% 819341 aim9.brk_test.ops_per_sec
> 478138 +3.1% 492806 aim9.dgram_pipe.ops_per_sec
> 199087 -14.6% 170071 aim9.disk_src.ops_per_sec
> 286595 -9.7% 258794 aim9.link_test.ops_per_sec
> 303603 -2.8% 295009 aim9.page_test.ops_per_sec
> 3692190 -1.7% 3629732 aim9.time.minor_page_faults
> 0.00 +1.0 0.95 ± 25% perf-profile.calltrace.cycles-pp.xas_create.xas_store.__xa_alloc.__xa_alloc_cyclic.simple_offset_add
> 0.00 +1.0 1.01 ± 23% perf-profile.calltrace.cycles-pp.xas_store.__xa_alloc.__xa_alloc_cyclic.simple_offset_add.shmem_mknod
> 1.54 ± 22% +1.1 2.61 ± 22% perf-profile.calltrace.cycles-pp.shmem_mknod.lookup_open.open_last_lookups.path_openat.do_filp_open
> 0.00 +1.2 1.15 ± 21% perf-profile.calltrace.cycles-pp.__xa_alloc.__xa_alloc_cyclic.simple_offset_add.shmem_mknod.lookup_open
> 0.00 +1.2 1.18 ± 21% perf-profile.calltrace.cycles-pp.__xa_alloc_cyclic.simple_offset_add.shmem_mknod.lookup_open.open_last_lookups
> 0.00 +1.2 1.22 ± 21% perf-profile.calltrace.cycles-pp.simple_offset_add.shmem_mknod.lookup_open.open_last_lookups.path_openat
> 0.28 ± 21% +0.2 0.45 ± 24% perf-profile.children.cycles-pp.__call_rcu_common
> 0.00 +0.3 0.26 ± 43% perf-profile.children.cycles-pp.radix_tree_node_rcu_free
> 0.14 ± 46% +0.3 0.45 ± 20% perf-profile.children.cycles-pp.setup_object
> 0.00 +0.3 0.33 ± 24% perf-profile.children.cycles-pp.radix_tree_node_ctor
> 0.16 ± 49% +0.4 0.52 ± 24% perf-profile.children.cycles-pp.shuffle_freelist
> 0.23 ± 43% +0.4 0.63 ± 23% perf-profile.children.cycles-pp.allocate_slab
> 0.30 ± 35% +0.4 0.74 ± 24% perf-profile.children.cycles-pp.___slab_alloc
> 0.17 ± 25% +0.5 0.66 ± 23% perf-profile.children.cycles-pp.shmem_unlink
> 0.00 +0.5 0.49 ± 24% perf-profile.children.cycles-pp.__xa_erase
> 0.00 +0.5 0.52 ± 24% perf-profile.children.cycles-pp.xa_erase
> 0.00 +0.5 0.52 ± 64% perf-profile.children.cycles-pp.xas_expand
> 0.00 +0.5 0.53 ± 24% perf-profile.children.cycles-pp.simple_offset_remove
> 0.87 ± 26% +0.7 1.56 ± 23% perf-profile.children.cycles-pp.kmem_cache_alloc_lru
> 2.44 ± 12% +0.8 3.25 ± 13% perf-profile.children.cycles-pp.__irq_exit_rcu
> 0.00 +0.8 0.82 ± 24% perf-profile.children.cycles-pp.xas_alloc
> 0.01 ±230% +1.0 0.99 ± 23% perf-profile.children.cycles-pp.xas_create
> 1.55 ± 22% +1.1 2.63 ± 22% perf-profile.children.cycles-pp.shmem_mknod
> 0.00 +1.2 1.16 ± 21% perf-profile.children.cycles-pp.__xa_alloc
> 0.00 +1.2 1.18 ± 21% perf-profile.children.cycles-pp.__xa_alloc_cyclic
> 0.00 +1.2 1.22 ± 21% perf-profile.children.cycles-pp.simple_offset_add
> 0.18 ± 28% +1.5 1.65 ± 21% perf-profile.children.cycles-pp.xas_store
> 0.11 ± 31% +0.1 0.25 ± 27% perf-profile.self.cycles-pp.xas_store
> 0.11 ± 31% +0.2 0.28 ± 24% perf-profile.self.cycles-pp.kmem_cache_alloc_lru
> 0.00 +0.3 0.26 ± 44% perf-profile.self.cycles-pp.radix_tree_node_rcu_free
> 0.00 +0.3 0.29 ± 23% perf-profile.self.cycles-pp.radix_tree_node_ctor
>
>
>
> Disclaimer:
> Results have been estimated based on internal Intel analysis and are provided
> for informational purposes only. Any difference in system hardware or software
> design or configuration may affect actual performance.
>
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
>
--
Chuck Lever
* Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression
2023-09-08 14:43 ` Chuck Lever III
@ 2023-09-12 1:25 ` Oliver Sang
2023-09-12 13:01 ` Chuck Lever III
From: Oliver Sang @ 2023-09-12 1:25 UTC (permalink / raw)
To: Chuck Lever III
Cc: oe-lkp, kernel test robot, Linux Kernel Mailing List,
Christian Brauner, linux-mm, ying.huang, feng.tang, fengwei.yin,
oliver.sang
hi, Chuck Lever,
On Fri, Sep 08, 2023 at 02:43:22PM +0000, Chuck Lever III wrote:
>
>
> > On Sep 8, 2023, at 1:26 AM, kernel test robot <oliver.sang@intel.com> wrote:
> >
> >
> >
> > Hello,
> >
> > kernel test robot noticed a -19.0% regression of aim9.disk_src.ops_per_sec on:
> >
> >
> > commit: a2e459555c5f9da3e619b7e47a63f98574dc75f1 ("shmem: stable directory offsets")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >
> > testcase: aim9
> > test machine: 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 112G memory
> > parameters:
> >
> > testtime: 300s
> > test: disk_src
> > cpufreq_governor: performance
> >
> >
> > In addition to that, the commit also has significant impact on the following tests:
> >
> > +------------------+-------------------------------------------------------------------------------------------------+
> > | testcase: change | aim9: aim9.disk_src.ops_per_sec -14.6% regression |
> > | test machine | 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 112G memory |
> > | test parameters | cpufreq_governor=performance |
> > | | test=all |
> > | | testtime=5s |
> > +------------------+-------------------------------------------------------------------------------------------------+
> >
> >
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <oliver.sang@intel.com>
> > | Closes: https://lore.kernel.org/oe-lkp/202309081306.3ecb3734-oliver.sang@intel.com
>
> Hi, several weeks ago we requested that these tests be run
> again by the robot because they can't be run in environments
> I have available to me (the tests do not run on Fedora, and
> I don't have any big iron).
>
> We wanted the tests rerun before the patch was committed.
> There was a deafening silence. So I assumed the work I did
> then to address the regression was successful, and the
> patches are now in upstream Linux.
>
> This new report is disappointing.
I'm so sorry that I missed the test request for
https://lore.kernel.org/all/169030957098.157536.9938425508695693348.stgit@manet.1015granger.net/
Just FYI, when this auto-bisect was done, the head commit of linus/master
[65d6e954e37872fd9afb5ef3fc0481bb3c2f20f4] was also tested. It already includes
2be4f05af71bb ("libfs: Remove parent dentry locking in offset_iterate_dir()"),
and in our tests the regression still exists.
>
> But, I'm still in a position where I can't run this test,
> and the results don't really indicate where the problem
> is. So I can't possibly address this issue.
>
> Any suggestions, advice, or help would be appreciated.
If you have a further fix patch, could you let us know? I will test it.
>
> --
> Chuck Lever
>
>
* Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression
2023-09-12 1:25 ` Oliver Sang
@ 2023-09-12 13:01 ` Chuck Lever III
2023-09-12 13:19 ` Oliver Sang
2023-09-12 15:14 ` Feng Tang
From: Chuck Lever III @ 2023-09-12 13:01 UTC (permalink / raw)
To: Oliver Sang
Cc: oe-lkp, kernel test robot, Linux Kernel Mailing List,
Christian Brauner, linux-mm, ying.huang, feng.tang, fengwei.yin
> On Sep 11, 2023, at 9:25 PM, Oliver Sang <oliver.sang@intel.com> wrote:
>
> hi, Chuck Lever,
>
> On Fri, Sep 08, 2023 at 02:43:22PM +0000, Chuck Lever III wrote:
>>
>>
>>> On Sep 8, 2023, at 1:26 AM, kernel test robot <oliver.sang@intel.com> wrote:
>>>
>>>
>>>
>>> Hello,
>>>
>>> kernel test robot noticed a -19.0% regression of aim9.disk_src.ops_per_sec on:
>>>
>>>
>>> commit: a2e459555c5f9da3e619b7e47a63f98574dc75f1 ("shmem: stable directory offsets")
>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>>
>>> testcase: aim9
>>> test machine: 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 112G memory
>>> parameters:
>>>
>>> testtime: 300s
>>> test: disk_src
>>> cpufreq_governor: performance
>>>
>>>
>>> In addition to that, the commit also has significant impact on the following tests:
>>>
>>> +------------------+-------------------------------------------------------------------------------------------------+
>>> | testcase: change | aim9: aim9.disk_src.ops_per_sec -14.6% regression |
>>> | test machine | 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 112G memory |
>>> | test parameters | cpufreq_governor=performance |
>>> | | test=all |
>>> | | testtime=5s |
>>> +------------------+-------------------------------------------------------------------------------------------------+
>>>
>>>
>>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>>> the same patch/commit), kindly add following tags
>>> | Reported-by: kernel test robot <oliver.sang@intel.com>
>>> | Closes: https://lore.kernel.org/oe-lkp/202309081306.3ecb3734-oliver.sang@intel.com
>>
>> Hi, several weeks ago we requested that these tests be run
>> again by the robot because they can't be run in environments
>> I have available to me (the tests do not run on Fedora, and
>> I don't have any big iron).
>>
>> We wanted the tests rerun before the patch was committed.
>> There was a deafening silence. So I assumed the work I did
>> then to address the regression was successful, and the
>> patches are now in upstream Linux.
>>
>> This new report is disappointing.
>
> I'm so sorry that I missed the test request for
> https://lore.kernel.org/all/169030957098.157536.9938425508695693348.stgit@manet.1015granger.net/
>
> just FYI, when this auto-bisect done,
> head commit of linus/master: [65d6e954e37872fd9afb5ef3fc0481bb3c2f20f4] was
> tested, it already includes
> 2be4f05af71bb libfs: Remove parent dentry locking in offset_iterate_dir()
>
> in our tests, the regression still exists.
Thanks for clarifying. I wondered about that, of course only just
after clicking "Send".
>> But, I'm still in a position where I can't run this test,
>> and the results don't really indicate where the problem
>> is. So I can't possibly address this issue.
>>
>> Any suggestions, advice, or help would be appreciated.
>
> if you have further fix patch, could you let us know? I will test it.
Well that's the problem. Since I can't run the reproducer, there's
nothing I can do to troubleshoot the problem myself.
Is there any hope in getting this reproducer to run on Fedora?
--
Chuck Lever
* Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression
2023-09-12 13:01 ` Chuck Lever III
@ 2023-09-12 13:19 ` Oliver Sang
2023-09-12 15:14 ` Feng Tang
From: Oliver Sang @ 2023-09-12 13:19 UTC (permalink / raw)
To: Chuck Lever III
Cc: oe-lkp, kernel test robot, Linux Kernel Mailing List,
Christian Brauner, linux-mm, ying.huang, feng.tang, fengwei.yin,
oliver.sang
hi, Chuck Lever,
On Tue, Sep 12, 2023 at 01:01:29PM +0000, Chuck Lever III wrote:
>
>
> > in our tests, the regression still exists.
>
> Thanks for clarifying. I wondered about that, of course only just
> after clicking "Send".
>
>
> >> But, I'm still in a position where I can't run this test,
> >> and the results don't really indicate where the problem
> >> is. So I can't possibly address this issue.
> >>
> >> Any suggestions, advice, or help would be appreciated.
> >
> > if you have further fix patch, could you let us know? I will test it.
>
> Well that's the problem. Since I can't run the reproducer, there's
> nothing I can do to troubleshoot the problem myself.
>
> Is there any hope in getting this reproducer to run on Fedora?
Sorry about this; it may be hard for us to support Fedora in the short term.
BTW, we are investigating this regression now and hope to update you
later. Thanks
>
>
> --
> Chuck Lever
>
>
* Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression
2023-09-12 13:01 ` Chuck Lever III
2023-09-12 13:19 ` Oliver Sang
@ 2023-09-12 15:14 ` Feng Tang
2023-09-12 15:26 ` Chuck Lever III
` (2 more replies)
From: Feng Tang @ 2023-09-12 15:14 UTC (permalink / raw)
To: Chuck Lever III
Cc: Sang, Oliver, oe-lkp, lkp, Linux Kernel Mailing List,
Christian Brauner, linux-mm, Huang, Ying, Yin, Fengwei
Hi Chuck Lever,
On Tue, Sep 12, 2023 at 09:01:29PM +0800, Chuck Lever III wrote:
>
>
> > On Sep 11, 2023, at 9:25 PM, Oliver Sang <oliver.sang@intel.com> wrote:
> >
> > hi, Chuck Lever,
> >
> > On Fri, Sep 08, 2023 at 02:43:22PM +0000, Chuck Lever III wrote:
> >>
> >>
> >>> On Sep 8, 2023, at 1:26 AM, kernel test robot <oliver.sang@intel.com> wrote:
> >>>
> >>>
> >>>
> >>> Hello,
> >>>
> >>> kernel test robot noticed a -19.0% regression of aim9.disk_src.ops_per_sec on:
> >>>
> >>>
> >>> commit: a2e459555c5f9da3e619b7e47a63f98574dc75f1 ("shmem: stable directory offsets")
> >>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >>>
> >>> testcase: aim9
> >>> test machine: 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 112G memory
> >>> parameters:
> >>>
> >>> testtime: 300s
> >>> test: disk_src
> >>> cpufreq_governor: performance
> >>>
> >>>
> >>> In addition to that, the commit also has significant impact on the following tests:
> >>>
> >>> +------------------+-------------------------------------------------------------------------------------------------+
> >>> | testcase: change | aim9: aim9.disk_src.ops_per_sec -14.6% regression |
> >>> | test machine | 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 112G memory |
> >>> | test parameters | cpufreq_governor=performance |
> >>> | | test=all |
> >>> | | testtime=5s |
> >>> +------------------+-------------------------------------------------------------------------------------------------+
> >>>
> >>>
> >>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> >>> the same patch/commit), kindly add following tags
> >>> | Reported-by: kernel test robot <oliver.sang@intel.com>
> >>> | Closes: https://lore.kernel.org/oe-lkp/202309081306.3ecb3734-oliver.sang@intel.com
> >> But, I'm still in a position where I can't run this test,
> >> and the results don't really indicate where the problem
> >> is. So I can't possibly address this issue.
> >>
> >> Any suggestions, advice, or help would be appreciated.
> >
> > if you have further fix patch, could you let us know? I will test it.
>
> Well that's the problem. Since I can't run the reproducer, there's
> nothing I can do to troubleshoot the problem myself.
We dug more into the perf and other profiling data from the 0Day server
running this case, and it seems that the new simple_offset_add()
called by shmem_mknod() brings extra cost related to slab allocation,
specifically from the 'radix_tree_node' cache, which causes the regression.
Here is a slabinfo diff between commit a2e459555c5f and its parent:
23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
---------------- ---------------------------
26363 +40.2% 36956 slabinfo.radix_tree_node.active_objs
941.00 +40.4% 1321 slabinfo.radix_tree_node.active_slabs
26363 +40.3% 37001 slabinfo.radix_tree_node.num_objs
941.00 +40.4% 1321 slabinfo.radix_tree_node.num_slabs
The perf profile also shows some differences:
0.01 ±223% +0.1 0.10 ± 28% pp.self.shuffle_freelist
0.00 +0.1 0.11 ± 40% pp.self.xas_create
0.00 +0.1 0.12 ± 27% pp.self.xas_find_marked
0.00 +0.1 0.14 ± 18% pp.self.xas_alloc
0.03 ±103% +0.1 0.17 ± 29% pp.self.xas_descend
0.00 +0.2 0.16 ± 23% pp.self.xas_expand
0.10 ± 22% +0.2 0.27 ± 16% pp.self.rcu_segcblist_enqueue
0.92 ± 35% +0.3 1.22 ± 11% pp.self.kmem_cache_free
0.00 +0.4 0.36 ± 16% pp.self.xas_store
0.32 ± 30% +0.4 0.71 ± 12% pp.self.__call_rcu_common
0.18 ± 27% +0.5 0.65 ± 8% pp.self.kmem_cache_alloc_lru
0.36 ± 79% +0.6 0.96 ± 15% pp.self.__slab_free
0.00 +0.8 0.80 ± 14% pp.self.radix_tree_node_rcu_free
0.00 +1.0 1.01 ± 16% pp.self.radix_tree_node_ctor
Part of the perf profile from a2e459555c5f:
- 17.09% 0.09% singleuser [kernel.kallsyms] [k] path_openat
- 16.99% path_openat
- 12.23% open_last_lookups
- 11.33% lookup_open.isra.0
- 9.05% shmem_mknod
- 5.11% simple_offset_add
- 4.95% __xa_alloc_cyclic
- 4.88% __xa_alloc
- 4.76% xas_store
- xas_create
- 2.40% xas_expand.constprop.0
- 2.01% xas_alloc
- kmem_cache_alloc_lru
- 1.28% ___slab_alloc
- 1.22% allocate_slab
- 1.19% shuffle_freelist
- 1.04% setup_object
radix_tree_node_ctor
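
For reference, here is a rough sketch (simplified and with illustrative
names, not the verbatim fs/libfs.c code) of the allocation pattern that
call chain corresponds to: each file created under shmem gets a
directory offset allocated cyclically from a per-directory xarray, and
each unlink erases it again. The radix_tree_node cost above comes from
the nodes these xarray operations allocate and free.

#include <linux/xarray.h>

struct offset_ctx_sketch {
        struct xarray xa;       /* directory offset -> dentry */
        u32 next_offset;        /* cyclic allocation cursor */
};

static int offset_add_sketch(struct offset_ctx_sketch *octx, void *dentry)
{
        /* offsets 0 and 1 are assumed reserved for "." and ".." */
        struct xa_limit limit = XA_LIMIT(2, U32_MAX);
        u32 offset;
        int ret;

        /* the __xa_alloc_cyclic() -> xas_store() chain in the profile */
        ret = xa_alloc_cyclic(&octx->xa, &offset, dentry, limit,
                              &octx->next_offset, GFP_KERNEL);
        return ret < 0 ? ret : 0;
}

static void offset_remove_sketch(struct offset_ctx_sketch *octx, u32 offset)
{
        /* the xa_erase() -> xas_store() path seen under shmem_unlink() */
        xa_erase(&octx->xa, offset);
}
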
Please let me know if you need more info.
>
> Is there any hope in getting this reproducer to run on Fedora?
I haven't managed to reproduce it locally myself; I will keep trying
tomorrow.
Thanks,
Feng
>
> --
> Chuck Lever
>
>
* Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression
2023-09-12 15:14 ` Feng Tang
@ 2023-09-12 15:26 ` Chuck Lever III
2023-09-12 16:01 ` Matthew Wilcox
2023-09-13 6:47 ` Feng Tang
From: Chuck Lever III @ 2023-09-12 15:26 UTC (permalink / raw)
To: Feng Tang
Cc: Sang, Oliver, oe-lkp, lkp, Linux Kernel Mailing List,
Christian Brauner, linux-mm, Huang, Ying, Yin, Fengwei
> On Sep 12, 2023, at 11:14 AM, Feng Tang <feng.tang@intel.com> wrote:
>
> Hi Chuck Lever,
>
> On Tue, Sep 12, 2023 at 09:01:29PM +0800, Chuck Lever III wrote:
>>
>>
>>> On Sep 11, 2023, at 9:25 PM, Oliver Sang <oliver.sang@intel.com> wrote:
>>>
>>> hi, Chuck Lever,
>>>
>>> On Fri, Sep 08, 2023 at 02:43:22PM +0000, Chuck Lever III wrote:
>>>>
>>>>
>>>>> On Sep 8, 2023, at 1:26 AM, kernel test robot <oliver.sang@intel.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> Hello,
>>>>>
>>>>> kernel test robot noticed a -19.0% regression of aim9.disk_src.ops_per_sec on:
>>>>>
>>>>>
>>>>> commit: a2e459555c5f9da3e619b7e47a63f98574dc75f1 ("shmem: stable directory offsets")
>>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>>>>
>>>>> testcase: aim9
>>>>> test machine: 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 112G memory
>>>>> parameters:
>>>>>
>>>>> testtime: 300s
>>>>> test: disk_src
>>>>> cpufreq_governor: performance
>>>>>
>>>>>
>>>>> In addition to that, the commit also has significant impact on the following tests:
>>>>>
>>>>> +------------------+-------------------------------------------------------------------------------------------------+
>>>>> | testcase: change | aim9: aim9.disk_src.ops_per_sec -14.6% regression |
>>>>> | test machine | 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 112G memory |
>>>>> | test parameters | cpufreq_governor=performance |
>>>>> | | test=all |
>>>>> | | testtime=5s |
>>>>> +------------------+-------------------------------------------------------------------------------------------------+
>>>>>
>>>>>
>>>>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>>>>> the same patch/commit), kindly add following tags
>>>>> | Reported-by: kernel test robot <oliver.sang@intel.com>
>>>>> | Closes: https://lore.kernel.org/oe-lkp/202309081306.3ecb3734-oliver.sang@intel.com
>
>>>> But, I'm still in a position where I can't run this test,
>>>> and the results don't really indicate where the problem
>>>> is. So I can't possibly address this issue.
>>>>
>>>> Any suggestions, advice, or help would be appreciated.
>>>
>>> if you have further fix patch, could you let us know? I will test it.
>>
>> Well that's the problem. Since I can't run the reproducer, there's
>> nothing I can do to troubleshoot the problem myself.
>
> We dug more into the perf and other profiling data from 0Day server
> running this case, and it seems that the new simple_offset_add()
> called by shmem_mknod() brings extra cost related with slab,
> specifically the 'radix_tree_node', which cause the regression.
Thank you! Will ponder.
> Here is some slabinfo diff for commit a2e459555c5f and its parent:
>
> 23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
> ---------------- ---------------------------
>
> 26363 +40.2% 36956 slabinfo.radix_tree_node.active_objs
> 941.00 +40.4% 1321 slabinfo.radix_tree_node.active_slabs
> 26363 +40.3% 37001 slabinfo.radix_tree_node.num_objs
> 941.00 +40.4% 1321 slabinfo.radix_tree_node.num_slabs
>
> Also the perf profile show some difference
>
> 0.01 ±223% +0.1 0.10 ± 28% pp.self.shuffle_freelist
> 0.00 +0.1 0.11 ± 40% pp.self.xas_create
> 0.00 +0.1 0.12 ± 27% pp.self.xas_find_marked
> 0.00 +0.1 0.14 ± 18% pp.self.xas_alloc
> 0.03 ±103% +0.1 0.17 ± 29% pp.self.xas_descend
> 0.00 +0.2 0.16 ± 23% pp.self.xas_expand
> 0.10 ± 22% +0.2 0.27 ± 16% pp.self.rcu_segcblist_enqueue
> 0.92 ± 35% +0.3 1.22 ± 11% pp.self.kmem_cache_free
> 0.00 +0.4 0.36 ± 16% pp.self.xas_store
> 0.32 ± 30% +0.4 0.71 ± 12% pp.self.__call_rcu_common
> 0.18 ± 27% +0.5 0.65 ± 8% pp.self.kmem_cache_alloc_lru
> 0.36 ± 79% +0.6 0.96 ± 15% pp.self.__slab_free
> 0.00 +0.8 0.80 ± 14% pp.self.radix_tree_node_rcu_free
> 0.00 +1.0 1.01 ± 16% pp.self.radix_tree_node_ctor
>
> Some perf profile from a2e459555c5f is:
>
> - 17.09% 0.09% singleuser [kernel.kallsyms] [k] path_openat
> - 16.99% path_openat
> - 12.23% open_last_lookups
> - 11.33% lookup_open.isra.0
> - 9.05% shmem_mknod
> - 5.11% simple_offset_add
> - 4.95% __xa_alloc_cyclic
> - 4.88% __xa_alloc
> - 4.76% xas_store
> - xas_create
> - 2.40% xas_expand.constprop.0
> - 2.01% xas_alloc
> - kmem_cache_alloc_lru
> - 1.28% ___slab_alloc
> - 1.22% allocate_slab
> - 1.19% shuffle_freelist
> - 1.04% setup_object
> radix_tree_node_ctor
>
> Please let me know if you need more info.
>
>>
>> Is there any hope in getting this reproducer to run on Fedora?
>
> Myself haven't succeeded to reproduce it locally, will keep trying
> it tomorrow.
>
> Thanks,
> Feng
>
>>
>> --
>> Chuck Lever
--
Chuck Lever
* Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression
2023-09-12 15:14 ` Feng Tang
2023-09-12 15:26 ` Chuck Lever III
@ 2023-09-12 16:01 ` Matthew Wilcox
2023-09-12 16:27 ` Chuck Lever III
` (2 more replies)
2023-09-13 6:47 ` Feng Tang
From: Matthew Wilcox @ 2023-09-12 16:01 UTC (permalink / raw)
To: Feng Tang
Cc: Chuck Lever III, Sang, Oliver, oe-lkp, lkp,
Linux Kernel Mailing List, Christian Brauner, linux-mm, Huang,
Ying, Yin, Fengwei, Liam R. Howlett
On Tue, Sep 12, 2023 at 11:14:42PM +0800, Feng Tang wrote:
> > Well that's the problem. Since I can't run the reproducer, there's
> > nothing I can do to troubleshoot the problem myself.
>
> We dug more into the perf and other profiling data from 0Day server
> running this case, and it seems that the new simple_offset_add()
> called by shmem_mknod() brings extra cost related with slab,
> specifically the 'radix_tree_node', which cause the regression.
>
> Here is some slabinfo diff for commit a2e459555c5f and its parent:
>
> 23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
> ---------------- ---------------------------
>
> 26363 +40.2% 36956 slabinfo.radix_tree_node.active_objs
> 941.00 +40.4% 1321 slabinfo.radix_tree_node.active_slabs
> 26363 +40.3% 37001 slabinfo.radix_tree_node.num_objs
> 941.00 +40.4% 1321 slabinfo.radix_tree_node.num_slabs
I can't find the benchmark source, but my suspicion is that this
creates and deletes a lot of files in a directory. The 'stable
directory offsets' series uses xa_alloc_cyclic(), so we'll end up
with a very sparse radix tree, i.e. it'll look something like this:
0 - "."
1 - ".."
6 - "d"
27 - "y"
4000 - "fzz"
65537 - "czzz"
643289767 - "bzzzzzz"
(I didn't work out the names precisely here, but this is approximately
what you'd get if you create files a-z, aa-zz, aaa-zzz, etc. and delete
almost all of them)
The radix tree does not handle this well. It'll allocate one node for:
entries 0-63 (covers the first 4 entries)
entries 0-4095
entries 3968-4031 (the first 5)
entries 0-262143
entries 65536-69631
entries 65536-65599 (the first 6)
entries 0-16777215
entries 0-1073741823
entries 637534208-654311423
entries 643039232-643301375
entries 643289088-643293183
entries 643289728-643289791 (all 7)
That ends up being 12 nodes (you get 7 nodes per page) to store 7
pointers. Admittedly to get here, you have to do 643289765 creations
and nearly as many deletions, so are we going to see it in a
non-benchmark situation?
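
If anyone wants to check that arithmetic, below is a stand-alone
userspace sketch (assuming the usual RADIX_TREE_MAP_SHIFT of 6, i.e.
64 slots per node) that counts the distinct nodes needed to hold the
seven surviving offsets from the example:

#include <stdio.h>

#define MAP_SHIFT 6     /* 64 slots per radix tree node */

int main(void)
{
        unsigned long idx[] = { 0, 1, 6, 27, 4000, 65537, 643289767UL };
        int n = sizeof(idx) / sizeof(idx[0]);
        int height = 5;         /* 64^5 slots covers the largest offset */
        int nodes = 0;

        for (int level = 1; level <= height; level++) {
                unsigned long seen[16];
                int distinct = 0;

                for (int i = 0; i < n; i++) {
                        /* which node at this level would hold idx[i]? */
                        unsigned long node = idx[i] >> (level * MAP_SHIFT);
                        int dup = 0;

                        for (int j = 0; j < distinct; j++)
                                if (seen[j] == node)
                                        dup = 1;
                        if (!dup)
                                seen[distinct++] = node;
                }
                nodes += distinct;
        }
        printf("radix tree nodes needed: %d\n", nodes);  /* prints 12 */
        return 0;
}
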
The maple tree is more resilient against this kind of shenanigan, but
we're not there in terms of supporting the kind of allocation you
want. For this kind of allocation pattern, you'd get all 7 pointers
in a single 256-byte node.
* Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression
2023-09-12 16:01 ` Matthew Wilcox
@ 2023-09-12 16:27 ` Chuck Lever III
2023-09-13 17:45 ` Chuck Lever III
2024-01-04 19:33 ` Chuck Lever III
From: Chuck Lever III @ 2023-09-12 16:27 UTC (permalink / raw)
To: Matthew Wilcox, Feng Tang
Cc: Sang, Oliver, oe-lkp, lkp, Linux Kernel Mailing List,
Christian Brauner, linux-mm, Huang, Ying, Yin, Fengwei,
Liam Howlett
> On Sep 12, 2023, at 12:01 PM, Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Sep 12, 2023 at 11:14:42PM +0800, Feng Tang wrote:
>>> Well that's the problem. Since I can't run the reproducer, there's
>>> nothing I can do to troubleshoot the problem myself.
>>
>> We dug more into the perf and other profiling data from 0Day server
>> running this case, and it seems that the new simple_offset_add()
>> called by shmem_mknod() brings extra cost related with slab,
>> specifically the 'radix_tree_node', which cause the regression.
>>
>> Here is some slabinfo diff for commit a2e459555c5f and its parent:
>>
>> 23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
>> ---------------- ---------------------------
>>
>> 26363 +40.2% 36956 slabinfo.radix_tree_node.active_objs
>> 941.00 +40.4% 1321 slabinfo.radix_tree_node.active_slabs
>> 26363 +40.3% 37001 slabinfo.radix_tree_node.num_objs
>> 941.00 +40.4% 1321 slabinfo.radix_tree_node.num_slabs
>
> I can't find the benchmark source, but my suspicion is that this
> creates and deletes a lot of files in a directory. The 'stable
> directory offsets' series uses xa_alloc_cyclic(), so we'll end up
> with a very sparse radix tree. ie it'll look something like this:
>
> 0 - "."
> 1 - ".."
> 6 - "d"
> 27 - "y"
> 4000 - "fzz"
> 65537 - "czzz"
> 643289767 - "bzzzzzz"
>
> (i didn't work out the names precisely here, but this is approximately
> what you'd get if you create files a-z, aa-zz, aaa-zzz, etc and delete
> almost all of them)
>
> The radix tree does not handle this well. It'll allocate one node for:
>
> entries 0-63 (covers the first 4 entries)
> entries 0-4095
> entries 3968-4031 (the first 5)
> entries 0-262143
> entries 65536-69631
> entries 65536-65599 (the first 6)
> entries 0-16777215
> entries 0-1073741823
> entries 637534208-654311423
> entries 643039232-643301375
> entries 643289088-643293183
> entries 643289728-643289791 (all 7)
>
> That ends up being 12 nodes (you get 7 nodes per page) to store 7
> pointers. Admittedly to get here, you have to do 643289765 creations
> and nearly as many deletions, so are we going to see it in a
> non-benchmark situation?
I would imagine that a system that is up for a long time might
see that kind of fragmentation in /tmp, eventually.
> The maple tree is more resilient against this kind of shenanigan, but
> we're not there in terms of supporting the kind of allocation you
> want. For this kind of allocation pattern, you'd get all 7 pointers
> in a single 256-byte node.
We can try switching to Maple tree when it's ready. The simple
directory offset API design, I hope, is reasonably resilient to
changes to the underlying data structure.
Meantime, I'm wondering if we can see the absolute performance
numbers rather than the percentage degradation. I'm curious
whether we're talking about the difference between 2 and 3 us per
operation, or something more substantial.
--
Chuck Lever
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression
2023-09-12 15:14 ` Feng Tang
2023-09-12 15:26 ` Chuck Lever III
2023-09-12 16:01 ` Matthew Wilcox
@ 2023-09-13 6:47 ` Feng Tang
2023-09-13 13:32 ` Chuck Lever III
2 siblings, 1 reply; 15+ messages in thread
From: Feng Tang @ 2023-09-13 6:47 UTC (permalink / raw)
To: Chuck Lever III, willy
Cc: Sang, Oliver, oe-lkp, lkp, Linux Kernel Mailing List,
Christian Brauner, linux-mm, Huang, Ying, Yin, Fengwei
[-- Attachment #1: Type: text/plain, Size: 4313 bytes --]
On Tue, Sep 12, 2023 at 11:14:42PM +0800, Tang, Feng wrote:
[...]
> >
> > Well that's the problem. Since I can't run the reproducer, there's
> > nothing I can do to troubleshoot the problem myself.
>
> We dug more into the perf and other profiling data from 0Day server
> running this case, and it seems that the new simple_offset_add()
> called by shmem_mknod() brings extra cost related with slab,
> specifically the 'radix_tree_node', which cause the regression.
>
> Here is some slabinfo diff for commit a2e459555c5f and its parent:
>
> 23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
> ---------------- ---------------------------
>
> 26363 +40.2% 36956 slabinfo.radix_tree_node.active_objs
> 941.00 +40.4% 1321 slabinfo.radix_tree_node.active_slabs
> 26363 +40.3% 37001 slabinfo.radix_tree_node.num_objs
> 941.00 +40.4% 1321 slabinfo.radix_tree_node.num_slabs
>
> Also the perf profile show some difference
>
> 0.01 ±223% +0.1 0.10 ± 28% pp.self.shuffle_freelist
> 0.00 +0.1 0.11 ± 40% pp.self.xas_create
> 0.00 +0.1 0.12 ± 27% pp.self.xas_find_marked
> 0.00 +0.1 0.14 ± 18% pp.self.xas_alloc
> 0.03 ±103% +0.1 0.17 ± 29% pp.self.xas_descend
> 0.00 +0.2 0.16 ± 23% pp.self.xas_expand
> 0.10 ± 22% +0.2 0.27 ± 16% pp.self.rcu_segcblist_enqueue
> 0.92 ± 35% +0.3 1.22 ± 11% pp.self.kmem_cache_free
> 0.00 +0.4 0.36 ± 16% pp.self.xas_store
> 0.32 ± 30% +0.4 0.71 ± 12% pp.self.__call_rcu_common
> 0.18 ± 27% +0.5 0.65 ± 8% pp.self.kmem_cache_alloc_lru
> 0.36 ± 79% +0.6 0.96 ± 15% pp.self.__slab_free
> 0.00 +0.8 0.80 ± 14% pp.self.radix_tree_node_rcu_free
> 0.00 +1.0 1.01 ± 16% pp.self.radix_tree_node_ctor
>
> Some perf profile from a2e459555c5f is:
>
> - 17.09% 0.09% singleuser [kernel.kallsyms] [k] path_openat
> - 16.99% path_openat
> - 12.23% open_last_lookups
> - 11.33% lookup_open.isra.0
> - 9.05% shmem_mknod
> - 5.11% simple_offset_add
> - 4.95% __xa_alloc_cyclic
> - 4.88% __xa_alloc
> - 4.76% xas_store
> - xas_create
> - 2.40% xas_expand.constprop.0
> - 2.01% xas_alloc
> - kmem_cache_alloc_lru
> - 1.28% ___slab_alloc
> - 1.22% allocate_slab
> - 1.19% shuffle_freelist
> - 1.04% setup_object
> radix_tree_node_ctor
>
> Please let me know if you need more info.
>
> >
> > Is there any hope in getting this reproducer to run on Fedora?
>
> Myself haven't succeeded to reproduce it locally, will keep trying
> it tomorrow.
It can be reproduced on a local machine with CentOS 9 (similar to
Fedora?); the steps are:
* download source code
$ wget https://sourceforge.net/projects/aimbench/files/aim-suite9/Initial%20release/s9110.tar.Z
* untar the file and run 'make'
* create 2 files 's9workfile' and 'test.config'
$ cat s9workfile
# @(#) s9workfile:1.2 1/22/96 00:00:00
# AIM Independent Resource Benchmark - Suite IX Workfile
FILESIZE: 5M
disk_src
$ cat test.config
test
disk_src
200s
/dev/shm/
* run the test with cmd "./singleuser -nl < test.config"
The test case here is 'disk_src', so I picked one file 'disk_src.c'
and attached it for quick reference.
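For a quick look at what the test exercises without the full AIM9
harness, a stripped-down approximation of the disk_src inner loop is
below (this is not the benchmark itself; the file count, directory and
duration are arbitrary stand-ins):

/* Repeatedly creat() and unlink() files on tmpfs, roughly what
 * disk_src does per pass (it also stat()s a set of existing files,
 * which is omitted here).
 */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define NFILES	75		/* disk_src touches 75 names per pass */
#define RUNTIME	200		/* seconds, matching test.config */

int main(void)
{
	char name[64];
	unsigned long passes = 0;
	time_t end = time(NULL) + RUNTIME;

	if (chdir("/dev/shm") < 0) {
		perror("chdir");
		return 1;
	}
	while (time(NULL) < end) {
		for (int i = 0; i < NFILES; i++) {
			int fd;

			snprintf(name, sizeof(name), "dsrc-%d", i);
			fd = creat(name, 0666);
			if (fd < 0) {
				perror("creat");
				return 1;
			}
			close(fd);
			unlink(name);
		}
		passes++;
	}
	printf("%.2f passes/sec\n", (double)passes / RUNTIME);
	return 0;
}

Comparing passes/sec before and after a2e459555c5f should show a
similar trend, though the absolute numbers will differ from the AIM9
score.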
The kernel config of my CentOS is different from what 0Day used, so
the perf profile and performance score are a little different, but
the regression trend is the same: commit a2e459555c5f shows about a
20% drop.
Also, the test platform doesn't matter; I tried several generations
of Xeon servers, which can all reproduce it.
Thanks,
Feng
[-- Attachment #2: disk_src.c --]
[-- Type: text/x-csrc, Size: 9319 bytes --]
/****************************************************************
** **
** Copyright (c) 1996 - 2001 Caldera International, Inc. **
** All Rights Reserved. **
** **
** This program is free software; you can redistribute it **
** and/or modify it under the terms of the GNU General Public **
** License as published by the Free Software Foundation; **
** either version 2 of the License, or (at your option) any **
** later version. **
** **
** This program is distributed in the hope that it will be **
** useful, but WITHOUT ANY WARRANTY; without even the implied **
** warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR **
** PURPOSE. See the GNU General Public License for more **
** details. **
** **
** You should have received a copy of the GNU General Public **
** License along with this program; if not, write to the Free **
** Software Foundation, Inc., 59 Temple Place, Suite 330, **
** Boston, MA 02111-1307 USA **
** **
****************************************************************/
#define _POSIX_SOURCE 1 /* turn on POSIX funct'ns */
#include <stdio.h> /* enable printf(), etc. */
#include <unistd.h> /* for chdir(), etc. */
#include <stdlib.h> /* for malloc(), etc. */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h> /* required for creat */
#include <signal.h>
#include "suite.h" /* our goodies */
static int disk_src();
COUNT_START /* declare counters */
source_file * disk_src_c()
{
static source_file s = { " @(#) disk_src.c:1.10 3/4/94 17:21:22", /* SCCS info */
__FILE__, __DATE__, __TIME__
};
register_test("disk_src", "DISKS", disk_src, 75, "Directory Searches"); /* 100 c and s files in fakeh/dirlist */
return &s;
}
enum choices { STAT = 0, CREAT, MCHOICE }; /* encode choices here */
#define FAKEH "fakeh"
#define MFILES 75 /* number of entries */
#define MYBUF 160 /* MAX size of input line */
#define MSCR 5 /* number of scramble passes */
/*
* dsearch exercises the directory search mechanism of unix systems.
* it is called by the disk test program. dsearch assumes that it is
* invoked with its current directory set to the parent directory
* of the hand created directory that is distributed with the benchmark.
* it assumes that in this directory is a file "dirlist" that provides a list
* of file names under the current directory, along with a list of names to
* search for. some of these names are to be stat'ed while some are to be
* creat'ed
*/
unsigned long mrand();
int get_list(FILE * file,
char *list[MCHOICE][MFILES]);
void scramble(char *list[],
int num);
void cl_list(char *list[MCHOICE][MFILES]);
void errdump(int line,
char *str);
int
dsearch(char *fakeh_dir)
{
FILE *fp; /* file containing filenames */
int
fd, /* file descriptor for creat, etc. */
index; /* loop variable */
struct stat stbuf; /* stat buffer */
char
cwd[256], /* hold current working dir */
errbuf[80], /* build error msgs in here */
*flist[MCHOICE][MFILES]; /* the list of target files */
if (getcwd(cwd, 256) == NULL) {
fprintf(stderr,
"dsearch(): can't get current working directory\n");
return (-1);
}
if (chdir(fakeh_dir) < 0) { /* move to directory */
perror("dsearch()"); /* if error, print it */
errdump(__LINE__, "dsearch(): directory 'fakeh' is inaccessable\n"); /* and dump */
return (-1); /* return failure */
}
if ((fp = fopen("dirlist", "r")) == NULL) { /* open list of filenames */
errdump(__LINE__, "dsearch(): file 'dirlist' is inaccessable\n"); /* handle error */
chdir(cwd); /* move back up */
return (-1); /* return error */
}
/*
* end of error processing
*/
if (get_list(fp, flist) < 0) { /* load the list */
errdump(__LINE__, "dsearch(): file 'dirlist' is corrupted\n"); /* handle errors */
chdir(cwd); /* go back up */
cl_list(flist); /* close list */
return (-1); /* return error */
}
fclose(fp); /* close list of filenames */
scramble(flist[STAT], MFILES); /* scramble names */
scramble(flist[CREAT], MFILES); /* scramble names */
for (index = 0; index < MFILES; index++) { /* loop through files */
if (flist[STAT][index] != NULL) { /* if not null */
if (stat(flist[STAT][index], &stbuf) < 0) { /* stat the file */
perror("stat() in dsearch()"); /* handle errors */
sprintf(errbuf, "dsearch(): can't stat '%s'\n", /* create error message */
flist[STAT][index]);
errdump(__LINE__, errbuf); /* print it */
chdir(cwd); /* return to proper dir */
cl_list(flist); /* clear list */
return (-1); /* return error */
} /* end of error */
}
/*
* end of if not null
*/
if (flist[CREAT][index] != NULL) { /* if creating */
if ((fd = creat(flist[CREAT][index], S_IRWXU | S_IRWXG | S_IRWXO)) < 0) { /* try create */
perror("creat() in dsearch()"); /* handle error */
sprintf(errbuf, "dsearch():can't creat '%s'\n", /* build error message */
flist[CREAT][index]);
errdump(__LINE__, errbuf); /* print it */
chdir(cwd); /* change directories */
cl_list(flist); /* clear list */
return (-1); /* return error */
} /* end of error */
close(fd); /* close the file */
if (unlink(flist[CREAT][index])) { /* unlink it */
perror("unlink() in dsearch()"); /* handle error */
sprintf(errbuf, "dsearch():can't unlink '%s'\n", /* build error message */
flist[CREAT][index]);
errdump(__LINE__, errbuf); /* print it */
chdir(cwd); /* change directories */
cl_list(flist); /* clear list */
return (-1); /* return error */
} /* end of error */
}
/*
* end if creating
*/
COUNT_BUMP;
} /* end of for */
cl_list(flist); /* clear list */
chdir(cwd); /* go back up */
return (0); /* return no error */
}
int
get_list(FILE * file,
char *list[MCHOICE][MFILES])
{
char
buff[MYBUF], /* holds 1 line of input */
*tmp; /* holds malloc results */
int
s_index, c_index, i;
pid_t pid = getpid(); /* process ID, for unique file names */
s_index = c_index = 0; /* initialize indexes */
for (i = 0; i < MFILES; i++) /* initialize array */
list[STAT][i] = list[CREAT][i] = NULL; /* clear to empty */
while (fgets(buff, MYBUF - 1, file) != NULL) { /* get a line */
if (buff[0] != 's' && buff[0] != 'c') /* if it isn't legal, */
continue; /* ignore it */
buff[strlen(buff) - 1] = '\0'; /* eliminate trailing new line */
if ((tmp = malloc(strlen(buff) + 1 + 8)) == NULL) { /* allocate space */
cl_list(list); /* handle bad allocate */
return (-1); /* return error */
}
/*
* end of malloc error chking
*/
strcpy(tmp, buff + 2); /* copy name into buffer */
switch (buff[0]) { /* decide on operation */
case 's': /* if stat */
list[STAT][s_index++] = tmp; /* put it into array */
break; /* and leave */
case 'c': /* if CREAT; Tin Le */
sprintf(tmp, "%s%05d", (buff + 2), pid % 100000); /* make unique name, last 4 digits of pid */
list[CREAT][c_index++] = tmp; /* save it off */
break; /* and leave */
default: /* this cannot be */
errdump(__LINE__, "getlist(): Deadly error encountered\n"); /* print error message */
cl_list(list); /* clear list */
return (-1); /* return error here */
} /* end of switch */
} /* end of loop */
return (1); /* return success */
}
void
scramble(char *list[],
int num)
{
int
i, /* loop variable */
scount, /* scramble count */
rnum; /* random index for scramble */
char
*tmp; /* intermediate location */
for (scount = 0; scount < MSCR; scount++) { /* for number of scrambles */
for (i = 0; i < num; i++) { /* go through list */
rnum = mrand() % num; /* get an index */
tmp = list[i]; /* swap this one */
list[i] = list[rnum]; /* with that one */
list[rnum] = tmp; /* and we're done */
} /* loop through each one */
} /* for each pass */
}
unsigned long
mrand()
{ /* return integer random number */
return ((unsigned long)aim_rand()); /* do it */
}
void
cl_list(char *list[MCHOICE][MFILES])
{
int index; /* loop variable */
for (index = 0; index < MFILES; index++) { /* loop through all entries */
if (list[STAT][index] != NULL) /* if string resides here */
free(list[STAT][index]); /* free it to heap */
if (list[CREAT][index] != NULL) /* if string here */
free(list[CREAT][index]); /* free it */
} /* end of loop */
}
void
errdump(int line,
char *str)
{ /* print error message */
fprintf(stderr,
"Error in file %s (compiled at %s on %s) from line %d:\n\t%s",
__FILE__, __TIME__, __DATE__, line, str);
}
static int
disk_src(char *argv,
Result * res)
{
int i;
char fakeh_dir[128];
if (*argv)
sprintf(fakeh_dir, "%s/%s", argv, FAKEH);
else
strcpy(fakeh_dir, FAKEH);
i = dsearch(fakeh_dir);
COUNT_END("disk_src");
return (res->i = i);
}
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression
2023-09-13 6:47 ` Feng Tang
@ 2023-09-13 13:32 ` Chuck Lever III
0 siblings, 0 replies; 15+ messages in thread
From: Chuck Lever III @ 2023-09-13 13:32 UTC (permalink / raw)
To: Feng Tang
Cc: Matthew Wilcox, Sang, Oliver, oe-lkp, lkp,
Linux Kernel Mailing List, Christian Brauner, linux-mm, Huang,
Ying, Yin, Fengwei
> On Sep 13, 2023, at 2:47 AM, Feng Tang <feng.tang@intel.com> wrote:
>
> On Tue, Sep 12, 2023 at 11:14:42PM +0800, Tang, Feng wrote:
> [...]
>>>
>>> Well that's the problem. Since I can't run the reproducer, there's
>>> nothing I can do to troubleshoot the problem myself.
>>
>> We dug more into the perf and other profiling data from 0Day server
>> running this case, and it seems that the new simple_offset_add()
>> called by shmem_mknod() brings extra cost related with slab,
>> specifically the 'radix_tree_node', which cause the regression.
>>
>> Here is some slabinfo diff for commit a2e459555c5f and its parent:
>>
>> 23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
>> ---------------- ---------------------------
>>
>> 26363 +40.2% 36956 slabinfo.radix_tree_node.active_objs
>> 941.00 +40.4% 1321 slabinfo.radix_tree_node.active_slabs
>> 26363 +40.3% 37001 slabinfo.radix_tree_node.num_objs
>> 941.00 +40.4% 1321 slabinfo.radix_tree_node.num_slabs
>>
>> Also the perf profile show some difference
>>
>> 0.01 ±223% +0.1 0.10 ± 28% pp.self.shuffle_freelist
>> 0.00 +0.1 0.11 ± 40% pp.self.xas_create
>> 0.00 +0.1 0.12 ± 27% pp.self.xas_find_marked
>> 0.00 +0.1 0.14 ± 18% pp.self.xas_alloc
>> 0.03 ±103% +0.1 0.17 ± 29% pp.self.xas_descend
>> 0.00 +0.2 0.16 ± 23% pp.self.xas_expand
>> 0.10 ± 22% +0.2 0.27 ± 16% pp.self.rcu_segcblist_enqueue
>> 0.92 ± 35% +0.3 1.22 ± 11% pp.self.kmem_cache_free
>> 0.00 +0.4 0.36 ± 16% pp.self.xas_store
>> 0.32 ± 30% +0.4 0.71 ± 12% pp.self.__call_rcu_common
>> 0.18 ± 27% +0.5 0.65 ± 8% pp.self.kmem_cache_alloc_lru
>> 0.36 ± 79% +0.6 0.96 ± 15% pp.self.__slab_free
>> 0.00 +0.8 0.80 ± 14% pp.self.radix_tree_node_rcu_free
>> 0.00 +1.0 1.01 ± 16% pp.self.radix_tree_node_ctor
>>
>> Some perf profile from a2e459555c5f is:
>>
>> - 17.09% 0.09% singleuser [kernel.kallsyms] [k] path_openat
>> - 16.99% path_openat
>> - 12.23% open_last_lookups
>> - 11.33% lookup_open.isra.0
>> - 9.05% shmem_mknod
>> - 5.11% simple_offset_add
>> - 4.95% __xa_alloc_cyclic
>> - 4.88% __xa_alloc
>> - 4.76% xas_store
>> - xas_create
>> - 2.40% xas_expand.constprop.0
>> - 2.01% xas_alloc
>> - kmem_cache_alloc_lru
>> - 1.28% ___slab_alloc
>> - 1.22% allocate_slab
>> - 1.19% shuffle_freelist
>> - 1.04% setup_object
>> radix_tree_node_ctor
>>
>> Please let me know if you need more info.
>>
>>>
>>> Is there any hope in getting this reproducer to run on Fedora?
>>
>> Myself haven't succeeded to reproduce it locally, will keep trying
>> it tomorrow.
>
> It can be reproduced on a local machien with CentOS 9 (similar to
> Fedora ?),
Excellent, I appreciate this!
> and some steps as:
>
> * download source code
> $ wget https://sourceforge.net/projects/aimbench/files/aim-suite9/Initial%20release/s9110.tar.Z
>
> * untar the file and run 'make'
>
> * create 2 files 's9workfile' and 'test.config'
>
> $ cat s9workfile
> # @(#) s9workfile:1.2 1/22/96 00:00:00
> # AIM Independent Resource Benchmark - Suite IX Workfile
> FILESIZE: 5M
> disk_src
>
> $ cat test.config
> test
> disk_src
> 200s
> /dev/shm/
>
> * run the test with cmd "./singleuser -nl < test.config"
>
> The test case here is 'disk_src', so I picked one file 'disk_src.c'
> and attached it for quick reference.
>
> The kernel config of my CentOS is different from what 0Day used, so
> the perf-profile and peformance score are a little different, but
> the regression trend is the same, that commit a2e459555c5f has about
> 20% drop.
>
> Also the test platform doesn't matter, I tried on several generation
> of Xeon servers which can all reproduce it.
--
Chuck Lever
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression
2023-09-12 16:01 ` Matthew Wilcox
2023-09-12 16:27 ` Chuck Lever III
@ 2023-09-13 17:45 ` Chuck Lever III
2024-01-04 19:33 ` Chuck Lever III
2 siblings, 0 replies; 15+ messages in thread
From: Chuck Lever III @ 2023-09-13 17:45 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Feng Tang, Sang, Oliver, oe-lkp, lkp, Linux Kernel Mailing List,
Christian Brauner, linux-mm, Huang, Ying, Yin, Fengwei,
Liam Howlett
> On Sep 12, 2023, at 12:01 PM, Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Sep 12, 2023 at 11:14:42PM +0800, Feng Tang wrote:
>>> Well that's the problem. Since I can't run the reproducer, there's
>>> nothing I can do to troubleshoot the problem myself.
>>
>> We dug more into the perf and other profiling data from 0Day server
>> running this case, and it seems that the new simple_offset_add()
>> called by shmem_mknod() brings extra cost related with slab,
>> specifically the 'radix_tree_node', which cause the regression.
>>
>> Here is some slabinfo diff for commit a2e459555c5f and its parent:
>>
>> 23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
>> ---------------- ---------------------------
>>
>> 26363 +40.2% 36956 slabinfo.radix_tree_node.active_objs
>> 941.00 +40.4% 1321 slabinfo.radix_tree_node.active_slabs
>> 26363 +40.3% 37001 slabinfo.radix_tree_node.num_objs
>> 941.00 +40.4% 1321 slabinfo.radix_tree_node.num_slabs
>
> I can't find the benchmark source, but my suspicion is that this
> creates and deletes a lot of files in a directory. The 'stable
> directory offsets' series uses xa_alloc_cyclic(), so we'll end up
> with a very sparse radix tree. ie it'll look something like this:
>
> 0 - "."
> 1 - ".."
> 6 - "d"
> 27 - "y"
> 4000 - "fzz"
> 65537 - "czzz"
> 643289767 - "bzzzzzz"
>
> (i didn't work out the names precisely here, but this is approximately
> what you'd get if you create files a-z, aa-zz, aaa-zzz, etc and delete
> almost all of them)
>
> The radix tree does not handle this well. It'll allocate one node for:
>
> entries 0-63 (covers the first 4 entries)
> entries 0-4095
> entries 3968-4031 (the first 5)
> entries 0-262143
> entries 65536-69631
> entries 65536-65599 (the first 6)
> entries 0-16777215
> entries 0-1073741823
> entries 637534208-654311423
> entries 643039232-643301375
> entries 643289088-643293183
> entries 643289728-643289791 (all 7)
>
> That ends up being 12 nodes (you get 7 nodes per page) to store 7
> pointers.
I'm able to run the reproducer Feng provided. simple_offset_add()
nearly doubles the cost of shmem_mknod() thanks to the memory
allocations done in xas_create().
However, tmpfs is already fast compared to persistent filesystems.
For instance, even with the simple_offset patch applied:
tmpfs: 158079.00 Directory Searches/second
btrfs: 64978.88 Directory Searches/second
> Admittedly to get here, you have to do 643289765 creations
> and nearly as many deletions, so are we going to see it in a
> non-benchmark situation?
Most directories in a tmpfs have a limited lifespan and thus are
unlikely to live long enough to be affected by this issue. The
only one that has a rather unlimited lifespan is the root
directory.
It's hard for me to tell whether this is a pervasive problem
or one we can live with until we find a more suitable data
structure. IMO the benefit of having stable directory offsets
far outweighs the eventual slowdown in the root directory.
--
Chuck Lever
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression
2023-09-12 16:01 ` Matthew Wilcox
2023-09-12 16:27 ` Chuck Lever III
2023-09-13 17:45 ` Chuck Lever III
@ 2024-01-04 19:33 ` Chuck Lever III
2024-01-05 16:27 ` Liam R. Howlett
2 siblings, 1 reply; 15+ messages in thread
From: Chuck Lever III @ 2024-01-04 19:33 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Feng Tang, Sang, Oliver, oe-lkp, lkp, Linux Kernel Mailing List,
Christian Brauner, linux-mm, Huang, Ying, Yin, Fengwei,
Liam Howlett
> On Sep 12, 2023, at 12:01 PM, Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Sep 12, 2023 at 11:14:42PM +0800, Feng Tang wrote:
>>> Well that's the problem. Since I can't run the reproducer, there's
>>> nothing I can do to troubleshoot the problem myself.
>>
>> We dug more into the perf and other profiling data from 0Day server
>> running this case, and it seems that the new simple_offset_add()
>> called by shmem_mknod() brings extra cost related with slab,
>> specifically the 'radix_tree_node', which cause the regression.
>>
>> Here is some slabinfo diff for commit a2e459555c5f and its parent:
>>
>> 23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
>> ---------------- ---------------------------
>>
>> 26363 +40.2% 36956 slabinfo.radix_tree_node.active_objs
>> 941.00 +40.4% 1321 slabinfo.radix_tree_node.active_slabs
>> 26363 +40.3% 37001 slabinfo.radix_tree_node.num_objs
>> 941.00 +40.4% 1321 slabinfo.radix_tree_node.num_slabs
>
> I can't find the benchmark source, but my suspicion is that this
> creates and deletes a lot of files in a directory. The 'stable
> directory offsets' series uses xa_alloc_cyclic(), so we'll end up
> with a very sparse radix tree. ie it'll look something like this:
>
> 0 - "."
> 1 - ".."
> 6 - "d"
> 27 - "y"
> 4000 - "fzz"
> 65537 - "czzz"
> 643289767 - "bzzzzzz"
>
> (i didn't work out the names precisely here, but this is approximately
> what you'd get if you create files a-z, aa-zz, aaa-zzz, etc and delete
> almost all of them)
>
> The radix tree does not handle this well. It'll allocate one node for:
>
> entries 0-63 (covers the first 4 entries)
> entries 0-4095
> entries 3968-4031 (the first 5)
> entries 0-262143
> entries 65536-69631
> entries 65536-65599 (the first 6)
> entries 0-16777215
> entries 0-1073741823
> entries 637534208-654311423
> entries 643039232-643301375
> entries 643289088-643293183
> entries 643289728-643289791 (all 7)
>
> That ends up being 12 nodes (you get 7 nodes per page) to store 7
> pointers. Admittedly to get here, you have to do 643289765 creations
> and nearly as many deletions, so are we going to see it in a
> non-benchmark situation?
>
> The maple tree is more resilient against this kind of shenanigan, but
> we're not there in terms of supporting the kind of allocation you
> want. For this kind of allocation pattern, you'd get all 7 pointers
> in a single 256-byte node.
Hello Matthew, it's been a couple of kernel releases, so
following up.
Is Maple tree ready for libfs to use it for managing directory
offsets?
Should we just go for broke and convert libfs from xarray to
Maple tree now?
--
Chuck Lever
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression
2024-01-04 19:33 ` Chuck Lever III
@ 2024-01-05 16:27 ` Liam R. Howlett
2024-01-05 16:33 ` Chuck Lever III
0 siblings, 1 reply; 15+ messages in thread
From: Liam R. Howlett @ 2024-01-05 16:27 UTC (permalink / raw)
To: Chuck Lever III
Cc: Matthew Wilcox, Feng Tang, Sang, Oliver, oe-lkp, lkp,
Linux Kernel Mailing List, Christian Brauner, linux-mm, Huang,
Ying, Yin, Fengwei
* Chuck Lever III <chuck.lever@oracle.com> [240104 14:33]:
>
>
> > On Sep 12, 2023, at 12:01 PM, Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, Sep 12, 2023 at 11:14:42PM +0800, Feng Tang wrote:
> >>> Well that's the problem. Since I can't run the reproducer, there's
> >>> nothing I can do to troubleshoot the problem myself.
> >>
> >> We dug more into the perf and other profiling data from 0Day server
> >> running this case, and it seems that the new simple_offset_add()
> >> called by shmem_mknod() brings extra cost related with slab,
> >> specifically the 'radix_tree_node', which cause the regression.
> >>
> >> Here is some slabinfo diff for commit a2e459555c5f and its parent:
> >>
> >> 23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
> >> ---------------- ---------------------------
> >>
> >> 26363 +40.2% 36956 slabinfo.radix_tree_node.active_objs
> >> 941.00 +40.4% 1321 slabinfo.radix_tree_node.active_slabs
> >> 26363 +40.3% 37001 slabinfo.radix_tree_node.num_objs
> >> 941.00 +40.4% 1321 slabinfo.radix_tree_node.num_slabs
> >
> > I can't find the benchmark source, but my suspicion is that this
> > creates and deletes a lot of files in a directory. The 'stable
> > directory offsets' series uses xa_alloc_cyclic(), so we'll end up
> > with a very sparse radix tree. ie it'll look something like this:
> >
> > 0 - "."
> > 1 - ".."
> > 6 - "d"
> > 27 - "y"
> > 4000 - "fzz"
> > 65537 - "czzz"
> > 643289767 - "bzzzzzz"
> >
> > (i didn't work out the names precisely here, but this is approximately
> > what you'd get if you create files a-z, aa-zz, aaa-zzz, etc and delete
> > almost all of them)
> >
> > The radix tree does not handle this well. It'll allocate one node for:
> >
> > entries 0-63 (covers the first 4 entries)
> > entries 0-4095
> > entries 3968-4031 (the first 5)
> > entries 0-262143
> > entries 65536-69631
> > entries 65536-65599 (the first 6)
> > entries 0-16777215
> > entries 0-1073741823
> > entries 637534208-654311423
> > entries 643039232-643301375
> > entries 643289088-643293183
> > entries 643289728-643289791 (all 7)
> >
> > That ends up being 12 nodes (you get 7 nodes per page) to store 7
> > pointers. Admittedly to get here, you have to do 643289765 creations
> > and nearly as many deletions, so are we going to see it in a
> > non-benchmark situation?
> >
> > The maple tree is more resilient against this kind of shenanigan, but
> > we're not there in terms of supporting the kind of allocation you
> > want. For this kind of allocation pattern, you'd get all 7 pointers
> > in a single 256-byte node.
>
> Hello Matthew, it's been a couple of kernel releases, so
> following up.
>
> Is Maple tree ready for libfs to use it for managing directory
> offsets?
The feature you are looking for is dense nodes. It will allow for
a compact tree when you have a number of single indexes mapping to
entries (ideal for many ranges of 1).
I'm actively working on dense nodes, which will yield 31 entries per
node when they are single index mappings. I'm hoping to have it
completed in the next few weeks and start beating it up with tests
before pushing it out.
>
> Should we just go for broke and convert libfs from xarray to
> Maple tree now?
We are trying to keep the APIs close between the xarray and the maple
tree, so if you do the conversion to one, switching to the other
shouldn't be much work. I'd try the maple tree to see if the performance is
acceptable today (I may be biased), but I don't know how big of an
effort this conversion would entail.
The maple tree will compress the NULL indexes to a single entry of NULL.
My main concern is the density of information and the number of
allocations the tree will do to keep up with the workload - both will
improve with the dense nodes feature.
If you convert to maple tree, you will get the update for free later as
the node type the tree chooses will be transparent to users.
If you need tagging then you should use the xarray as I haven't started
that feature yet - but I don't think you need that?
I also noticed that Matthew mentioned xa_alloc_cyclic() as the potential
call into the xarray. The maple tree counterpart isn't used much today
and may need to be optimised. If you can verify what this test does, we
could produce a test case for the maple tree test suite and optimise if
necessary.
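To make the scope of a conversion concrete, here is a rough sketch of
what the offset map could look like on top of the maple tree allocation
API. This assumes mtree_alloc_range() behaves as declared in
<linux/maple_tree.h> and hand-rolls the cyclic cursor, since a dedicated
cyclic helper may not be available yet; every name below is illustrative
rather than proposed code:

#include <linux/maple_tree.h>

#define DIR_OFFSET_MIN	2UL	/* 0 and 1 reserved for "." and ".." */

struct offset_mt_ctx {
	struct maple_tree mt;		/* maps directory offset -> dentry */
	unsigned long next_offset;	/* hand-rolled cyclic cursor */
};

static void offset_mt_init(struct offset_mt_ctx *octx)
{
	mt_init_flags(&octx->mt, MT_FLAGS_ALLOC_RANGE);
	octx->next_offset = DIR_OFFSET_MIN;
}

static int offset_mt_add(struct offset_mt_ctx *octx, void *dentry,
			 unsigned long *offset)
{
	int ret;

	/* Try from the cursor first, then wrap around to the low bound. */
	ret = mtree_alloc_range(&octx->mt, offset, dentry, 1,
				octx->next_offset, ULONG_MAX, GFP_KERNEL);
	if (ret == -EBUSY)
		ret = mtree_alloc_range(&octx->mt, offset, dentry, 1,
					DIR_OFFSET_MIN, ULONG_MAX, GFP_KERNEL);
	if (ret == 0)
		octx->next_offset = *offset + 1;
	return ret;
}

static void offset_mt_remove(struct offset_mt_ctx *octx, unsigned long offset)
{
	mtree_erase(&octx->mt, offset);
}

If that shape holds, the directory-offset users in libfs would mostly
need a mechanical replacement of the xarray calls, plus whatever
cyclic-allocation helper ends up being the supported interface.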
Let us know if you have any other questions or need some pointers on how
to get started with a conversion.
Thanks,
Liam
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression
2024-01-05 16:27 ` Liam R. Howlett
@ 2024-01-05 16:33 ` Chuck Lever III
0 siblings, 0 replies; 15+ messages in thread
From: Chuck Lever III @ 2024-01-05 16:33 UTC (permalink / raw)
To: Liam Howlett
Cc: Matthew Wilcox, Feng Tang, Sang, Oliver, oe-lkp, lkp,
Linux Kernel Mailing List, Christian Brauner, linux-mm, Huang,
Ying, Yin, Fengwei
> On Jan 5, 2024, at 11:27 AM, Liam Howlett <liam.howlett@oracle.com> wrote:
>
> * Chuck Lever III <chuck.lever@oracle.com> [240104 14:33]:
>>
>>
>>> On Sep 12, 2023, at 12:01 PM, Matthew Wilcox <willy@infradead.org> wrote:
>>>
>>> On Tue, Sep 12, 2023 at 11:14:42PM +0800, Feng Tang wrote:
>>>>> Well that's the problem. Since I can't run the reproducer, there's
>>>>> nothing I can do to troubleshoot the problem myself.
>>>>
>>>> We dug more into the perf and other profiling data from 0Day server
>>>> running this case, and it seems that the new simple_offset_add()
>>>> called by shmem_mknod() brings extra cost related with slab,
>>>> specifically the 'radix_tree_node', which cause the regression.
>>>>
>>>> Here is some slabinfo diff for commit a2e459555c5f and its parent:
>>>>
>>>> 23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
>>>> ---------------- ---------------------------
>>>>
>>>> 26363 +40.2% 36956 slabinfo.radix_tree_node.active_objs
>>>> 941.00 +40.4% 1321 slabinfo.radix_tree_node.active_slabs
>>>> 26363 +40.3% 37001 slabinfo.radix_tree_node.num_objs
>>>> 941.00 +40.4% 1321 slabinfo.radix_tree_node.num_slabs
>>>
>>> I can't find the benchmark source, but my suspicion is that this
>>> creates and deletes a lot of files in a directory. The 'stable
>>> directory offsets' series uses xa_alloc_cyclic(), so we'll end up
>>> with a very sparse radix tree. ie it'll look something like this:
>>>
>>> 0 - "."
>>> 1 - ".."
>>> 6 - "d"
>>> 27 - "y"
>>> 4000 - "fzz"
>>> 65537 - "czzz"
>>> 643289767 - "bzzzzzz"
>>>
>>> (i didn't work out the names precisely here, but this is approximately
>>> what you'd get if you create files a-z, aa-zz, aaa-zzz, etc and delete
>>> almost all of them)
>>>
>>> The radix tree does not handle this well. It'll allocate one node for:
>>>
>>> entries 0-63 (covers the first 4 entries)
>>> entries 0-4095
>>> entries 3968-4031 (the first 5)
>>> entries 0-262143
>>> entries 65536-69631
>>> entries 65536-65599 (the first 6)
>>> entries 0-16777215
>>> entries 0-1073741823
>>> entries 637534208-654311423
>>> entries 643039232-643301375
>>> entries 643289088-643293183
>>> entries 643289728-643289791 (all 7)
>>>
>>> That ends up being 12 nodes (you get 7 nodes per page) to store 7
>>> pointers. Admittedly to get here, you have to do 643289765 creations
>>> and nearly as many deletions, so are we going to see it in a
>>> non-benchmark situation?
>>>
>>> The maple tree is more resilient against this kind of shenanigan, but
>>> we're not there in terms of supporting the kind of allocation you
>>> want. For this kind of allocation pattern, you'd get all 7 pointers
>>> in a single 256-byte node.
>>
>> Hello Matthew, it's been a couple of kernel releases, so
>> following up.
>>
>> Is Maple tree ready for libfs to use it for managing directory
>> offsets?
>
> The feature you are looking for is dense nodes. It will allow for
> a compact tree when you have a number of single indexes mapping to
> entries (ideal for many ranges of 1).
>
> I'm actively working on dense nodes, which will yield 31 entries per
> node when they are single index mappings. I'm hoping to have it
> completed in the next few weeks and start beating it up with tests
> before pushing it out.
>
>>
>> Should we just go for broke and convert libfs from xarray to
>> Maple tree now?
>
> We are trying to keep the API close for both the xarray and the maple
> tree, so if you do the conversion to one then switching shouldn't be
> much work. I'd try the maple tree to see if the performance is
> acceptable today (I may be biased), but I don't know how big of an
> effort this conversion would entail.
>
> The maple tree will compress the NULL indexes to a single entry of NULL.
> My main concern is the density of information and the number of
> allocations the tree will do to keep up with the workload - both will
> improve with the dense nodes feature.
>
> If you convert to maple tree, you will get the update for free later as
> the node type the tree chooses will be transparent to users.
>
> If you need tagging then you should use the xarray as I haven't started
> that feature yet - but I don't think you need that?
I don't recall using xarray tags for directory offset mapping.
> I also noticed that Matthew mentioned xa_alloc_cyclic() as the potential
> call into the xarray. The maple tree counterpart isn't used much today
> and may need to be optimised. If you can verify what this test does, we
> could produce a test case for the maple tree test suite and optimise if
> necessary.
>
> Let us know if you have any other questions or need some pointers on how
> to get started with a conversion.
Sounds like conversion is worth starting on, at least. I'll try
to clear some time to work on it.
--
Chuck Lever
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2024-01-05 16:33 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-08 5:26 [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression kernel test robot
2023-09-08 14:43 ` Chuck Lever III
2023-09-12 1:25 ` Oliver Sang
2023-09-12 13:01 ` Chuck Lever III
2023-09-12 13:19 ` Oliver Sang
2023-09-12 15:14 ` Feng Tang
2023-09-12 15:26 ` Chuck Lever III
2023-09-12 16:01 ` Matthew Wilcox
2023-09-12 16:27 ` Chuck Lever III
2023-09-13 17:45 ` Chuck Lever III
2024-01-04 19:33 ` Chuck Lever III
2024-01-05 16:27 ` Liam R. Howlett
2024-01-05 16:33 ` Chuck Lever III
2023-09-13 6:47 ` Feng Tang
2023-09-13 13:32 ` Chuck Lever III