* [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression
@ 2023-09-08 5:26 kernel test robot
2023-09-08 14:43 ` Chuck Lever III
From: kernel test robot @ 2023-09-08 5:26 UTC (permalink / raw)
To: Chuck Lever
Cc: oe-lkp, lkp, linux-kernel, Christian Brauner, linux-mm,
ying.huang, feng.tang, fengwei.yin, oliver.sang
Hello,
kernel test robot noticed a -19.0% regression of aim9.disk_src.ops_per_sec on:
commit: a2e459555c5f9da3e619b7e47a63f98574dc75f1 ("shmem: stable directory offsets")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
testcase: aim9
test machine: 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 112G memory
parameters:
testtime: 300s
test: disk_src
cpufreq_governor: performance
In addition to that, the commit also has significant impact on the following tests:
+------------------+-------------------------------------------------------------------------------------------------+
| testcase: change | aim9: aim9.disk_src.ops_per_sec -14.6% regression |
| test machine | 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 112G memory |
| test parameters | cpufreq_governor=performance |
| | test=all |
| | testtime=5s |
+------------------+-------------------------------------------------------------------------------------------------+
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags:
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202309081306.3ecb3734-oliver.sang@intel.com
Details are as below:
-------------------------------------------------------------------------------------------------->
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20230908/202309081306.3ecb3734-oliver.sang@intel.com
=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
gcc-12/performance/x86_64-rhel-8.3/debian-11.1-x86_64-20220510.cgz/lkp-ivb-2ep1/disk_src/aim9/300s
commit:
23a31d8764 ("shmem: Refactor shmem_symlink()")
a2e459555c ("shmem: stable directory offsets")
23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
---------------- ---------------------------
%stddev %change %stddev
\ | \
0.26 ± 9% +0.1 0.36 ± 2% mpstat.cpu.all.soft%
0.61 -0.1 0.52 mpstat.cpu.all.usr%
0.16 ± 10% -18.9% 0.13 ± 12% perf-sched.sch_delay.max.ms.pipe_read.vfs_read.ksys_read.do_syscall_64
0.04 ± 7% +1802.4% 0.78 ±115% perf-sched.sch_delay.max.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
202424 -19.0% 163868 aim9.disk_src.ops_per_sec
94.83 -4.2% 90.83 aim9.time.percent_of_cpu_this_job_got
73.62 -17.6% 60.69 aim9.time.user_time
23541 +6.5% 25074 proc-vmstat.nr_slab_reclaimable
1437319 ± 24% +377.6% 6864201 proc-vmstat.numa_hit
1387016 ± 25% +391.4% 6815486 proc-vmstat.numa_local
4864362 ± 34% +453.6% 26931180 proc-vmstat.pgalloc_normal
4835960 ± 34% +455.4% 26856610 proc-vmstat.pgfree
538959 ± 24% -23.2% 414090 sched_debug.cfs_rq:/.load.max
130191 ± 14% -13.3% 112846 ± 6% sched_debug.cfs_rq:/.load.stddev
116849 ± 27% -51.2% 56995 ± 20% sched_debug.cfs_rq:/.min_vruntime.max
1223 ±191% -897.4% -9754 sched_debug.cfs_rq:/.spread0.avg
107969 ± 29% -65.3% 37448 ± 39% sched_debug.cfs_rq:/.spread0.max
55209 ± 14% -21.8% 43154 ± 14% sched_debug.cpu.nr_switches.max
11.21 +23.7% 13.87 perf-stat.i.MPKI
7.223e+08 -4.4% 6.907e+08 perf-stat.i.branch-instructions
2.67 +0.2 2.88 perf-stat.i.branch-miss-rate%
19988363 +2.8% 20539702 perf-stat.i.branch-misses
17.36 -2.8 14.59 perf-stat.i.cache-miss-rate%
40733859 +19.5% 48659982 perf-stat.i.cache-references
1.76 +3.5% 1.82 perf-stat.i.cpi
55.21 +5.4% 58.21 ± 2% perf-stat.i.cpu-migrations
1.01e+09 -3.8% 9.719e+08 perf-stat.i.dTLB-loads
0.26 ± 4% -0.0 0.23 ± 3% perf-stat.i.dTLB-store-miss-rate%
2166022 ± 4% -6.9% 2015917 ± 3% perf-stat.i.dTLB-store-misses
8.503e+08 +5.5% 8.968e+08 perf-stat.i.dTLB-stores
69.22 ± 4% +6.4 75.60 perf-stat.i.iTLB-load-miss-rate%
316455 ± 12% -31.6% 216531 ± 3% perf-stat.i.iTLB-loads
3.722e+09 -3.1% 3.608e+09 perf-stat.i.instructions
0.57 -3.3% 0.55 perf-stat.i.ipc
865.04 -10.4% 775.02 ± 3% perf-stat.i.metric.K/sec
47.51 -2.1 45.37 perf-stat.i.node-load-miss-rate%
106705 ± 3% +14.8% 122490 ± 5% perf-stat.i.node-loads
107169 ± 4% +29.0% 138208 ± 7% perf-stat.i.node-stores
10.94 +23.3% 13.49 perf-stat.overall.MPKI
2.77 +0.2 2.97 perf-stat.overall.branch-miss-rate%
17.28 -2.7 14.56 perf-stat.overall.cache-miss-rate%
1.73 +3.4% 1.79 perf-stat.overall.cpi
0.25 ± 4% -0.0 0.22 ± 3% perf-stat.overall.dTLB-store-miss-rate%
69.20 ± 4% +6.4 75.60 perf-stat.overall.iTLB-load-miss-rate%
0.58 -3.2% 0.56 perf-stat.overall.ipc
45.25 -2.2 43.10 perf-stat.overall.node-load-miss-rate%
7.199e+08 -4.4% 6.883e+08 perf-stat.ps.branch-instructions
19919808 +2.8% 20469001 perf-stat.ps.branch-misses
40597326 +19.5% 48497201 perf-stat.ps.cache-references
55.06 +5.4% 58.03 ± 2% perf-stat.ps.cpu-migrations
1.007e+09 -3.8% 9.686e+08 perf-stat.ps.dTLB-loads
2158768 ± 4% -6.9% 2009174 ± 3% perf-stat.ps.dTLB-store-misses
8.475e+08 +5.5% 8.937e+08 perf-stat.ps.dTLB-stores
315394 ± 12% -31.6% 215816 ± 3% perf-stat.ps.iTLB-loads
3.71e+09 -3.1% 3.595e+09 perf-stat.ps.instructions
106351 ± 3% +14.8% 122083 ± 5% perf-stat.ps.node-loads
106728 ± 4% +29.1% 137740 ± 7% perf-stat.ps.node-stores
1.117e+12 -3.0% 1.084e+12 perf-stat.total.instructions
0.00 +0.8 0.75 ± 12% perf-profile.calltrace.cycles-pp.__call_rcu_common.xas_store.__xa_erase.xa_erase.simple_offset_remove
0.00 +0.8 0.78 ± 34% perf-profile.calltrace.cycles-pp.___slab_alloc.kmem_cache_alloc_lru.xas_alloc.xas_create.xas_store
0.00 +0.8 0.83 ± 29% perf-profile.calltrace.cycles-pp.allocate_slab.___slab_alloc.kmem_cache_alloc_lru.xas_alloc.xas_expand
0.00 +0.9 0.92 ± 26% perf-profile.calltrace.cycles-pp.___slab_alloc.kmem_cache_alloc_lru.xas_alloc.xas_expand.xas_create
0.00 +1.0 0.99 ± 27% perf-profile.calltrace.cycles-pp.shuffle_freelist.allocate_slab.___slab_alloc.kmem_cache_alloc_lru.xas_alloc
0.00 +1.0 1.04 ± 28% perf-profile.calltrace.cycles-pp.kmem_cache_alloc_lru.xas_alloc.xas_create.xas_store.__xa_alloc
0.00 +1.1 1.11 ± 26% perf-profile.calltrace.cycles-pp.xas_alloc.xas_create.xas_store.__xa_alloc.__xa_alloc_cyclic
1.51 ± 24% +1.2 2.73 ± 10% perf-profile.calltrace.cycles-pp.vfs_unlink.do_unlinkat.__x64_sys_unlink.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.00 +1.2 1.24 ± 20% perf-profile.calltrace.cycles-pp.kmem_cache_alloc_lru.xas_alloc.xas_expand.xas_create.xas_store
0.00 +1.3 1.27 ± 10% perf-profile.calltrace.cycles-pp.xas_store.__xa_erase.xa_erase.simple_offset_remove.shmem_unlink
0.00 +1.3 1.30 ± 10% perf-profile.calltrace.cycles-pp.__xa_erase.xa_erase.simple_offset_remove.shmem_unlink.vfs_unlink
0.00 +1.3 1.33 ± 19% perf-profile.calltrace.cycles-pp.xas_alloc.xas_expand.xas_create.xas_store.__xa_alloc
0.00 +1.4 1.36 ± 10% perf-profile.calltrace.cycles-pp.xa_erase.simple_offset_remove.shmem_unlink.vfs_unlink.do_unlinkat
0.00 +1.4 1.37 ± 10% perf-profile.calltrace.cycles-pp.simple_offset_remove.shmem_unlink.vfs_unlink.do_unlinkat.__x64_sys_unlink
0.00 +1.5 1.51 ± 17% perf-profile.calltrace.cycles-pp.xas_expand.xas_create.xas_store.__xa_alloc.__xa_alloc_cyclic
0.00 +1.6 1.62 ± 12% perf-profile.calltrace.cycles-pp.shmem_unlink.vfs_unlink.do_unlinkat.__x64_sys_unlink.do_syscall_64
0.00 +2.8 2.80 ± 13% perf-profile.calltrace.cycles-pp.xas_create.xas_store.__xa_alloc.__xa_alloc_cyclic.simple_offset_add
0.00 +2.9 2.94 ± 13% perf-profile.calltrace.cycles-pp.xas_store.__xa_alloc.__xa_alloc_cyclic.simple_offset_add.shmem_mknod
5.38 ± 24% +3.1 8.51 ± 11% perf-profile.calltrace.cycles-pp.lookup_open.open_last_lookups.path_openat.do_filp_open.do_sys_openat2
6.08 ± 24% +3.2 9.24 ± 12% perf-profile.calltrace.cycles-pp.open_last_lookups.path_openat.do_filp_open.do_sys_openat2.__x64_sys_creat
0.00 +3.2 3.20 ± 13% perf-profile.calltrace.cycles-pp.__xa_alloc.__xa_alloc_cyclic.simple_offset_add.shmem_mknod.lookup_open
0.00 +3.2 3.24 ± 13% perf-profile.calltrace.cycles-pp.__xa_alloc_cyclic.simple_offset_add.shmem_mknod.lookup_open.open_last_lookups
0.00 +3.4 3.36 ± 14% perf-profile.calltrace.cycles-pp.simple_offset_add.shmem_mknod.lookup_open.open_last_lookups.path_openat
2.78 ± 25% +3.4 6.17 ± 12% perf-profile.calltrace.cycles-pp.shmem_mknod.lookup_open.open_last_lookups.path_openat.do_filp_open
0.16 ± 30% -0.1 0.08 ± 20% perf-profile.children.cycles-pp.map_id_up
0.02 ±146% +0.1 0.08 ± 13% perf-profile.children.cycles-pp.shmem_is_huge
0.02 ±141% +0.1 0.09 ± 16% perf-profile.children.cycles-pp.__list_del_entry_valid
0.00 +0.1 0.08 ± 11% perf-profile.children.cycles-pp.free_unref_page
0.00 +0.1 0.08 ± 13% perf-profile.children.cycles-pp.shmem_destroy_inode
0.04 ±101% +0.1 0.14 ± 25% perf-profile.children.cycles-pp.rcu_nocb_try_bypass
0.00 +0.1 0.12 ± 27% perf-profile.children.cycles-pp.xas_find_marked
0.02 ±144% +0.1 0.16 ± 14% perf-profile.children.cycles-pp.__unfreeze_partials
0.03 ±106% +0.2 0.19 ± 26% perf-profile.children.cycles-pp.xas_descend
0.01 ±223% +0.2 0.17 ± 15% perf-profile.children.cycles-pp.get_page_from_freelist
0.11 ± 22% +0.2 0.29 ± 16% perf-profile.children.cycles-pp.rcu_segcblist_enqueue
0.02 ±146% +0.2 0.24 ± 13% perf-profile.children.cycles-pp.__alloc_pages
0.36 ± 79% +0.6 0.98 ± 15% perf-profile.children.cycles-pp.__slab_free
0.50 ± 26% +0.7 1.23 ± 14% perf-profile.children.cycles-pp.__call_rcu_common
0.00 +0.8 0.82 ± 13% perf-profile.children.cycles-pp.radix_tree_node_rcu_free
0.00 +1.1 1.14 ± 17% perf-profile.children.cycles-pp.radix_tree_node_ctor
0.16 ± 86% +1.2 1.38 ± 16% perf-profile.children.cycles-pp.setup_object
1.52 ± 25% +1.2 2.75 ± 10% perf-profile.children.cycles-pp.vfs_unlink
0.36 ± 22% +1.3 1.63 ± 12% perf-profile.children.cycles-pp.shmem_unlink
0.00 +1.3 1.30 ± 10% perf-profile.children.cycles-pp.__xa_erase
0.20 ± 79% +1.3 1.53 ± 15% perf-profile.children.cycles-pp.shuffle_freelist
0.00 +1.4 1.36 ± 10% perf-profile.children.cycles-pp.xa_erase
0.00 +1.4 1.38 ± 10% perf-profile.children.cycles-pp.simple_offset_remove
0.00 +1.5 1.51 ± 17% perf-profile.children.cycles-pp.xas_expand
0.26 ± 78% +1.6 1.87 ± 13% perf-profile.children.cycles-pp.allocate_slab
0.40 ± 49% +1.7 2.10 ± 13% perf-profile.children.cycles-pp.___slab_alloc
1.30 ± 85% +2.1 3.42 ± 12% perf-profile.children.cycles-pp.rcu_do_batch
1.56 ± 27% +2.4 3.93 ± 11% perf-profile.children.cycles-pp.kmem_cache_alloc_lru
0.00 +2.4 2.44 ± 12% perf-profile.children.cycles-pp.xas_alloc
2.66 ± 13% +2.5 5.14 ± 5% perf-profile.children.cycles-pp.__irq_exit_rcu
11.16 ± 10% +2.7 13.88 ± 8% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
11.77 ± 10% +2.7 14.49 ± 8% perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
0.00 +2.8 2.82 ± 13% perf-profile.children.cycles-pp.xas_create
5.40 ± 24% +3.1 8.52 ± 11% perf-profile.children.cycles-pp.lookup_open
6.12 ± 24% +3.1 9.27 ± 12% perf-profile.children.cycles-pp.open_last_lookups
0.00 +3.2 3.22 ± 13% perf-profile.children.cycles-pp.__xa_alloc
0.00 +3.2 3.24 ± 13% perf-profile.children.cycles-pp.__xa_alloc_cyclic
0.00 +3.4 3.36 ± 14% perf-profile.children.cycles-pp.simple_offset_add
2.78 ± 25% +3.4 6.18 ± 12% perf-profile.children.cycles-pp.shmem_mknod
0.00 +4.2 4.24 ± 12% perf-profile.children.cycles-pp.xas_store
0.14 ± 27% -0.1 0.08 ± 21% perf-profile.self.cycles-pp.map_id_up
0.00 +0.1 0.06 ± 24% perf-profile.self.cycles-pp.shmem_destroy_inode
0.00 +0.1 0.07 ± 8% perf-profile.self.cycles-pp.__xa_alloc
0.02 ±146% +0.1 0.11 ± 28% perf-profile.self.cycles-pp.rcu_nocb_try_bypass
0.01 ±223% +0.1 0.10 ± 28% perf-profile.self.cycles-pp.shuffle_freelist
0.00 +0.1 0.11 ± 40% perf-profile.self.cycles-pp.xas_create
0.00 +0.1 0.12 ± 27% perf-profile.self.cycles-pp.xas_find_marked
0.00 +0.1 0.14 ± 18% perf-profile.self.cycles-pp.xas_alloc
0.03 ±103% +0.1 0.17 ± 29% perf-profile.self.cycles-pp.xas_descend
0.00 +0.2 0.16 ± 23% perf-profile.self.cycles-pp.xas_expand
0.10 ± 22% +0.2 0.27 ± 16% perf-profile.self.cycles-pp.rcu_segcblist_enqueue
0.00 +0.4 0.36 ± 16% perf-profile.self.cycles-pp.xas_store
0.32 ± 30% +0.4 0.71 ± 12% perf-profile.self.cycles-pp.__call_rcu_common
0.18 ± 27% +0.5 0.65 ± 8% perf-profile.self.cycles-pp.kmem_cache_alloc_lru
0.36 ± 79% +0.6 0.96 ± 15% perf-profile.self.cycles-pp.__slab_free
0.00 +0.8 0.80 ± 14% perf-profile.self.cycles-pp.radix_tree_node_rcu_free
0.00 +1.0 1.01 ± 16% perf-profile.self.cycles-pp.radix_tree_node_ctor
***************************************************************************************************
lkp-ivb-2ep1: 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 112G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
gcc-12/performance/x86_64-rhel-8.3/debian-11.1-x86_64-20220510.cgz/lkp-ivb-2ep1/all/aim9/5s
commit:
23a31d8764 ("shmem: Refactor shmem_symlink()")
a2e459555c ("shmem: stable directory offsets")
23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
---------------- ---------------------------
%stddev %change %stddev
\ | \
9781285 +2.0% 9975309 proc-vmstat.pgalloc_normal
4481052 -1.6% 4408359 proc-vmstat.pgfault
9749965 +2.0% 9942285 proc-vmstat.pgfree
14556 -1.6% 14324 perf-stat.i.minor-faults
14556 -1.6% 14324 perf-stat.i.page-faults
14505 -1.6% 14272 perf-stat.ps.minor-faults
14505 -1.6% 14272 perf-stat.ps.page-faults
849714 -3.6% 819341 aim9.brk_test.ops_per_sec
478138 +3.1% 492806 aim9.dgram_pipe.ops_per_sec
199087 -14.6% 170071 aim9.disk_src.ops_per_sec
286595 -9.7% 258794 aim9.link_test.ops_per_sec
303603 -2.8% 295009 aim9.page_test.ops_per_sec
3692190 -1.7% 3629732 aim9.time.minor_page_faults
0.00 +1.0 0.95 ± 25% perf-profile.calltrace.cycles-pp.xas_create.xas_store.__xa_alloc.__xa_alloc_cyclic.simple_offset_add
0.00 +1.0 1.01 ± 23% perf-profile.calltrace.cycles-pp.xas_store.__xa_alloc.__xa_alloc_cyclic.simple_offset_add.shmem_mknod
1.54 ± 22% +1.1 2.61 ± 22% perf-profile.calltrace.cycles-pp.shmem_mknod.lookup_open.open_last_lookups.path_openat.do_filp_open
0.00 +1.2 1.15 ± 21% perf-profile.calltrace.cycles-pp.__xa_alloc.__xa_alloc_cyclic.simple_offset_add.shmem_mknod.lookup_open
0.00 +1.2 1.18 ± 21% perf-profile.calltrace.cycles-pp.__xa_alloc_cyclic.simple_offset_add.shmem_mknod.lookup_open.open_last_lookups
0.00 +1.2 1.22 ± 21% perf-profile.calltrace.cycles-pp.simple_offset_add.shmem_mknod.lookup_open.open_last_lookups.path_openat
0.28 ± 21% +0.2 0.45 ± 24% perf-profile.children.cycles-pp.__call_rcu_common
0.00 +0.3 0.26 ± 43% perf-profile.children.cycles-pp.radix_tree_node_rcu_free
0.14 ± 46% +0.3 0.45 ± 20% perf-profile.children.cycles-pp.setup_object
0.00 +0.3 0.33 ± 24% perf-profile.children.cycles-pp.radix_tree_node_ctor
0.16 ± 49% +0.4 0.52 ± 24% perf-profile.children.cycles-pp.shuffle_freelist
0.23 ± 43% +0.4 0.63 ± 23% perf-profile.children.cycles-pp.allocate_slab
0.30 ± 35% +0.4 0.74 ± 24% perf-profile.children.cycles-pp.___slab_alloc
0.17 ± 25% +0.5 0.66 ± 23% perf-profile.children.cycles-pp.shmem_unlink
0.00 +0.5 0.49 ± 24% perf-profile.children.cycles-pp.__xa_erase
0.00 +0.5 0.52 ± 24% perf-profile.children.cycles-pp.xa_erase
0.00 +0.5 0.52 ± 64% perf-profile.children.cycles-pp.xas_expand
0.00 +0.5 0.53 ± 24% perf-profile.children.cycles-pp.simple_offset_remove
0.87 ± 26% +0.7 1.56 ± 23% perf-profile.children.cycles-pp.kmem_cache_alloc_lru
2.44 ± 12% +0.8 3.25 ± 13% perf-profile.children.cycles-pp.__irq_exit_rcu
0.00 +0.8 0.82 ± 24% perf-profile.children.cycles-pp.xas_alloc
0.01 ±230% +1.0 0.99 ± 23% perf-profile.children.cycles-pp.xas_create
1.55 ± 22% +1.1 2.63 ± 22% perf-profile.children.cycles-pp.shmem_mknod
0.00 +1.2 1.16 ± 21% perf-profile.children.cycles-pp.__xa_alloc
0.00 +1.2 1.18 ± 21% perf-profile.children.cycles-pp.__xa_alloc_cyclic
0.00 +1.2 1.22 ± 21% perf-profile.children.cycles-pp.simple_offset_add
0.18 ± 28% +1.5 1.65 ± 21% perf-profile.children.cycles-pp.xas_store
0.11 ± 31% +0.1 0.25 ± 27% perf-profile.self.cycles-pp.xas_store
0.11 ± 31% +0.2 0.28 ± 24% perf-profile.self.cycles-pp.kmem_cache_alloc_lru
0.00 +0.3 0.26 ± 44% perf-profile.self.cycles-pp.radix_tree_node_rcu_free
0.00 +0.3 0.29 ± 23% perf-profile.self.cycles-pp.radix_tree_node_ctor
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression
2023-09-08 5:26 [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression kernel test robot
@ 2023-09-08 14:43 ` Chuck Lever III
2023-09-12 1:25 ` Oliver Sang
From: Chuck Lever III @ 2023-09-08 14:43 UTC (permalink / raw)
To: kernel test robot
Cc: oe-lkp, kernel test robot, Linux Kernel Mailing List,
Christian Brauner, linux-mm, ying.huang, feng.tang, fengwei.yin
> On Sep 8, 2023, at 1:26 AM, kernel test robot <oliver.sang@intel.com> wrote:
>
>
>
> Hello,
>
> kernel test robot noticed a -19.0% regression of aim9.disk_src.ops_per_sec on:
>
>
> commit: a2e459555c5f9da3e619b7e47a63f98574dc75f1 ("shmem: stable directory offsets")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> testcase: aim9
> test machine: 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 112G memory
> parameters:
>
> testtime: 300s
> test: disk_src
> cpufreq_governor: performance
>
>
> In addition to that, the commit also has significant impact on the following tests:
>
> +------------------+-------------------------------------------------------------------------------------------------+
> | testcase: change | aim9: aim9.disk_src.ops_per_sec -14.6% regression |
> | test machine | 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 112G memory |
> | test parameters | cpufreq_governor=performance |
> | | test=all |
> | | testtime=5s |
> +------------------+-------------------------------------------------------------------------------------------------+
>
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202309081306.3ecb3734-oliver.sang@intel.com
Hi, several weeks ago we requested that these tests be run
again by the robot because they can't be run in environments
I have available to me (the tests do not run on Fedora, and
I don't have any big iron).
We wanted the tests rerun before the patch was committed.
There was a deafening silence. So I assumed the work I did
then to address the regression was successful, and the
patches are now in upstream Linux.
This new report is disappointing.
But, I'm still in a position where I can't run this test,
and the results don't really indicate where the problem
is. So I can't possibly address this issue.
Any suggestions, advice, or help would be appreciated.
> Details are as below:
> -------------------------------------------------------------------------------------------------->
>
>
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20230908/202309081306.3ecb3734-oliver.sang@intel.com
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
> gcc-12/performance/x86_64-rhel-8.3/debian-11.1-x86_64-20220510.cgz/lkp-ivb-2ep1/disk_src/aim9/300s
>
> commit:
> 23a31d8764 ("shmem: Refactor shmem_symlink()")
> a2e459555c ("shmem: stable directory offsets")
>
> 23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
> ---------------- ---------------------------
> %stddev %change %stddev
> \ | \
> 0.26 ± 9% +0.1 0.36 ± 2% mpstat.cpu.all.soft%
> 0.61 -0.1 0.52 mpstat.cpu.all.usr%
> 0.16 ± 10% -18.9% 0.13 ± 12% perf-sched.sch_delay.max.ms.pipe_read.vfs_read.ksys_read.do_syscall_64
> 0.04 ± 7% +1802.4% 0.78 ±115% perf-sched.sch_delay.max.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
> 202424 -19.0% 163868 aim9.disk_src.ops_per_sec
> 94.83 -4.2% 90.83 aim9.time.percent_of_cpu_this_job_got
> 73.62 -17.6% 60.69 aim9.time.user_time
> 23541 +6.5% 25074 proc-vmstat.nr_slab_reclaimable
> 1437319 ± 24% +377.6% 6864201 proc-vmstat.numa_hit
> 1387016 ± 25% +391.4% 6815486 proc-vmstat.numa_local
> 4864362 ± 34% +453.6% 26931180 proc-vmstat.pgalloc_normal
> 4835960 ± 34% +455.4% 26856610 proc-vmstat.pgfree
> 538959 ± 24% -23.2% 414090 sched_debug.cfs_rq:/.load.max
> 130191 ± 14% -13.3% 112846 ± 6% sched_debug.cfs_rq:/.load.stddev
> 116849 ± 27% -51.2% 56995 ± 20% sched_debug.cfs_rq:/.min_vruntime.max
> 1223 ±191% -897.4% -9754 sched_debug.cfs_rq:/.spread0.avg
> 107969 ± 29% -65.3% 37448 ± 39% sched_debug.cfs_rq:/.spread0.max
> 55209 ± 14% -21.8% 43154 ± 14% sched_debug.cpu.nr_switches.max
> 11.21 +23.7% 13.87 perf-stat.i.MPKI
> 7.223e+08 -4.4% 6.907e+08 perf-stat.i.branch-instructions
> 2.67 +0.2 2.88 perf-stat.i.branch-miss-rate%
> 19988363 +2.8% 20539702 perf-stat.i.branch-misses
> 17.36 -2.8 14.59 perf-stat.i.cache-miss-rate%
> 40733859 +19.5% 48659982 perf-stat.i.cache-references
> 1.76 +3.5% 1.82 perf-stat.i.cpi
> 55.21 +5.4% 58.21 ± 2% perf-stat.i.cpu-migrations
> 1.01e+09 -3.8% 9.719e+08 perf-stat.i.dTLB-loads
> 0.26 ± 4% -0.0 0.23 ± 3% perf-stat.i.dTLB-store-miss-rate%
> 2166022 ± 4% -6.9% 2015917 ± 3% perf-stat.i.dTLB-store-misses
> 8.503e+08 +5.5% 8.968e+08 perf-stat.i.dTLB-stores
> 69.22 ± 4% +6.4 75.60 perf-stat.i.iTLB-load-miss-rate%
> 316455 ± 12% -31.6% 216531 ± 3% perf-stat.i.iTLB-loads
> 3.722e+09 -3.1% 3.608e+09 perf-stat.i.instructions
> 0.57 -3.3% 0.55 perf-stat.i.ipc
> 865.04 -10.4% 775.02 ± 3% perf-stat.i.metric.K/sec
> 47.51 -2.1 45.37 perf-stat.i.node-load-miss-rate%
> 106705 ± 3% +14.8% 122490 ± 5% perf-stat.i.node-loads
> 107169 ± 4% +29.0% 138208 ± 7% perf-stat.i.node-stores
> 10.94 +23.3% 13.49 perf-stat.overall.MPKI
> 2.77 +0.2 2.97 perf-stat.overall.branch-miss-rate%
> 17.28 -2.7 14.56 perf-stat.overall.cache-miss-rate%
> 1.73 +3.4% 1.79 perf-stat.overall.cpi
> 0.25 ± 4% -0.0 0.22 ± 3% perf-stat.overall.dTLB-store-miss-rate%
> 69.20 ± 4% +6.4 75.60 perf-stat.overall.iTLB-load-miss-rate%
> 0.58 -3.2% 0.56 perf-stat.overall.ipc
> 45.25 -2.2 43.10 perf-stat.overall.node-load-miss-rate%
> 7.199e+08 -4.4% 6.883e+08 perf-stat.ps.branch-instructions
> 19919808 +2.8% 20469001 perf-stat.ps.branch-misses
> 40597326 +19.5% 48497201 perf-stat.ps.cache-references
> 55.06 +5.4% 58.03 ± 2% perf-stat.ps.cpu-migrations
> 1.007e+09 -3.8% 9.686e+08 perf-stat.ps.dTLB-loads
> 2158768 ± 4% -6.9% 2009174 ± 3% perf-stat.ps.dTLB-store-misses
> 8.475e+08 +5.5% 8.937e+08 perf-stat.ps.dTLB-stores
> 315394 ± 12% -31.6% 215816 ± 3% perf-stat.ps.iTLB-loads
> 3.71e+09 -3.1% 3.595e+09 perf-stat.ps.instructions
> 106351 ± 3% +14.8% 122083 ± 5% perf-stat.ps.node-loads
> 106728 ± 4% +29.1% 137740 ± 7% perf-stat.ps.node-stores
> 1.117e+12 -3.0% 1.084e+12 perf-stat.total.instructions
> 0.00 +0.8 0.75 ± 12% perf-profile.calltrace.cycles-pp.__call_rcu_common.xas_store.__xa_erase.xa_erase.simple_offset_remove
> 0.00 +0.8 0.78 ± 34% perf-profile.calltrace.cycles-pp.___slab_alloc.kmem_cache_alloc_lru.xas_alloc.xas_create.xas_store
> 0.00 +0.8 0.83 ± 29% perf-profile.calltrace.cycles-pp.allocate_slab.___slab_alloc.kmem_cache_alloc_lru.xas_alloc.xas_expand
> 0.00 +0.9 0.92 ± 26% perf-profile.calltrace.cycles-pp.___slab_alloc.kmem_cache_alloc_lru.xas_alloc.xas_expand.xas_create
> 0.00 +1.0 0.99 ± 27% perf-profile.calltrace.cycles-pp.shuffle_freelist.allocate_slab.___slab_alloc.kmem_cache_alloc_lru.xas_alloc
> 0.00 +1.0 1.04 ± 28% perf-profile.calltrace.cycles-pp.kmem_cache_alloc_lru.xas_alloc.xas_create.xas_store.__xa_alloc
> 0.00 +1.1 1.11 ± 26% perf-profile.calltrace.cycles-pp.xas_alloc.xas_create.xas_store.__xa_alloc.__xa_alloc_cyclic
> 1.51 ± 24% +1.2 2.73 ± 10% perf-profile.calltrace.cycles-pp.vfs_unlink.do_unlinkat.__x64_sys_unlink.do_syscall_64.entry_SYSCALL_64_after_hwframe
> 0.00 +1.2 1.24 ± 20% perf-profile.calltrace.cycles-pp.kmem_cache_alloc_lru.xas_alloc.xas_expand.xas_create.xas_store
> 0.00 +1.3 1.27 ± 10% perf-profile.calltrace.cycles-pp.xas_store.__xa_erase.xa_erase.simple_offset_remove.shmem_unlink
> 0.00 +1.3 1.30 ± 10% perf-profile.calltrace.cycles-pp.__xa_erase.xa_erase.simple_offset_remove.shmem_unlink.vfs_unlink
> 0.00 +1.3 1.33 ± 19% perf-profile.calltrace.cycles-pp.xas_alloc.xas_expand.xas_create.xas_store.__xa_alloc
> 0.00 +1.4 1.36 ± 10% perf-profile.calltrace.cycles-pp.xa_erase.simple_offset_remove.shmem_unlink.vfs_unlink.do_unlinkat
> 0.00 +1.4 1.37 ± 10% perf-profile.calltrace.cycles-pp.simple_offset_remove.shmem_unlink.vfs_unlink.do_unlinkat.__x64_sys_unlink
> 0.00 +1.5 1.51 ± 17% perf-profile.calltrace.cycles-pp.xas_expand.xas_create.xas_store.__xa_alloc.__xa_alloc_cyclic
> 0.00 +1.6 1.62 ± 12% perf-profile.calltrace.cycles-pp.shmem_unlink.vfs_unlink.do_unlinkat.__x64_sys_unlink.do_syscall_64
> 0.00 +2.8 2.80 ± 13% perf-profile.calltrace.cycles-pp.xas_create.xas_store.__xa_alloc.__xa_alloc_cyclic.simple_offset_add
> 0.00 +2.9 2.94 ± 13% perf-profile.calltrace.cycles-pp.xas_store.__xa_alloc.__xa_alloc_cyclic.simple_offset_add.shmem_mknod
> 5.38 ± 24% +3.1 8.51 ± 11% perf-profile.calltrace.cycles-pp.lookup_open.open_last_lookups.path_openat.do_filp_open.do_sys_openat2
> 6.08 ± 24% +3.2 9.24 ± 12% perf-profile.calltrace.cycles-pp.open_last_lookups.path_openat.do_filp_open.do_sys_openat2.__x64_sys_creat
> 0.00 +3.2 3.20 ± 13% perf-profile.calltrace.cycles-pp.__xa_alloc.__xa_alloc_cyclic.simple_offset_add.shmem_mknod.lookup_open
> 0.00 +3.2 3.24 ± 13% perf-profile.calltrace.cycles-pp.__xa_alloc_cyclic.simple_offset_add.shmem_mknod.lookup_open.open_last_lookups
> 0.00 +3.4 3.36 ± 14% perf-profile.calltrace.cycles-pp.simple_offset_add.shmem_mknod.lookup_open.open_last_lookups.path_openat
> 2.78 ± 25% +3.4 6.17 ± 12% perf-profile.calltrace.cycles-pp.shmem_mknod.lookup_open.open_last_lookups.path_openat.do_filp_open
> 0.16 ± 30% -0.1 0.08 ± 20% perf-profile.children.cycles-pp.map_id_up
> 0.02 ±146% +0.1 0.08 ± 13% perf-profile.children.cycles-pp.shmem_is_huge
> 0.02 ±141% +0.1 0.09 ± 16% perf-profile.children.cycles-pp.__list_del_entry_valid
> 0.00 +0.1 0.08 ± 11% perf-profile.children.cycles-pp.free_unref_page
> 0.00 +0.1 0.08 ± 13% perf-profile.children.cycles-pp.shmem_destroy_inode
> 0.04 ±101% +0.1 0.14 ± 25% perf-profile.children.cycles-pp.rcu_nocb_try_bypass
> 0.00 +0.1 0.12 ± 27% perf-profile.children.cycles-pp.xas_find_marked
> 0.02 ±144% +0.1 0.16 ± 14% perf-profile.children.cycles-pp.__unfreeze_partials
> 0.03 ±106% +0.2 0.19 ± 26% perf-profile.children.cycles-pp.xas_descend
> 0.01 ±223% +0.2 0.17 ± 15% perf-profile.children.cycles-pp.get_page_from_freelist
> 0.11 ± 22% +0.2 0.29 ± 16% perf-profile.children.cycles-pp.rcu_segcblist_enqueue
> 0.02 ±146% +0.2 0.24 ± 13% perf-profile.children.cycles-pp.__alloc_pages
> 0.36 ± 79% +0.6 0.98 ± 15% perf-profile.children.cycles-pp.__slab_free
> 0.50 ± 26% +0.7 1.23 ± 14% perf-profile.children.cycles-pp.__call_rcu_common
> 0.00 +0.8 0.82 ± 13% perf-profile.children.cycles-pp.radix_tree_node_rcu_free
> 0.00 +1.1 1.14 ± 17% perf-profile.children.cycles-pp.radix_tree_node_ctor
> 0.16 ± 86% +1.2 1.38 ± 16% perf-profile.children.cycles-pp.setup_object
> 1.52 ± 25% +1.2 2.75 ± 10% perf-profile.children.cycles-pp.vfs_unlink
> 0.36 ± 22% +1.3 1.63 ± 12% perf-profile.children.cycles-pp.shmem_unlink
> 0.00 +1.3 1.30 ± 10% perf-profile.children.cycles-pp.__xa_erase
> 0.20 ± 79% +1.3 1.53 ± 15% perf-profile.children.cycles-pp.shuffle_freelist
> 0.00 +1.4 1.36 ± 10% perf-profile.children.cycles-pp.xa_erase
> 0.00 +1.4 1.38 ± 10% perf-profile.children.cycles-pp.simple_offset_remove
> 0.00 +1.5 1.51 ± 17% perf-profile.children.cycles-pp.xas_expand
> 0.26 ± 78% +1.6 1.87 ± 13% perf-profile.children.cycles-pp.allocate_slab
> 0.40 ± 49% +1.7 2.10 ± 13% perf-profile.children.cycles-pp.___slab_alloc
> 1.30 ± 85% +2.1 3.42 ± 12% perf-profile.children.cycles-pp.rcu_do_batch
> 1.56 ± 27% +2.4 3.93 ± 11% perf-profile.children.cycles-pp.kmem_cache_alloc_lru
> 0.00 +2.4 2.44 ± 12% perf-profile.children.cycles-pp.xas_alloc
> 2.66 ± 13% +2.5 5.14 ± 5% perf-profile.children.cycles-pp.__irq_exit_rcu
> 11.16 ± 10% +2.7 13.88 ± 8% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
> 11.77 ± 10% +2.7 14.49 ± 8% perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
> 0.00 +2.8 2.82 ± 13% perf-profile.children.cycles-pp.xas_create
> 5.40 ± 24% +3.1 8.52 ± 11% perf-profile.children.cycles-pp.lookup_open
> 6.12 ± 24% +3.1 9.27 ± 12% perf-profile.children.cycles-pp.open_last_lookups
> 0.00 +3.2 3.22 ± 13% perf-profile.children.cycles-pp.__xa_alloc
> 0.00 +3.2 3.24 ± 13% perf-profile.children.cycles-pp.__xa_alloc_cyclic
> 0.00 +3.4 3.36 ± 14% perf-profile.children.cycles-pp.simple_offset_add
> 2.78 ± 25% +3.4 6.18 ± 12% perf-profile.children.cycles-pp.shmem_mknod
> 0.00 +4.2 4.24 ± 12% perf-profile.children.cycles-pp.xas_store
> 0.14 ± 27% -0.1 0.08 ± 21% perf-profile.self.cycles-pp.map_id_up
> 0.00 +0.1 0.06 ± 24% perf-profile.self.cycles-pp.shmem_destroy_inode
> 0.00 +0.1 0.07 ± 8% perf-profile.self.cycles-pp.__xa_alloc
> 0.02 ±146% +0.1 0.11 ± 28% perf-profile.self.cycles-pp.rcu_nocb_try_bypass
> 0.01 ±223% +0.1 0.10 ± 28% perf-profile.self.cycles-pp.shuffle_freelist
> 0.00 +0.1 0.11 ± 40% perf-profile.self.cycles-pp.xas_create
> 0.00 +0.1 0.12 ± 27% perf-profile.self.cycles-pp.xas_find_marked
> 0.00 +0.1 0.14 ± 18% perf-profile.self.cycles-pp.xas_alloc
> 0.03 ±103% +0.1 0.17 ± 29% perf-profile.self.cycles-pp.xas_descend
> 0.00 +0.2 0.16 ± 23% perf-profile.self.cycles-pp.xas_expand
> 0.10 ± 22% +0.2 0.27 ± 16% perf-profile.self.cycles-pp.rcu_segcblist_enqueue
> 0.00 +0.4 0.36 ± 16% perf-profile.self.cycles-pp.xas_store
> 0.32 ± 30% +0.4 0.71 ± 12% perf-profile.self.cycles-pp.__call_rcu_common
> 0.18 ± 27% +0.5 0.65 ± 8% perf-profile.self.cycles-pp.kmem_cache_alloc_lru
> 0.36 ± 79% +0.6 0.96 ± 15% perf-profile.self.cycles-pp.__slab_free
> 0.00 +0.8 0.80 ± 14% perf-profile.self.cycles-pp.radix_tree_node_rcu_free
> 0.00 +1.0 1.01 ± 16% perf-profile.self.cycles-pp.radix_tree_node_ctor
>
>
> ***************************************************************************************************
> lkp-ivb-2ep1: 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 112G memory
> =========================================================================================
> compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
> gcc-12/performance/x86_64-rhel-8.3/debian-11.1-x86_64-20220510.cgz/lkp-ivb-2ep1/all/aim9/5s
>
> commit:
> 23a31d8764 ("shmem: Refactor shmem_symlink()")
> a2e459555c ("shmem: stable directory offsets")
>
> 23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
> ---------------- ---------------------------
> %stddev %change %stddev
> \ | \
> 9781285 +2.0% 9975309 proc-vmstat.pgalloc_normal
> 4481052 -1.6% 4408359 proc-vmstat.pgfault
> 9749965 +2.0% 9942285 proc-vmstat.pgfree
> 14556 -1.6% 14324 perf-stat.i.minor-faults
> 14556 -1.6% 14324 perf-stat.i.page-faults
> 14505 -1.6% 14272 perf-stat.ps.minor-faults
> 14505 -1.6% 14272 perf-stat.ps.page-faults
> 849714 -3.6% 819341 aim9.brk_test.ops_per_sec
> 478138 +3.1% 492806 aim9.dgram_pipe.ops_per_sec
> 199087 -14.6% 170071 aim9.disk_src.ops_per_sec
> 286595 -9.7% 258794 aim9.link_test.ops_per_sec
> 303603 -2.8% 295009 aim9.page_test.ops_per_sec
> 3692190 -1.7% 3629732 aim9.time.minor_page_faults
> 0.00 +1.0 0.95 ± 25% perf-profile.calltrace.cycles-pp.xas_create.xas_store.__xa_alloc.__xa_alloc_cyclic.simple_offset_add
> 0.00 +1.0 1.01 ± 23% perf-profile.calltrace.cycles-pp.xas_store.__xa_alloc.__xa_alloc_cyclic.simple_offset_add.shmem_mknod
> 1.54 ± 22% +1.1 2.61 ± 22% perf-profile.calltrace.cycles-pp.shmem_mknod.lookup_open.open_last_lookups.path_openat.do_filp_open
> 0.00 +1.2 1.15 ± 21% perf-profile.calltrace.cycles-pp.__xa_alloc.__xa_alloc_cyclic.simple_offset_add.shmem_mknod.lookup_open
> 0.00 +1.2 1.18 ± 21% perf-profile.calltrace.cycles-pp.__xa_alloc_cyclic.simple_offset_add.shmem_mknod.lookup_open.open_last_lookups
> 0.00 +1.2 1.22 ± 21% perf-profile.calltrace.cycles-pp.simple_offset_add.shmem_mknod.lookup_open.open_last_lookups.path_openat
> 0.28 ± 21% +0.2 0.45 ± 24% perf-profile.children.cycles-pp.__call_rcu_common
> 0.00 +0.3 0.26 ± 43% perf-profile.children.cycles-pp.radix_tree_node_rcu_free
> 0.14 ± 46% +0.3 0.45 ± 20% perf-profile.children.cycles-pp.setup_object
> 0.00 +0.3 0.33 ± 24% perf-profile.children.cycles-pp.radix_tree_node_ctor
> 0.16 ± 49% +0.4 0.52 ± 24% perf-profile.children.cycles-pp.shuffle_freelist
> 0.23 ± 43% +0.4 0.63 ± 23% perf-profile.children.cycles-pp.allocate_slab
> 0.30 ± 35% +0.4 0.74 ± 24% perf-profile.children.cycles-pp.___slab_alloc
> 0.17 ± 25% +0.5 0.66 ± 23% perf-profile.children.cycles-pp.shmem_unlink
> 0.00 +0.5 0.49 ± 24% perf-profile.children.cycles-pp.__xa_erase
> 0.00 +0.5 0.52 ± 24% perf-profile.children.cycles-pp.xa_erase
> 0.00 +0.5 0.52 ± 64% perf-profile.children.cycles-pp.xas_expand
> 0.00 +0.5 0.53 ± 24% perf-profile.children.cycles-pp.simple_offset_remove
> 0.87 ± 26% +0.7 1.56 ± 23% perf-profile.children.cycles-pp.kmem_cache_alloc_lru
> 2.44 ± 12% +0.8 3.25 ± 13% perf-profile.children.cycles-pp.__irq_exit_rcu
> 0.00 +0.8 0.82 ± 24% perf-profile.children.cycles-pp.xas_alloc
> 0.01 ±230% +1.0 0.99 ± 23% perf-profile.children.cycles-pp.xas_create
> 1.55 ± 22% +1.1 2.63 ± 22% perf-profile.children.cycles-pp.shmem_mknod
> 0.00 +1.2 1.16 ± 21% perf-profile.children.cycles-pp.__xa_alloc
> 0.00 +1.2 1.18 ± 21% perf-profile.children.cycles-pp.__xa_alloc_cyclic
> 0.00 +1.2 1.22 ± 21% perf-profile.children.cycles-pp.simple_offset_add
> 0.18 ± 28% +1.5 1.65 ± 21% perf-profile.children.cycles-pp.xas_store
> 0.11 ± 31% +0.1 0.25 ± 27% perf-profile.self.cycles-pp.xas_store
> 0.11 ± 31% +0.2 0.28 ± 24% perf-profile.self.cycles-pp.kmem_cache_alloc_lru
> 0.00 +0.3 0.26 ± 44% perf-profile.self.cycles-pp.radix_tree_node_rcu_free
> 0.00 +0.3 0.29 ± 23% perf-profile.self.cycles-pp.radix_tree_node_ctor
>
>
>
> Disclaimer:
> Results have been estimated based on internal Intel analysis and are provided
> for informational purposes only. Any difference in system hardware or software
> design or configuration may affect actual performance.
>
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
>
--
Chuck Lever
* Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression
2023-09-08 14:43 ` Chuck Lever III
@ 2023-09-12 1:25 ` Oliver Sang
2023-09-12 13:01 ` Chuck Lever III
From: Oliver Sang @ 2023-09-12 1:25 UTC (permalink / raw)
To: Chuck Lever III
Cc: oe-lkp, kernel test robot, Linux Kernel Mailing List,
Christian Brauner, linux-mm, ying.huang, feng.tang, fengwei.yin,
oliver.sang
hi, Chuck Lever,
On Fri, Sep 08, 2023 at 02:43:22PM +0000, Chuck Lever III wrote:
>
>
> > On Sep 8, 2023, at 1:26 AM, kernel test robot <oliver.sang@intel.com> wrote:
> >
> >
> >
> > Hello,
> >
> > kernel test robot noticed a -19.0% regression of aim9.disk_src.ops_per_sec on:
> >
> >
> > commit: a2e459555c5f9da3e619b7e47a63f98574dc75f1 ("shmem: stable directory offsets")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >
> > testcase: aim9
> > test machine: 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 112G memory
> > parameters:
> >
> > testtime: 300s
> > test: disk_src
> > cpufreq_governor: performance
> >
> >
> > In addition to that, the commit also has significant impact on the following tests:
> >
> > +------------------+-------------------------------------------------------------------------------------------------+
> > | testcase: change | aim9: aim9.disk_src.ops_per_sec -14.6% regression |
> > | test machine | 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 112G memory |
> > | test parameters | cpufreq_governor=performance |
> > | | test=all |
> > | | testtime=5s |
> > +------------------+-------------------------------------------------------------------------------------------------+
> >
> >
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <oliver.sang@intel.com>
> > | Closes: https://lore.kernel.org/oe-lkp/202309081306.3ecb3734-oliver.sang@intel.com
>
> Hi, several weeks ago we requested that these tests be run
> again by the robot because they can't be run in environments
> I have available to me (the tests do not run on Fedora, and
> I don't have any big iron).
>
> We wanted the tests rerun before the patch was committed.
> There was a deafening silence. So I assumed the work I did
> then to address the regression was successful, and the
> patches are now in upstream Linux.
>
> This new report is disappointing.
I'm so sorry that I missed the test request for
https://lore.kernel.org/all/169030957098.157536.9938425508695693348.stgit@manet.1015granger.net/
Just FYI, when this auto-bisect was done, the head commit of linus/master
[65d6e954e37872fd9afb5ef3fc0481bb3c2f20f4] was also tested. It already includes
2be4f05af71bb ("libfs: Remove parent dentry locking in offset_iterate_dir()"),
and in our tests the regression still exists.
>
> But, I'm still in a position where I can't run this test,
> and the results don't really indicate where the problem
> is. So I can't possibly address this issue.
>
> Any suggestions, advice, or help would be appreciated.
If you have a further fix patch, could you let us know? I will test it.
>
> --
> Chuck Lever
>
>
* Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression
2023-09-12 1:25 ` Oliver Sang
@ 2023-09-12 13:01 ` Chuck Lever III
2023-09-12 13:19 ` Oliver Sang
2023-09-12 15:14 ` Feng Tang
From: Chuck Lever III @ 2023-09-12 13:01 UTC (permalink / raw)
To: Oliver Sang
Cc: oe-lkp, kernel test robot, Linux Kernel Mailing List,
Christian Brauner, linux-mm, ying.huang, feng.tang, fengwei.yin
> On Sep 11, 2023, at 9:25 PM, Oliver Sang <oliver.sang@intel.com> wrote:
>
> hi, Chuck Lever,
>
> On Fri, Sep 08, 2023 at 02:43:22PM +0000, Chuck Lever III wrote:
>>
>>
>>> On Sep 8, 2023, at 1:26 AM, kernel test robot <oliver.sang@intel.com> wrote:
>>>
>>>
>>>
>>> Hello,
>>>
>>> kernel test robot noticed a -19.0% regression of aim9.disk_src.ops_per_sec on:
>>>
>>>
>>> commit: a2e459555c5f9da3e619b7e47a63f98574dc75f1 ("shmem: stable directory offsets")
>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>>
>>> testcase: aim9
>>> test machine: 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 112G memory
>>> parameters:
>>>
>>> testtime: 300s
>>> test: disk_src
>>> cpufreq_governor: performance
>>>
>>>
>>> In addition to that, the commit also has significant impact on the following tests:
>>>
>>> +------------------+-------------------------------------------------------------------------------------------------+
>>> | testcase: change | aim9: aim9.disk_src.ops_per_sec -14.6% regression |
>>> | test machine | 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 112G memory |
>>> | test parameters | cpufreq_governor=performance |
>>> | | test=all |
>>> | | testtime=5s |
>>> +------------------+-------------------------------------------------------------------------------------------------+
>>>
>>>
>>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>>> the same patch/commit), kindly add following tags
>>> | Reported-by: kernel test robot <oliver.sang@intel.com>
>>> | Closes: https://lore.kernel.org/oe-lkp/202309081306.3ecb3734-oliver.sang@intel.com
>>
>> Hi, several weeks ago we requested that these tests be run
>> again by the robot because they can't be run in environments
>> I have available to me (the tests do not run on Fedora, and
>> I don't have any big iron).
>>
>> We wanted the tests rerun before the patch was committed.
>> There was a deafening silence. So I assumed the work I did
>> then to address the regression was successful, and the
>> patches are now in upstream Linux.
>>
>> This new report is disappointing.
>
> I'm so sorry that I missed the test request for
> https://lore.kernel.org/all/169030957098.157536.9938425508695693348.stgit@manet.1015granger.net/
>
> just FYI, when this auto-bisect done,
> head commit of linus/master: [65d6e954e37872fd9afb5ef3fc0481bb3c2f20f4] was
> tested, it already includes
> 2be4f05af71bb libfs: Remove parent dentry locking in offset_iterate_dir()
>
> in our tests, the regression still exists.
Thanks for clarifying. I wondered about that, of course only just
after clicking "Send".
>> But, I'm still in a position where I can't run this test,
>> and the results don't really indicate where the problem
>> is. So I can't possibly address this issue.
>>
>> Any suggestions, advice, or help would be appreciated.
>
> if you have further fix patch, could you let us know? I will test it.
Well that's the problem. Since I can't run the reproducer, there's
nothing I can do to troubleshoot the problem myself.
Is there any hope in getting this reproducer to run on Fedora?
--
Chuck Lever
* Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression
2023-09-12 13:01 ` Chuck Lever III
@ 2023-09-12 13:19 ` Oliver Sang
2023-09-12 15:14 ` Feng Tang
From: Oliver Sang @ 2023-09-12 13:19 UTC (permalink / raw)
To: Chuck Lever III
Cc: oe-lkp, kernel test robot, Linux Kernel Mailing List,
Christian Brauner, linux-mm, ying.huang, feng.tang, fengwei.yin,
oliver.sang
hi, Chuck Lever,
On Tue, Sep 12, 2023 at 01:01:29PM +0000, Chuck Lever III wrote:
>
>
> > in our tests, the regression still exists.
>
> Thanks for clarifying. I wondered about that, of course only just
> after clicking "Send".
>
>
> >> But, I'm still in a position where I can't run this test,
> >> and the results don't really indicate where the problem
> >> is. So I can't possibly address this issue.
> >>
> >> Any suggestions, advice, or help would be appreciated.
> >
> > if you have further fix patch, could you let us know? I will test it.
>
> Well that's the problem. Since I can't run the reproducer, there's
> nothing I can do to troubleshoot the problem myself.
>
> Is there any hope in getting this reproducer to run on Fedora?
Sorry about this; it may be hard for us to support Fedora in the short term.
BTW, we are investigating this regression now and hope to update you
later. Thanks
>
>
> --
> Chuck Lever
>
>
* Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression
2023-09-12 13:01 ` Chuck Lever III
2023-09-12 13:19 ` Oliver Sang
@ 2023-09-12 15:14 ` Feng Tang
2023-09-12 15:26 ` Chuck Lever III
` (2 more replies)
From: Feng Tang @ 2023-09-12 15:14 UTC (permalink / raw)
To: Chuck Lever III
Cc: Sang, Oliver, oe-lkp, lkp, Linux Kernel Mailing List,
Christian Brauner, linux-mm, Huang, Ying, Yin, Fengwei
Hi Chuck Lever,
On Tue, Sep 12, 2023 at 09:01:29PM +0800, Chuck Lever III wrote:
>
>
> > On Sep 11, 2023, at 9:25 PM, Oliver Sang <oliver.sang@intel.com> wrote:
> >
> > hi, Chuck Lever,
> >
> > On Fri, Sep 08, 2023 at 02:43:22PM +0000, Chuck Lever III wrote:
> >>
> >>
> >>> On Sep 8, 2023, at 1:26 AM, kernel test robot <oliver.sang@intel.com> wrote:
> >>>
> >>>
> >>>
> >>> Hello,
> >>>
> >>> kernel test robot noticed a -19.0% regression of aim9.disk_src.ops_per_sec on:
> >>>
> >>>
> >>> commit: a2e459555c5f9da3e619b7e47a63f98574dc75f1 ("shmem: stable directory offsets")
> >>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >>>
> >>> testcase: aim9
> >>> test machine: 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 112G memory
> >>> parameters:
> >>>
> >>> testtime: 300s
> >>> test: disk_src
> >>> cpufreq_governor: performance
> >>>
> >>>
> >>> In addition to that, the commit also has significant impact on the following tests:
> >>>
> >>> +------------------+-------------------------------------------------------------------------------------------------+
> >>> | testcase: change | aim9: aim9.disk_src.ops_per_sec -14.6% regression |
> >>> | test machine | 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 112G memory |
> >>> | test parameters | cpufreq_governor=performance |
> >>> | | test=all |
> >>> | | testtime=5s |
> >>> +------------------+-------------------------------------------------------------------------------------------------+
> >>>
> >>>
> >>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> >>> the same patch/commit), kindly add following tags
> >>> | Reported-by: kernel test robot <oliver.sang@intel.com>
> >>> | Closes: https://lore.kernel.org/oe-lkp/202309081306.3ecb3734-oliver.sang@intel.com
> >> But, I'm still in a position where I can't run this test,
> >> and the results don't really indicate where the problem
> >> is. So I can't possibly address this issue.
> >>
> >> Any suggestions, advice, or help would be appreciated.
> >
> > if you have further fix patch, could you let us know? I will test it.
>
> Well that's the problem. Since I can't run the reproducer, there's
> nothing I can do to troubleshoot the problem myself.
We dug more into the perf and other profiling data from the 0Day server
running this case, and it seems that the new simple_offset_add()
called by shmem_mknod() brings extra cost related to slab allocation,
specifically from the 'radix_tree_node' cache, which causes the regression.
Here is a slabinfo diff between commit a2e459555c5f and its parent:
23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
---------------- ---------------------------
26363 +40.2% 36956 slabinfo.radix_tree_node.active_objs
941.00 +40.4% 1321 slabinfo.radix_tree_node.active_slabs
26363 +40.3% 37001 slabinfo.radix_tree_node.num_objs
941.00 +40.4% 1321 slabinfo.radix_tree_node.num_slabs
The perf profile also shows some differences:
0.01 ±223% +0.1 0.10 ± 28% pp.self.shuffle_freelist
0.00 +0.1 0.11 ± 40% pp.self.xas_create
0.00 +0.1 0.12 ± 27% pp.self.xas_find_marked
0.00 +0.1 0.14 ± 18% pp.self.xas_alloc
0.03 ±103% +0.1 0.17 ± 29% pp.self.xas_descend
0.00 +0.2 0.16 ± 23% pp.self.xas_expand
0.10 ± 22% +0.2 0.27 ± 16% pp.self.rcu_segcblist_enqueue
0.92 ± 35% +0.3 1.22 ± 11% pp.self.kmem_cache_free
0.00 +0.4 0.36 ± 16% pp.self.xas_store
0.32 ± 30% +0.4 0.71 ± 12% pp.self.__call_rcu_common
0.18 ± 27% +0.5 0.65 ± 8% pp.self.kmem_cache_alloc_lru
0.36 ± 79% +0.6 0.96 ± 15% pp.self.__slab_free
0.00 +0.8 0.80 ± 14% pp.self.radix_tree_node_rcu_free
0.00 +1.0 1.01 ± 16% pp.self.radix_tree_node_ctor
Part of the perf profile from a2e459555c5f:
- 17.09% 0.09% singleuser [kernel.kallsyms] [k] path_openat
- 16.99% path_openat
- 12.23% open_last_lookups
- 11.33% lookup_open.isra.0
- 9.05% shmem_mknod
- 5.11% simple_offset_add
- 4.95% __xa_alloc_cyclic
- 4.88% __xa_alloc
- 4.76% xas_store
- xas_create
- 2.40% xas_expand.constprop.0
- 2.01% xas_alloc
- kmem_cache_alloc_lru
- 1.28% ___slab_alloc
- 1.22% allocate_slab
- 1.19% shuffle_freelist
- 1.04% setup_object
radix_tree_node_ctor
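
For reference, here is a rough sketch (simplified and with illustrative
names, not the verbatim fs/libfs.c code) of the allocation pattern that
call chain corresponds to: each file created under shmem gets a
directory offset allocated cyclically from a per-directory xarray, and
each unlink erases it again. The radix_tree_node cost above comes from
the nodes these xarray operations allocate and free.

#include <linux/xarray.h>

struct offset_ctx_sketch {
        struct xarray xa;       /* directory offset -> dentry */
        u32 next_offset;        /* cyclic allocation cursor */
};

static int offset_add_sketch(struct offset_ctx_sketch *octx, void *dentry)
{
        /* offsets 0 and 1 are assumed reserved for "." and ".." */
        struct xa_limit limit = XA_LIMIT(2, U32_MAX);
        u32 offset;
        int ret;

        /* the __xa_alloc_cyclic() -> xas_store() chain in the profile */
        ret = xa_alloc_cyclic(&octx->xa, &offset, dentry, limit,
                              &octx->next_offset, GFP_KERNEL);
        return ret < 0 ? ret : 0;
}

static void offset_remove_sketch(struct offset_ctx_sketch *octx, u32 offset)
{
        /* the xa_erase() -> xas_store() path seen under shmem_unlink() */
        xa_erase(&octx->xa, offset);
}
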
Please let me know if you need more info.
>
> Is there any hope in getting this reproducer to run on Fedora?
I haven't managed to reproduce it locally myself; I will keep trying
tomorrow.
Thanks,
Feng
>
> --
> Chuck Lever
>
>
* Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression
2023-09-12 15:14 ` Feng Tang
@ 2023-09-12 15:26 ` Chuck Lever III
2023-09-12 16:01 ` Matthew Wilcox
2023-09-13 6:47 ` Feng Tang
From: Chuck Lever III @ 2023-09-12 15:26 UTC (permalink / raw)
To: Feng Tang
Cc: Sang, Oliver, oe-lkp, lkp, Linux Kernel Mailing List,
Christian Brauner, linux-mm, Huang, Ying, Yin, Fengwei
> On Sep 12, 2023, at 11:14 AM, Feng Tang <feng.tang@intel.com> wrote:
>
> Hi Chuck Lever,
>
> On Tue, Sep 12, 2023 at 09:01:29PM +0800, Chuck Lever III wrote:
>>
>>
>>> On Sep 11, 2023, at 9:25 PM, Oliver Sang <oliver.sang@intel.com> wrote:
>>>
>>> hi, Chuck Lever,
>>>
>>> On Fri, Sep 08, 2023 at 02:43:22PM +0000, Chuck Lever III wrote:
>>>>
>>>>
>>>>> On Sep 8, 2023, at 1:26 AM, kernel test robot <oliver.sang@intel.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> Hello,
>>>>>
>>>>> kernel test robot noticed a -19.0% regression of aim9.disk_src.ops_per_sec on:
>>>>>
>>>>>
>>>>> commit: a2e459555c5f9da3e619b7e47a63f98574dc75f1 ("shmem: stable directory offsets")
>>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>>>>
>>>>> testcase: aim9
>>>>> test machine: 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 112G memory
>>>>> parameters:
>>>>>
>>>>> testtime: 300s
>>>>> test: disk_src
>>>>> cpufreq_governor: performance
>>>>>
>>>>>
>>>>> In addition to that, the commit also has significant impact on the following tests:
>>>>>
>>>>> +------------------+-------------------------------------------------------------------------------------------------+
>>>>> | testcase: change | aim9: aim9.disk_src.ops_per_sec -14.6% regression |
>>>>> | test machine | 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 112G memory |
>>>>> | test parameters | cpufreq_governor=performance |
>>>>> | | test=all |
>>>>> | | testtime=5s |
>>>>> +------------------+-------------------------------------------------------------------------------------------------+
>>>>>
>>>>>
>>>>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>>>>> the same patch/commit), kindly add following tags
>>>>> | Reported-by: kernel test robot <oliver.sang@intel.com>
>>>>> | Closes: https://lore.kernel.org/oe-lkp/202309081306.3ecb3734-oliver.sang@intel.com
>
>>>> But, I'm still in a position where I can't run this test,
>>>> and the results don't really indicate where the problem
>>>> is. So I can't possibly address this issue.
>>>>
>>>> Any suggestions, advice, or help would be appreciated.
>>>
>>> if you have further fix patch, could you let us know? I will test it.
>>
>> Well that's the problem. Since I can't run the reproducer, there's
>> nothing I can do to troubleshoot the problem myself.
>
> We dug more into the perf and other profiling data from 0Day server
> running this case, and it seems that the new simple_offset_add()
> called by shmem_mknod() brings extra cost related with slab,
> specifically the 'radix_tree_node', which cause the regression.
Thank you! Will ponder.
> Here is some slabinfo diff for commit a2e459555c5f and its parent:
>
> 23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
> ---------------- ---------------------------
>
> 26363 +40.2% 36956 slabinfo.radix_tree_node.active_objs
> 941.00 +40.4% 1321 slabinfo.radix_tree_node.active_slabs
> 26363 +40.3% 37001 slabinfo.radix_tree_node.num_objs
> 941.00 +40.4% 1321 slabinfo.radix_tree_node.num_slabs
>
> Also the perf profile show some difference
>
> 0.01 ±223% +0.1 0.10 ± 28% pp.self.shuffle_freelist
> 0.00 +0.1 0.11 ± 40% pp.self.xas_create
> 0.00 +0.1 0.12 ± 27% pp.self.xas_find_marked
> 0.00 +0.1 0.14 ± 18% pp.self.xas_alloc
> 0.03 ±103% +0.1 0.17 ± 29% pp.self.xas_descend
> 0.00 +0.2 0.16 ± 23% pp.self.xas_expand
> 0.10 ± 22% +0.2 0.27 ± 16% pp.self.rcu_segcblist_enqueue
> 0.92 ± 35% +0.3 1.22 ± 11% pp.self.kmem_cache_free
> 0.00 +0.4 0.36 ± 16% pp.self.xas_store
> 0.32 ± 30% +0.4 0.71 ± 12% pp.self.__call_rcu_common
> 0.18 ± 27% +0.5 0.65 ± 8% pp.self.kmem_cache_alloc_lru
> 0.36 ± 79% +0.6 0.96 ± 15% pp.self.__slab_free
> 0.00 +0.8 0.80 ± 14% pp.self.radix_tree_node_rcu_free
> 0.00 +1.0 1.01 ± 16% pp.self.radix_tree_node_ctor
>
> Some perf profile from a2e459555c5f is:
>
> - 17.09% 0.09% singleuser [kernel.kallsyms] [k] path_openat
> - 16.99% path_openat
> - 12.23% open_last_lookups
> - 11.33% lookup_open.isra.0
> - 9.05% shmem_mknod
> - 5.11% simple_offset_add
> - 4.95% __xa_alloc_cyclic
> - 4.88% __xa_alloc
> - 4.76% xas_store
> - xas_create
> - 2.40% xas_expand.constprop.0
> - 2.01% xas_alloc
> - kmem_cache_alloc_lru
> - 1.28% ___slab_alloc
> - 1.22% allocate_slab
> - 1.19% shuffle_freelist
> - 1.04% setup_object
> radix_tree_node_ctor
>
> Please let me know if you need more info.
>
>>
>> Is there any hope in getting this reproducer to run on Fedora?
>
> Myself haven't succeeded to reproduce it locally, will keep trying
> it tomorrow.
>
> Thanks,
> Feng
>
>>
>> --
>> Chuck Lever
--
Chuck Lever
* Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression
2023-09-12 15:14 ` Feng Tang
2023-09-12 15:26 ` Chuck Lever III
@ 2023-09-12 16:01 ` Matthew Wilcox
2023-09-12 16:27 ` Chuck Lever III
` (2 more replies)
2023-09-13 6:47 ` Feng Tang
From: Matthew Wilcox @ 2023-09-12 16:01 UTC (permalink / raw)
To: Feng Tang
Cc: Chuck Lever III, Sang, Oliver, oe-lkp, lkp,
Linux Kernel Mailing List, Christian Brauner, linux-mm, Huang,
Ying, Yin, Fengwei, Liam R. Howlett
On Tue, Sep 12, 2023 at 11:14:42PM +0800, Feng Tang wrote:
> > Well that's the problem. Since I can't run the reproducer, there's
> > nothing I can do to troubleshoot the problem myself.
>
> We dug more into the perf and other profiling data from 0Day server
> running this case, and it seems that the new simple_offset_add()
> called by shmem_mknod() brings extra cost related with slab,
> specifically the 'radix_tree_node', which cause the regression.
>
> Here is some slabinfo diff for commit a2e459555c5f and its parent:
>
> 23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
> ---------------- ---------------------------
>
> 26363 +40.2% 36956 slabinfo.radix_tree_node.active_objs
> 941.00 +40.4% 1321 slabinfo.radix_tree_node.active_slabs
> 26363 +40.3% 37001 slabinfo.radix_tree_node.num_objs
> 941.00 +40.4% 1321 slabinfo.radix_tree_node.num_slabs
I can't find the benchmark source, but my suspicion is that this
creates and deletes a lot of files in a directory. The 'stable
directory offsets' series uses xa_alloc_cyclic(), so we'll end up
with a very sparse radix tree, i.e. it'll look something like this:
0 - "."
1 - ".."
6 - "d"
27 - "y"
4000 - "fzz"
65537 - "czzz"
643289767 - "bzzzzzz"
(I didn't work out the names precisely here, but this is approximately
what you'd get if you create files a-z, aa-zz, aaa-zzz, etc. and delete
almost all of them)
The radix tree does not handle this well. It'll allocate one node for:
entries 0-63 (covers the first 4 entries)
entries 0-4095
entries 3968-4031 (the first 5)
entries 0-262143
entries 65536-69631
entries 65536-65599 (the first 6)
entries 0-16777215
entries 0-1073741823
entries 637534208-654311423
entries 643039232-643301375
entries 643289088-643293183
entries 643289728-643289791 (all 7)
That ends up being 12 nodes (you get 7 nodes per page) to store 7
pointers. Admittedly to get here, you have to do 643289765 creations
and nearly as many deletions, so are we going to see it in a
non-benchmark situation?
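
If anyone wants to check that arithmetic, below is a stand-alone
userspace sketch (assuming the usual RADIX_TREE_MAP_SHIFT of 6, i.e.
64 slots per node) that counts the distinct nodes needed to hold the
seven surviving offsets from the example:

#include <stdio.h>

#define MAP_SHIFT 6     /* 64 slots per radix tree node */

int main(void)
{
        unsigned long idx[] = { 0, 1, 6, 27, 4000, 65537, 643289767UL };
        int n = sizeof(idx) / sizeof(idx[0]);
        int height = 5;         /* 64^5 slots covers the largest offset */
        int nodes = 0;

        for (int level = 1; level <= height; level++) {
                unsigned long seen[16];
                int distinct = 0;

                for (int i = 0; i < n; i++) {
                        /* which node at this level would hold idx[i]? */
                        unsigned long node = idx[i] >> (level * MAP_SHIFT);
                        int dup = 0;

                        for (int j = 0; j < distinct; j++)
                                if (seen[j] == node)
                                        dup = 1;
                        if (!dup)
                                seen[distinct++] = node;
                }
                nodes += distinct;
        }
        printf("radix tree nodes needed: %d\n", nodes);  /* prints 12 */
        return 0;
}
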
The maple tree is more resilient against this kind of shenanigan, but
we're not there in terms of supporting the kind of allocation you
want. For this kind of allocation pattern, you'd get all 7 pointers
in a single 256-byte node.
* Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression
2023-09-12 16:01 ` Matthew Wilcox
@ 2023-09-12 16:27 ` Chuck Lever III
2023-09-13 17:45 ` Chuck Lever III
2024-01-04 19:33 ` Chuck Lever III
From: Chuck Lever III @ 2023-09-12 16:27 UTC (permalink / raw)
To: Matthew Wilcox, Feng Tang
Cc: Sang, Oliver, oe-lkp, lkp, Linux Kernel Mailing List,
Christian Brauner, linux-mm, Huang, Ying, Yin, Fengwei,
Liam Howlett
> On Sep 12, 2023, at 12:01 PM, Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Sep 12, 2023 at 11:14:42PM +0800, Feng Tang wrote:
>>> Well that's the problem. Since I can't run the reproducer, there's
>>> nothing I can do to troubleshoot the problem myself.
>>
>> We dug more into the perf and other profiling data from 0Day server
>> running this case, and it seems that the new simple_offset_add()
>> called by shmem_mknod() brings extra cost related with slab,
>> specifically the 'radix_tree_node', which cause the regression.
>>
>> Here is some slabinfo diff for commit a2e459555c5f and its parent:
>>
>> 23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
>> ---------------- ---------------------------
>>
>> 26363 +40.2% 36956 slabinfo.radix_tree_node.active_objs
>> 941.00 +40.4% 1321 slabinfo.radix_tree_node.active_slabs
>> 26363 +40.3% 37001 slabinfo.radix_tree_node.num_objs
>> 941.00 +40.4% 1321 slabinfo.radix_tree_node.num_slabs
>
> I can't find the benchmark source, but my suspicion is that this
> creates and deletes a lot of files in a directory. The 'stable
> directory offsets' series uses xa_alloc_cyclic(), so we'll end up
> with a very sparse radix tree. ie it'll look something like this:
>
> 0 - "."
> 1 - ".."
> 6 - "d"
> 27 - "y"
> 4000 - "fzz"
> 65537 - "czzz"
> 643289767 - "bzzzzzz"
>
> (i didn't work out the names precisely here, but this is approximately
> what you'd get if you create files a-z, aa-zz, aaa-zzz, etc and delete
> almost all of them)
>
> The radix tree does not handle this well. It'll allocate one node for:
>
> entries 0-63 (covers the first 4 entries)
> entries 0-4095
> entries 3968-4031 (the first 5)
> entries 0-262143
> entries 65536-69631
> entries 65536-65599 (the first 6)
> entries 0-16777215
> entries 0-1073741823
> entries 637534208-654311423
> entries 643039232-643301375
> entries 643289088-643293183
> entries 643289728-643289791 (all 7)
>
> That ends up being 12 nodes (you get 7 nodes per page) to store 7
> pointers. Admittedly to get here, you have to do 643289765 creations
> and nearly as many deletions, so are we going to see it in a
> non-benchmark situation?
I would imagine that a system that is up for a long time might
see that kind of fragmentation in /tmp, eventually.
> The maple tree is more resilient against this kind of shenanigan, but
> we're not there in terms of supporting the kind of allocation you
> want. For this kind of allocation pattern, you'd get all 7 pointers
> in a single 256-byte node.
We can try switching to Maple tree when it's ready. The simple
directory offset API design, I hope, is reasonably resilient to
changes to the underlying data structure.
Meantime, I'm wondering if we can see the absolute performance
numbers rather than the percentage degradation. I'm curious
whether we're talking about the difference between 2 and 3 us per
operation, or something more substantial.
--
Chuck Lever
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression
2023-09-12 15:14 ` Feng Tang
2023-09-12 15:26 ` Chuck Lever III
2023-09-12 16:01 ` Matthew Wilcox
@ 2023-09-13 6:47 ` Feng Tang
2023-09-13 13:32 ` Chuck Lever III
2 siblings, 1 reply; 15+ messages in thread
From: Feng Tang @ 2023-09-13 6:47 UTC (permalink / raw)
To: Chuck Lever III, willy
Cc: Sang, Oliver, oe-lkp, lkp, Linux Kernel Mailing List,
Christian Brauner, linux-mm, Huang, Ying, Yin, Fengwei
[-- Attachment #1: Type: text/plain, Size: 4313 bytes --]
On Tue, Sep 12, 2023 at 11:14:42PM +0800, Tang, Feng wrote:
[...]
> >
> > Well that's the problem. Since I can't run the reproducer, there's
> > nothing I can do to troubleshoot the problem myself.
>
> We dug more into the perf and other profiling data from 0Day server
> running this case, and it seems that the new simple_offset_add()
> called by shmem_mknod() brings extra cost related with slab,
> specifically the 'radix_tree_node', which cause the regression.
>
> Here is some slabinfo diff for commit a2e459555c5f and its parent:
>
> 23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
> ---------------- ---------------------------
>
> 26363 +40.2% 36956 slabinfo.radix_tree_node.active_objs
> 941.00 +40.4% 1321 slabinfo.radix_tree_node.active_slabs
> 26363 +40.3% 37001 slabinfo.radix_tree_node.num_objs
> 941.00 +40.4% 1321 slabinfo.radix_tree_node.num_slabs
>
> Also the perf profile show some difference
>
> 0.01 ±223% +0.1 0.10 ± 28% pp.self.shuffle_freelist
> 0.00 +0.1 0.11 ± 40% pp.self.xas_create
> 0.00 +0.1 0.12 ± 27% pp.self.xas_find_marked
> 0.00 +0.1 0.14 ± 18% pp.self.xas_alloc
> 0.03 ±103% +0.1 0.17 ± 29% pp.self.xas_descend
> 0.00 +0.2 0.16 ± 23% pp.self.xas_expand
> 0.10 ± 22% +0.2 0.27 ± 16% pp.self.rcu_segcblist_enqueue
> 0.92 ± 35% +0.3 1.22 ± 11% pp.self.kmem_cache_free
> 0.00 +0.4 0.36 ± 16% pp.self.xas_store
> 0.32 ± 30% +0.4 0.71 ± 12% pp.self.__call_rcu_common
> 0.18 ± 27% +0.5 0.65 ± 8% pp.self.kmem_cache_alloc_lru
> 0.36 ± 79% +0.6 0.96 ± 15% pp.self.__slab_free
> 0.00 +0.8 0.80 ± 14% pp.self.radix_tree_node_rcu_free
> 0.00 +1.0 1.01 ± 16% pp.self.radix_tree_node_ctor
>
> Some perf profile from a2e459555c5f is:
>
> - 17.09% 0.09% singleuser [kernel.kallsyms] [k] path_openat
> - 16.99% path_openat
> - 12.23% open_last_lookups
> - 11.33% lookup_open.isra.0
> - 9.05% shmem_mknod
> - 5.11% simple_offset_add
> - 4.95% __xa_alloc_cyclic
> - 4.88% __xa_alloc
> - 4.76% xas_store
> - xas_create
> - 2.40% xas_expand.constprop.0
> - 2.01% xas_alloc
> - kmem_cache_alloc_lru
> - 1.28% ___slab_alloc
> - 1.22% allocate_slab
> - 1.19% shuffle_freelist
> - 1.04% setup_object
> radix_tree_node_ctor
>
> Please let me know if you need more info.
>
> >
> > Is there any hope in getting this reproducer to run on Fedora?
>
> Myself haven't succeeded to reproduce it locally, will keep trying
> it tomorrow.
It can be reproduced on a local machine with CentOS 9 (similar to
Fedora?); the steps are:
* download source code
$ wget https://sourceforge.net/projects/aimbench/files/aim-suite9/Initial%20release/s9110.tar.Z
* untar the file and run 'make'
* create 2 files 's9workfile' and 'test.config'
$ cat s9workfile
# @(#) s9workfile:1.2 1/22/96 00:00:00
# AIM Independent Resource Benchmark - Suite IX Workfile
FILESIZE: 5M
disk_src
$ cat test.config
test
disk_src
200s
/dev/shm/
* run the test with cmd "./singleuser -nl < test.config"
The test case here is 'disk_src', so I picked one file 'disk_src.c'
and attached it for quick reference.
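For a quick look at what the test exercises without the full AIM9
harness, a stripped-down approximation of the disk_src inner loop is
below (this is not the benchmark itself; the file count, directory and
duration are arbitrary stand-ins):

/* Repeatedly creat() and unlink() files on tmpfs, roughly what
 * disk_src does per pass (it also stat()s a set of existing files,
 * which is omitted here).
 */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define NFILES	75		/* disk_src touches 75 names per pass */
#define RUNTIME	200		/* seconds, matching test.config */

int main(void)
{
	char name[64];
	unsigned long passes = 0;
	time_t end = time(NULL) + RUNTIME;

	if (chdir("/dev/shm") < 0) {
		perror("chdir");
		return 1;
	}
	while (time(NULL) < end) {
		for (int i = 0; i < NFILES; i++) {
			int fd;

			snprintf(name, sizeof(name), "dsrc-%d", i);
			fd = creat(name, 0666);
			if (fd < 0) {
				perror("creat");
				return 1;
			}
			close(fd);
			unlink(name);
		}
		passes++;
	}
	printf("%.2f passes/sec\n", (double)passes / RUNTIME);
	return 0;
}

Comparing passes/sec before and after a2e459555c5f should show a
similar trend, though the absolute numbers will differ from the AIM9
score.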
The kernel config of my CentOS is different from what 0Day used, so
the perf profile and performance score are a little different, but
the regression trend is the same: commit a2e459555c5f shows about a
20% drop.
Also, the test platform doesn't matter; I tried several generations
of Xeon servers, which can all reproduce it.
Thanks,
Feng
[-- Attachment #2: disk_src.c --]
[-- Type: text/x-csrc, Size: 9319 bytes --]
/****************************************************************
** **
** Copyright (c) 1996 - 2001 Caldera International, Inc. **
** All Rights Reserved. **
** **
** This program is free software; you can redistribute it **
** and/or modify it under the terms of the GNU General Public **
** License as published by the Free Software Foundation; **
** either version 2 of the License, or (at your option) any **
** later version. **
** **
** This program is distributed in the hope that it will be **
** useful, but WITHOUT ANY WARRANTY; without even the implied **
** warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR **
** PURPOSE. See the GNU General Public License for more **
** details. **
** **
** You should have received a copy of the GNU General Public **
** License along with this program; if not, write to the Free **
** Software Foundation, Inc., 59 Temple Place, Suite 330, **
** Boston, MA 02111-1307 USA **
** **
****************************************************************/
#define _POSIX_SOURCE 1 /* turn on POSIX funct'ns */
#include <stdio.h> /* enable printf(), etc. */
#include <unistd.h> /* for chdir(), etc. */
#include <stdlib.h> /* for malloc(), etc. */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h> /* required for creat */
#include <signal.h>
#include "suite.h" /* our goodies */
static int disk_src();
COUNT_START /* declare counters */
source_file * disk_src_c()
{
static source_file s = { " @(#) disk_src.c:1.10 3/4/94 17:21:22", /* SCCS info */
__FILE__, __DATE__, __TIME__
};
register_test("disk_src", "DISKS", disk_src, 75, "Directory Searches"); /* 100 c and s files in fakeh/dirlist */
return &s;
}
enum choices { STAT = 0, CREAT, MCHOICE }; /* encode choices here */
#define FAKEH "fakeh"
#define MFILES 75 /* number of entries */
#define MYBUF 160 /* MAX size of input line */
#define MSCR 5 /* number of scramble passes */
/*
* dsearch exercises the directory search mechanism of unix systems.
* it is called by the disk test program. dsearch assumes that it is
* invoked with its current directory set to the parent directory
* of the hand created directory that is distributed with the benchmark.
* it assumes that in this directory is a file "dirlist" that provides a list
* of file names under the current directory, along with a list of names to
* search for. some of these names are to be stat'ed while some are to be
* creat'ed
*/
unsigned long mrand();
int get_list(FILE * file,
char *list[MCHOICE][MFILES]);
void scramble(char *list[],
int num);
void cl_list(char *list[MCHOICE][MFILES]);
void errdump(int line,
char *str);
int
dsearch(char *fakeh_dir)
{
FILE *fp; /* file containing filenames */
int
fd, /* file descriptor for creat, etc. */
index; /* loop variable */
struct stat stbuf; /* stat buffer */
char
cwd[256], /* hold current working dir */
errbuf[80], /* build error msgs in here */
*flist[MCHOICE][MFILES]; /* the list of target files */
if (getcwd(cwd, 256) == NULL) {
fprintf(stderr,
"dsearch(): can't get current working directory\n");
return (-1);
}
if (chdir(fakeh_dir) < 0) { /* move to directory */
perror("dsearch()"); /* if error, print it */
errdump(__LINE__, "dsearch(): directory 'fakeh' is inaccessable\n"); /* and dump */
return (-1); /* return failure */
}
if ((fp = fopen("dirlist", "r")) == NULL) { /* open list of filenames */
errdump(__LINE__, "dsearch(): file 'dirlist' is inaccessable\n"); /* handle error */
chdir(cwd); /* move back up */
return (-1); /* return error */
}
/*
* end of error processing
*/
if (get_list(fp, flist) < 0) { /* load the list */
errdump(__LINE__, "dsearch(): file 'dirlist' is corrupted\n"); /* handle errors */
chdir(cwd); /* go back up */
cl_list(flist); /* close list */
return (-1); /* return error */
}
fclose(fp); /* close list of filenames */
scramble(flist[STAT], MFILES); /* scramble names */
scramble(flist[CREAT], MFILES); /* scramble names */
for (index = 0; index < MFILES; index++) { /* loop through files */
if (flist[STAT][index] != NULL) { /* if not null */
if (stat(flist[STAT][index], &stbuf) < 0) { /* stat the file */
perror("stat() in dsearch()"); /* handle errors */
sprintf(errbuf, "dsearch(): can't stat '%s'\n", /* create error message */
flist[STAT][index]);
errdump(__LINE__, errbuf); /* print it */
chdir(cwd); /* return to proper dir */
cl_list(flist); /* clear list */
return (-1); /* return error */
} /* end of error */
}
/*
* end of if not null
*/
if (flist[CREAT][index] != NULL) { /* if creating */
if ((fd = creat(flist[CREAT][index], S_IRWXU | S_IRWXG | S_IRWXO)) < 0) { /* try create */
perror("creat() in dsearch()"); /* handle error */
sprintf(errbuf, "dsearch():can't creat '%s'\n", /* build error message */
flist[CREAT][index]);
errdump(__LINE__, errbuf); /* print it */
chdir(cwd); /* change directories */
cl_list(flist); /* clear list */
return (-1); /* return error */
} /* end of error */
close(fd); /* close the file */
if (unlink(flist[CREAT][index])) { /* unlink it */
perror("unlink() in dsearch()"); /* handle error */
sprintf(errbuf, "dsearch():can't unlink '%s'\n", /* build error message */
flist[CREAT][index]);
errdump(__LINE__, errbuf); /* print it */
chdir(cwd); /* change directories */
cl_list(flist); /* clear list */
return (-1); /* return error */
} /* end of error */
}
/*
* end if creating
*/
COUNT_BUMP;
} /* end of for */
cl_list(flist); /* clear list */
chdir(cwd); /* go back up */
return (0); /* return no error */
}
int
get_list(FILE * file,
char *list[MCHOICE][MFILES])
{
char
buff[MYBUF], /* holds 1 line of input */
*tmp; /* holds malloc results */
int
s_index, c_index, i;
pid_t pid = getpid(); /* process ID, for unique file names */
s_index = c_index = 0; /* initialize indexes */
for (i = 0; i < MFILES; i++) /* initialize array */
list[STAT][i] = list[CREAT][i] = NULL; /* clear to empty */
while (fgets(buff, MYBUF - 1, file) != NULL) { /* get a line */
if (buff[0] != 's' && buff[0] != 'c') /* if it isn't legal, */
continue; /* ignore it */
buff[strlen(buff) - 1] = '\0'; /* eliminate trailing new line */
if ((tmp = malloc(strlen(buff) + 1 + 8)) == NULL) { /* allocate space */
cl_list(list); /* handle bad allocate */
return (-1); /* return error */
}
/*
* end of malloc error chking
*/
strcpy(tmp, buff + 2); /* copy name into buffer */
switch (buff[0]) { /* decide on operation */
case 's': /* if stat */
list[STAT][s_index++] = tmp; /* put it into array */
break; /* and leave */
case 'c': /* if CREAT; Tin Le */
sprintf(tmp, "%s%05d", (buff + 2), pid % 100000); /* make unique name, last 4 digits of pid */
list[CREAT][c_index++] = tmp; /* save it off */
break; /* and leave */
default: /* this cannot be */
errdump(__LINE__, "getlist(): Deadly error encountered\n"); /* print error message */
cl_list(list); /* clear list */
return (-1); /* return error here */
} /* end of switch */
} /* end of loop */
return (1); /* return success */
}
void
scramble(char *list[],
int num)
{
int
i, /* loop variable */
scount, /* scramble count */
rnum; /* random index for scramble */
char
*tmp; /* intermediate location */
for (scount = 0; scount < MSCR; scount++) { /* for number of scrambles */
for (i = 0; i < num; i++) { /* go through list */
rnum = mrand() % num; /* get an index */
tmp = list[i]; /* swap this one */
list[i] = list[rnum]; /* with that one */
list[rnum] = tmp; /* and we're done */
} /* loop through each one */
} /* for each pass */
}
unsigned long
mrand()
{ /* return integer random number */
return ((unsigned long)aim_rand()); /* do it */
}
void
cl_list(char *list[MCHOICE][MFILES])
{
int index; /* loop variable */
for (index = 0; index < MFILES; index++) { /* loop through all entries */
if (list[STAT][index] != NULL) /* if string resides here */
free(list[STAT][index]); /* free it to heap */
if (list[CREAT][index] != NULL) /* if string here */
free(list[CREAT][index]); /* free it */
} /* end of loop */
}
void
errdump(int line,
char *str)
{ /* print error message */
fprintf(stderr,
"Error in file %s (compiled at %s on %s) from line %d:\n\t%s",
__FILE__, __TIME__, __DATE__, line, str);
}
static int
disk_src(char *argv,
Result * res)
{
int i;
char fakeh_dir[128];
if (*argv)
sprintf(fakeh_dir, "%s/%s", argv, FAKEH);
else
strcpy(fakeh_dir, FAKEH);
i = dsearch(fakeh_dir);
COUNT_END("disk_src");
return (res->i = i);
}
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression
2023-09-13 6:47 ` Feng Tang
@ 2023-09-13 13:32 ` Chuck Lever III
0 siblings, 0 replies; 15+ messages in thread
From: Chuck Lever III @ 2023-09-13 13:32 UTC (permalink / raw)
To: Feng Tang
Cc: Matthew Wilcox, Sang, Oliver, oe-lkp, lkp,
Linux Kernel Mailing List, Christian Brauner, linux-mm, Huang,
Ying, Yin, Fengwei
> On Sep 13, 2023, at 2:47 AM, Feng Tang <feng.tang@intel.com> wrote:
>
> On Tue, Sep 12, 2023 at 11:14:42PM +0800, Tang, Feng wrote:
> [...]
>>>
>>> Well that's the problem. Since I can't run the reproducer, there's
>>> nothing I can do to troubleshoot the problem myself.
>>
>> We dug more into the perf and other profiling data from 0Day server
>> running this case, and it seems that the new simple_offset_add()
>> called by shmem_mknod() brings extra cost related with slab,
>> specifically the 'radix_tree_node', which cause the regression.
>>
>> Here is some slabinfo diff for commit a2e459555c5f and its parent:
>>
>> 23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
>> ---------------- ---------------------------
>>
>> 26363 +40.2% 36956 slabinfo.radix_tree_node.active_objs
>> 941.00 +40.4% 1321 slabinfo.radix_tree_node.active_slabs
>> 26363 +40.3% 37001 slabinfo.radix_tree_node.num_objs
>> 941.00 +40.4% 1321 slabinfo.radix_tree_node.num_slabs
>>
>> Also the perf profile show some difference
>>
>> 0.01 ±223% +0.1 0.10 ± 28% pp.self.shuffle_freelist
>> 0.00 +0.1 0.11 ± 40% pp.self.xas_create
>> 0.00 +0.1 0.12 ± 27% pp.self.xas_find_marked
>> 0.00 +0.1 0.14 ± 18% pp.self.xas_alloc
>> 0.03 ±103% +0.1 0.17 ± 29% pp.self.xas_descend
>> 0.00 +0.2 0.16 ± 23% pp.self.xas_expand
>> 0.10 ± 22% +0.2 0.27 ± 16% pp.self.rcu_segcblist_enqueue
>> 0.92 ± 35% +0.3 1.22 ± 11% pp.self.kmem_cache_free
>> 0.00 +0.4 0.36 ± 16% pp.self.xas_store
>> 0.32 ± 30% +0.4 0.71 ± 12% pp.self.__call_rcu_common
>> 0.18 ± 27% +0.5 0.65 ± 8% pp.self.kmem_cache_alloc_lru
>> 0.36 ± 79% +0.6 0.96 ± 15% pp.self.__slab_free
>> 0.00 +0.8 0.80 ± 14% pp.self.radix_tree_node_rcu_free
>> 0.00 +1.0 1.01 ± 16% pp.self.radix_tree_node_ctor
>>
>> Some perf profile from a2e459555c5f is:
>>
>> - 17.09% 0.09% singleuser [kernel.kallsyms] [k] path_openat
>> - 16.99% path_openat
>> - 12.23% open_last_lookups
>> - 11.33% lookup_open.isra.0
>> - 9.05% shmem_mknod
>> - 5.11% simple_offset_add
>> - 4.95% __xa_alloc_cyclic
>> - 4.88% __xa_alloc
>> - 4.76% xas_store
>> - xas_create
>> - 2.40% xas_expand.constprop.0
>> - 2.01% xas_alloc
>> - kmem_cache_alloc_lru
>> - 1.28% ___slab_alloc
>> - 1.22% allocate_slab
>> - 1.19% shuffle_freelist
>> - 1.04% setup_object
>> radix_tree_node_ctor
>>
>> Please let me know if you need more info.
>>
>>>
>>> Is there any hope in getting this reproducer to run on Fedora?
>>
>> Myself haven't succeeded to reproduce it locally, will keep trying
>> it tomorrow.
>
> It can be reproduced on a local machien with CentOS 9 (similar to
> Fedora ?),
Excellent, I appreciate this!
> and some steps as:
>
> * download source code
> $ wget https://sourceforge.net/projects/aimbench/files/aim-suite9/Initial%20release/s9110.tar.Z
>
> * untar the file and run 'make'
>
> * create 2 files 's9workfile' and 'test.config'
>
> $ cat s9workfile
> # @(#) s9workfile:1.2 1/22/96 00:00:00
> # AIM Independent Resource Benchmark - Suite IX Workfile
> FILESIZE: 5M
> disk_src
>
> $ cat test.config
> test
> disk_src
> 200s
> /dev/shm/
>
> * run the test with cmd "./singleuser -nl < test.config"
>
> The test case here is 'disk_src', so I picked one file 'disk_src.c'
> and attached it for quick reference.
>
> The kernel config of my CentOS is different from what 0Day used, so
> the perf-profile and peformance score are a little different, but
> the regression trend is the same, that commit a2e459555c5f has about
> 20% drop.
>
> Also the test platform doesn't matter, I tried on several generation
> of Xeon servers which can all reproduce it.
--
Chuck Lever
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression
2023-09-12 16:01 ` Matthew Wilcox
2023-09-12 16:27 ` Chuck Lever III
@ 2023-09-13 17:45 ` Chuck Lever III
2024-01-04 19:33 ` Chuck Lever III
2 siblings, 0 replies; 15+ messages in thread
From: Chuck Lever III @ 2023-09-13 17:45 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Feng Tang, Sang, Oliver, oe-lkp, lkp, Linux Kernel Mailing List,
Christian Brauner, linux-mm, Huang, Ying, Yin, Fengwei,
Liam Howlett
> On Sep 12, 2023, at 12:01 PM, Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Sep 12, 2023 at 11:14:42PM +0800, Feng Tang wrote:
>>> Well that's the problem. Since I can't run the reproducer, there's
>>> nothing I can do to troubleshoot the problem myself.
>>
>> We dug more into the perf and other profiling data from 0Day server
>> running this case, and it seems that the new simple_offset_add()
>> called by shmem_mknod() brings extra cost related with slab,
>> specifically the 'radix_tree_node', which cause the regression.
>>
>> Here is some slabinfo diff for commit a2e459555c5f and its parent:
>>
>> 23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
>> ---------------- ---------------------------
>>
>> 26363 +40.2% 36956 slabinfo.radix_tree_node.active_objs
>> 941.00 +40.4% 1321 slabinfo.radix_tree_node.active_slabs
>> 26363 +40.3% 37001 slabinfo.radix_tree_node.num_objs
>> 941.00 +40.4% 1321 slabinfo.radix_tree_node.num_slabs
>
> I can't find the benchmark source, but my suspicion is that this
> creates and deletes a lot of files in a directory. The 'stable
> directory offsets' series uses xa_alloc_cyclic(), so we'll end up
> with a very sparse radix tree. ie it'll look something like this:
>
> 0 - "."
> 1 - ".."
> 6 - "d"
> 27 - "y"
> 4000 - "fzz"
> 65537 - "czzz"
> 643289767 - "bzzzzzz"
>
> (i didn't work out the names precisely here, but this is approximately
> what you'd get if you create files a-z, aa-zz, aaa-zzz, etc and delete
> almost all of them)
>
> The radix tree does not handle this well. It'll allocate one node for:
>
> entries 0-63 (covers the first 4 entries)
> entries 0-4095
> entries 3968-4031 (the first 5)
> entries 0-262143
> entries 65536-69631
> entries 65536-65599 (the first 6)
> entries 0-16777215
> entries 0-1073741823
> entries 637534208-654311423
> entries 643039232-643301375
> entries 643289088-643293183
> entries 643289728-643289791 (all 7)
>
> That ends up being 12 nodes (you get 7 nodes per page) to store 7
> pointers.
I'm able to run the reproducer Feng provided. simple_offset_add()
nearly doubles the cost of shmem_mknod() thanks to the memory
allocations done in xas_create().
However, tmpfs is already fast compared to persistent filesystems.
For instance, even with the simple_offset patch applied:
tmpfs: 158079.00 Directory Searches/second
btrfs: 64978.88 Directory Searches/second
> Admittedly to get here, you have to do 643289765 creations
> and nearly as many deletions, so are we going to see it in a
> non-benchmark situation?
Most directories in a tmpfs have a limited lifespan and thus are
unlikely to live long enough to be affected by this issue. The
only one that has a rather unlimited lifespan is the root
directory.
It's hard for me to tell whether this is a pervasive problem
or one we can live with until we find a more suitable data
structure. IMO the benefit of having stable directory offsets
far outweighs the eventual slowdown in the root directory.
--
Chuck Lever
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression
2023-09-12 16:01 ` Matthew Wilcox
2023-09-12 16:27 ` Chuck Lever III
2023-09-13 17:45 ` Chuck Lever III
@ 2024-01-04 19:33 ` Chuck Lever III
2024-01-05 16:27 ` Liam R. Howlett
2 siblings, 1 reply; 15+ messages in thread
From: Chuck Lever III @ 2024-01-04 19:33 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Feng Tang, Sang, Oliver, oe-lkp, lkp, Linux Kernel Mailing List,
Christian Brauner, linux-mm, Huang, Ying, Yin, Fengwei,
Liam Howlett
> On Sep 12, 2023, at 12:01 PM, Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Sep 12, 2023 at 11:14:42PM +0800, Feng Tang wrote:
>>> Well that's the problem. Since I can't run the reproducer, there's
>>> nothing I can do to troubleshoot the problem myself.
>>
>> We dug more into the perf and other profiling data from 0Day server
>> running this case, and it seems that the new simple_offset_add()
>> called by shmem_mknod() brings extra cost related with slab,
>> specifically the 'radix_tree_node', which cause the regression.
>>
>> Here is some slabinfo diff for commit a2e459555c5f and its parent:
>>
>> 23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
>> ---------------- ---------------------------
>>
>> 26363 +40.2% 36956 slabinfo.radix_tree_node.active_objs
>> 941.00 +40.4% 1321 slabinfo.radix_tree_node.active_slabs
>> 26363 +40.3% 37001 slabinfo.radix_tree_node.num_objs
>> 941.00 +40.4% 1321 slabinfo.radix_tree_node.num_slabs
>
> I can't find the benchmark source, but my suspicion is that this
> creates and deletes a lot of files in a directory. The 'stable
> directory offsets' series uses xa_alloc_cyclic(), so we'll end up
> with a very sparse radix tree. ie it'll look something like this:
>
> 0 - "."
> 1 - ".."
> 6 - "d"
> 27 - "y"
> 4000 - "fzz"
> 65537 - "czzz"
> 643289767 - "bzzzzzz"
>
> (i didn't work out the names precisely here, but this is approximately
> what you'd get if you create files a-z, aa-zz, aaa-zzz, etc and delete
> almost all of them)
>
> The radix tree does not handle this well. It'll allocate one node for:
>
> entries 0-63 (covers the first 4 entries)
> entries 0-4095
> entries 3968-4031 (the first 5)
> entries 0-262143
> entries 65536-69631
> entries 65536-65599 (the first 6)
> entries 0-16777215
> entries 0-1073741823
> entries 637534208-654311423
> entries 643039232-643301375
> entries 643289088-643293183
> entries 643289728-643289791 (all 7)
>
> That ends up being 12 nodes (you get 7 nodes per page) to store 7
> pointers. Admittedly to get here, you have to do 643289765 creations
> and nearly as many deletions, so are we going to see it in a
> non-benchmark situation?
>
> The maple tree is more resilient against this kind of shenanigan, but
> we're not there in terms of supporting the kind of allocation you
> want. For this kind of allocation pattern, you'd get all 7 pointers
> in a single 256-byte node.
Hello Matthew, it's been a couple of kernel releases, so
following up.
Is Maple tree ready for libfs to use it for managing directory
offsets?
Should we just go for broke and convert libfs from xarray to
Maple tree now?
--
Chuck Lever
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression
2024-01-04 19:33 ` Chuck Lever III
@ 2024-01-05 16:27 ` Liam R. Howlett
2024-01-05 16:33 ` Chuck Lever III
0 siblings, 1 reply; 15+ messages in thread
From: Liam R. Howlett @ 2024-01-05 16:27 UTC (permalink / raw)
To: Chuck Lever III
Cc: Matthew Wilcox, Feng Tang, Sang, Oliver, oe-lkp, lkp,
Linux Kernel Mailing List, Christian Brauner, linux-mm, Huang,
Ying, Yin, Fengwei
* Chuck Lever III <chuck.lever@oracle.com> [240104 14:33]:
>
>
> > On Sep 12, 2023, at 12:01 PM, Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, Sep 12, 2023 at 11:14:42PM +0800, Feng Tang wrote:
> >>> Well that's the problem. Since I can't run the reproducer, there's
> >>> nothing I can do to troubleshoot the problem myself.
> >>
> >> We dug more into the perf and other profiling data from 0Day server
> >> running this case, and it seems that the new simple_offset_add()
> >> called by shmem_mknod() brings extra cost related with slab,
> >> specifically the 'radix_tree_node', which cause the regression.
> >>
> >> Here is some slabinfo diff for commit a2e459555c5f and its parent:
> >>
> >> 23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
> >> ---------------- ---------------------------
> >>
> >> 26363 +40.2% 36956 slabinfo.radix_tree_node.active_objs
> >> 941.00 +40.4% 1321 slabinfo.radix_tree_node.active_slabs
> >> 26363 +40.3% 37001 slabinfo.radix_tree_node.num_objs
> >> 941.00 +40.4% 1321 slabinfo.radix_tree_node.num_slabs
> >
> > I can't find the benchmark source, but my suspicion is that this
> > creates and deletes a lot of files in a directory. The 'stable
> > directory offsets' series uses xa_alloc_cyclic(), so we'll end up
> > with a very sparse radix tree. ie it'll look something like this:
> >
> > 0 - "."
> > 1 - ".."
> > 6 - "d"
> > 27 - "y"
> > 4000 - "fzz"
> > 65537 - "czzz"
> > 643289767 - "bzzzzzz"
> >
> > (i didn't work out the names precisely here, but this is approximately
> > what you'd get if you create files a-z, aa-zz, aaa-zzz, etc and delete
> > almost all of them)
> >
> > The radix tree does not handle this well. It'll allocate one node for:
> >
> > entries 0-63 (covers the first 4 entries)
> > entries 0-4095
> > entries 3968-4031 (the first 5)
> > entries 0-262143
> > entries 65536-69631
> > entries 65536-65599 (the first 6)
> > entries 0-16777215
> > entries 0-1073741823
> > entries 637534208-654311423
> > entries 643039232-643301375
> > entries 643289088-643293183
> > entries 643289728-643289791 (all 7)
> >
> > That ends up being 12 nodes (you get 7 nodes per page) to store 7
> > pointers. Admittedly to get here, you have to do 643289765 creations
> > and nearly as many deletions, so are we going to see it in a
> > non-benchmark situation?
> >
> > The maple tree is more resilient against this kind of shenanigan, but
> > we're not there in terms of supporting the kind of allocation you
> > want. For this kind of allocation pattern, you'd get all 7 pointers
> > in a single 256-byte node.
>
> Hello Matthew, it's been a couple of kernel releases, so
> following up.
>
> Is Maple tree ready for libfs to use it for managing directory
> offsets?
The feature you are looking for is dense nodes. It will allow for
a compact tree when you have a number of single indexes mapping to
entries (ideal for many ranges of 1).
I'm actively working on dense nodes, which will yield 31 entries per
node when they are single index mappings. I'm hoping to have it
completed in the next few weeks and start beating it up with tests
before pushing it out.
>
> Should we just go for broke and convert libfs from xarray to
> Maple tree now?
We are trying to keep the APIs close between the xarray and the maple
tree, so if you do the conversion to one, switching to the other
shouldn't be much work. I'd try the maple tree to see if the performance is
acceptable today (I may be biased), but I don't know how big of an
effort this conversion would entail.
The maple tree will compress the NULL indexes to a single entry of NULL.
My main concern is the density of information and the number of
allocations the tree will do to keep up with the workload - both will
improve with the dense nodes feature.
If you convert to maple tree, you will get the update for free later as
the node type the tree chooses will be transparent to users.
If you need tagging then you should use the xarray as I haven't started
that feature yet - but I don't think you need that?
I also noticed that Matthew mentioned xa_alloc_cyclic() as the potential
call into the xarray. The maple tree counterpart isn't used much today
and may need to be optimised. If you can verify what this test does, we
could produce a test case for the maple tree test suite and optimise if
necessary.
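To make the scope of a conversion concrete, here is a rough sketch of
what the offset map could look like on top of the maple tree allocation
API. This assumes mtree_alloc_range() behaves as declared in
<linux/maple_tree.h> and hand-rolls the cyclic cursor, since a dedicated
cyclic helper may not be available yet; every name below is illustrative
rather than proposed code:

#include <linux/maple_tree.h>

#define DIR_OFFSET_MIN	2UL	/* 0 and 1 reserved for "." and ".." */

struct offset_mt_ctx {
	struct maple_tree mt;		/* maps directory offset -> dentry */
	unsigned long next_offset;	/* hand-rolled cyclic cursor */
};

static void offset_mt_init(struct offset_mt_ctx *octx)
{
	mt_init_flags(&octx->mt, MT_FLAGS_ALLOC_RANGE);
	octx->next_offset = DIR_OFFSET_MIN;
}

static int offset_mt_add(struct offset_mt_ctx *octx, void *dentry,
			 unsigned long *offset)
{
	int ret;

	/* Try from the cursor first, then wrap around to the low bound. */
	ret = mtree_alloc_range(&octx->mt, offset, dentry, 1,
				octx->next_offset, ULONG_MAX, GFP_KERNEL);
	if (ret == -EBUSY)
		ret = mtree_alloc_range(&octx->mt, offset, dentry, 1,
					DIR_OFFSET_MIN, ULONG_MAX, GFP_KERNEL);
	if (ret == 0)
		octx->next_offset = *offset + 1;
	return ret;
}

static void offset_mt_remove(struct offset_mt_ctx *octx, unsigned long offset)
{
	mtree_erase(&octx->mt, offset);
}

If that shape holds, the directory-offset users in libfs would mostly
need a mechanical replacement of the xarray calls, plus whatever
cyclic-allocation helper ends up being the supported interface.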
Let us know if you have any other questions or need some pointers on how
to get started with a conversion.
Thanks,
Liam
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression
2024-01-05 16:27 ` Liam R. Howlett
@ 2024-01-05 16:33 ` Chuck Lever III
0 siblings, 0 replies; 15+ messages in thread
From: Chuck Lever III @ 2024-01-05 16:33 UTC (permalink / raw)
To: Liam Howlett
Cc: Matthew Wilcox, Feng Tang, Sang, Oliver, oe-lkp, lkp,
Linux Kernel Mailing List, Christian Brauner, linux-mm, Huang,
Ying, Yin, Fengwei
> On Jan 5, 2024, at 11:27 AM, Liam Howlett <liam.howlett@oracle.com> wrote:
>
> * Chuck Lever III <chuck.lever@oracle.com> [240104 14:33]:
>>
>>
>>> On Sep 12, 2023, at 12:01 PM, Matthew Wilcox <willy@infradead.org> wrote:
>>>
>>> On Tue, Sep 12, 2023 at 11:14:42PM +0800, Feng Tang wrote:
>>>>> Well that's the problem. Since I can't run the reproducer, there's
>>>>> nothing I can do to troubleshoot the problem myself.
>>>>
>>>> We dug more into the perf and other profiling data from 0Day server
>>>> running this case, and it seems that the new simple_offset_add()
>>>> called by shmem_mknod() brings extra cost related with slab,
>>>> specifically the 'radix_tree_node', which cause the regression.
>>>>
>>>> Here is some slabinfo diff for commit a2e459555c5f and its parent:
>>>>
>>>> 23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
>>>> ---------------- ---------------------------
>>>>
>>>> 26363 +40.2% 36956 slabinfo.radix_tree_node.active_objs
>>>> 941.00 +40.4% 1321 slabinfo.radix_tree_node.active_slabs
>>>> 26363 +40.3% 37001 slabinfo.radix_tree_node.num_objs
>>>> 941.00 +40.4% 1321 slabinfo.radix_tree_node.num_slabs
>>>
>>> I can't find the benchmark source, but my suspicion is that this
>>> creates and deletes a lot of files in a directory. The 'stable
>>> directory offsets' series uses xa_alloc_cyclic(), so we'll end up
>>> with a very sparse radix tree. ie it'll look something like this:
>>>
>>> 0 - "."
>>> 1 - ".."
>>> 6 - "d"
>>> 27 - "y"
>>> 4000 - "fzz"
>>> 65537 - "czzz"
>>> 643289767 - "bzzzzzz"
>>>
>>> (i didn't work out the names precisely here, but this is approximately
>>> what you'd get if you create files a-z, aa-zz, aaa-zzz, etc and delete
>>> almost all of them)
>>>
>>> The radix tree does not handle this well. It'll allocate one node for:
>>>
>>> entries 0-63 (covers the first 4 entries)
>>> entries 0-4095
>>> entries 3968-4031 (the first 5)
>>> entries 0-262143
>>> entries 65536-69631
>>> entries 65536-65599 (the first 6)
>>> entries 0-16777215
>>> entries 0-1073741823
>>> entries 637534208-654311423
>>> entries 643039232-643301375
>>> entries 643289088-643293183
>>> entries 643289728-643289791 (all 7)
>>>
>>> That ends up being 12 nodes (you get 7 nodes per page) to store 7
>>> pointers. Admittedly to get here, you have to do 643289765 creations
>>> and nearly as many deletions, so are we going to see it in a
>>> non-benchmark situation?
>>>
>>> The maple tree is more resilient against this kind of shenanigan, but
>>> we're not there in terms of supporting the kind of allocation you
>>> want. For this kind of allocation pattern, you'd get all 7 pointers
>>> in a single 256-byte node.
>>
>> Hello Matthew, it's been a couple of kernel releases, so
>> following up.
>>
>> Is Maple tree ready for libfs to use it for managing directory
>> offsets?
>
> The feature you are looking for is dense nodes. It will allow for
> a compact tree when you have a number of single indexes mapping to
> entries (ideal for many ranges of 1).
>
> I'm actively working on dense nodes, which will yield 31 entries per
> node when they are single index mappings. I'm hoping to have it
> completed in the next few weeks and start beating it up with tests
> before pushing it out.
>
>>
>> Should we just go for broke and convert libfs from xarray to
>> Maple tree now?
>
> We are trying to keep the API close for both the xarray and the maple
> tree, so if you do the conversion to one then switching shouldn't be
> much work. I'd try the maple tree to see if the performance is
> acceptable today (I may be biased), but I don't know how big of an
> effort this conversion would entail.
>
> The maple tree will compress the NULL indexes to a single entry of NULL.
> My main concern is the density of information and the number of
> allocations the tree will do to keep up with the workload - both will
> improve with the dense nodes feature.
>
> If you convert to maple tree, you will get the update for free later as
> the node type the tree chooses will be transparent to users.
>
> If you need tagging then you should use the xarray as I haven't started
> that feature yet - but I don't think you need that?
I don't recall using xarray tags for directory offset mapping.
> I also noticed that Matthew mentioned xa_alloc_cyclic() as the potential
> call into the xarray. The maple tree counterpart isn't used much today
> and may need to be optimised. If you can verify what this test does, we
> could produce a test case for the maple tree test suite and optimise if
> necessary.
>
> Let us know if you have any other questions or need some pointers on how
> to get started with a conversion.
Sounds like conversion is worth starting on, at least. I'll try
to clear some time to work on it.
--
Chuck Lever
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2024-01-05 16:33 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-08 5:26 [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression kernel test robot
2023-09-08 14:43 ` Chuck Lever III
2023-09-12 1:25 ` Oliver Sang
2023-09-12 13:01 ` Chuck Lever III
2023-09-12 13:19 ` Oliver Sang
2023-09-12 15:14 ` Feng Tang
2023-09-12 15:26 ` Chuck Lever III
2023-09-12 16:01 ` Matthew Wilcox
2023-09-12 16:27 ` Chuck Lever III
2023-09-13 17:45 ` Chuck Lever III
2024-01-04 19:33 ` Chuck Lever III
2024-01-05 16:27 ` Liam R. Howlett
2024-01-05 16:33 ` Chuck Lever III
2023-09-13 6:47 ` Feng Tang
2023-09-13 13:32 ` Chuck Lever III