linux-mm.kvack.org archive mirror
From: Vlastimil Babka <vbabka@suse.cz>
To: kernel test robot <oliver.sang@intel.com>,
	Hyunmin Lee <hyunminlr@gmail.com>
Cc: oe-lkp@lists.linux.dev, lkp@intel.com,
	linux-kernel@vger.kernel.org, Jeungwoo Yoo <casionwoo@gmail.com>,
	Sangyun Kim <sangyun.kim@snu.ac.kr>,
	Hyeonggon Yoo <42.hyeyoo@gmail.com>,
	Gwan-gyeong Mun <gwan-gyeong.mun@intel.com>,
	Christoph Lameter <cl@linux.com>,
	David Rientjes <rientjes@google.com>,
	linux-mm@kvack.org, ying.huang@intel.com, feng.tang@intel.com,
	fengwei.yin@intel.com
Subject: Re: [linus:master] [mm/slub] 306c4ac989: stress-ng.seal.ops_per_sec 5.2% improvement
Date: Thu, 25 Jul 2024 12:11:45 +0200	[thread overview]
Message-ID: <a939565a-cdab-4d8b-938e-38e3d837d653@suse.cz> (raw)
In-Reply-To: <202407251553.12f35198-oliver.sang@intel.com>

On 7/25/24 10:04 AM, kernel test robot wrote:
> 
> 
> Hello,
> 
> kernel test robot noticed a 5.2% improvement of stress-ng.seal.ops_per_sec on:
> 
> 
> commit: 306c4ac9896b07b8872293eb224058ff83f81fac ("mm/slub: create kmalloc 96 and 192 caches regardless cache size order")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

Well, that's great news, but it's also highly unlikely that the commit would
cause such an improvement, as it only optimizes a once-per-boot operation in
create_kmalloc_caches(). Maybe there are secondary effects from a different
order of slab cache creation resulting in a different cpu cache layout, but
such an improvement would be machine- and compiler-specific and overall fragile.
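
For context, a rough sketch of the kind of restructuring the commit does (from
memory, not the exact upstream diff; new_kmalloc_cache() and the index
constants follow mainline, but the details may differ). Before, the
non-power-of-two caches kmalloc-96 and kmalloc-192 (indices 1 and 2) were
created from inside the loop over power-of-two sizes, right after kmalloc-64
and kmalloc-128; after the change they are created up front, regardless of the
order of the power-of-two caches:

	/* old: sprinkle 96/192 creation into the power-of-two loop */
	for (type = KMALLOC_NORMAL; type < NR_KMALLOC_TYPES; type++) {
		for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) {
			if (!kmalloc_caches[type][i])
				new_kmalloc_cache(i, type);
			if (KMALLOC_MIN_SIZE <= 32 && i == 6 &&
					!kmalloc_caches[type][1])
				new_kmalloc_cache(1, type);
			if (KMALLOC_MIN_SIZE <= 64 && i == 7 &&
					!kmalloc_caches[type][2])
				new_kmalloc_cache(2, type);
		}
	}

	/* new: create kmalloc-96/192 unconditionally, then the rest */
	for (type = KMALLOC_NORMAL; type < NR_KMALLOC_TYPES; type++) {
		if (KMALLOC_MIN_SIZE <= 32)
			new_kmalloc_cache(1, type);
		if (KMALLOC_MIN_SIZE <= 64)
			new_kmalloc_cache(2, type);

		for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++)
			new_kmalloc_cache(i, type);
	}

Either way this runs once during boot, so the steady-state kmalloc()/kfree()
fast paths are untouched; at most the order in which the caches (and their
backing structures) get allocated changes, which is where any cache-layout
side effect would have to come from.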

> testcase: stress-ng
> test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 256G memory
> parameters:
> 
> 	nr_threads: 100%
> 	testtime: 60s
> 	test: seal
> 	cpufreq_governor: performance
> 
> 
> 
> 
> 
> 
> Details are as below:
> -------------------------------------------------------------------------------------------------->
> 
> 
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20240725/202407251553.12f35198-oliver.sang@intel.com
> 
> =========================================================================================
> compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
>   gcc-13/performance/x86_64-rhel-8.3/100%/debian-12-x86_64-20240206.cgz/lkp-spr-r02/seal/stress-ng/60s
> 
> commit: 
>   844776cb65 ("mm/slub: mark racy access on slab->freelist")
>   306c4ac989 ("mm/slub: create kmalloc 96 and 192 caches regardless cache size order")
> 
> 844776cb65a77ef2 306c4ac9896b07b8872293eb224 
> ---------------- --------------------------- 
>          %stddev     %change         %stddev
>              \          |                \  
>       2.51 ± 27%      +1.9        4.44 ± 35%  mpstat.cpu.all.idle%
>     975100 ± 19%     +29.5%    1262643 ± 16%  numa-meminfo.node1.AnonPages.max
>     187.06 ±  4%     -11.5%     165.63 ± 10%  sched_debug.cfs_rq:/.runnable_avg.stddev
>       0.05 ± 18%     -40.0%       0.03 ± 58%  vmstat.procs.b
>   58973718            +5.2%   62024061        stress-ng.seal.ops
>     982893            +5.2%    1033732        stress-ng.seal.ops_per_sec
>   59045344            +5.2%   62095668        stress-ng.time.minor_page_faults
>     174957            +1.4%     177400        proc-vmstat.nr_slab_unreclaimable
>   63634761            +5.5%   67148443        proc-vmstat.numa_hit
>   63399995            +5.5%   66914221        proc-vmstat.numa_local
>   73601172            +6.1%   78073549        proc-vmstat.pgalloc_normal
>   59870250            +5.3%   63063514        proc-vmstat.pgfault
>   72718474            +6.0%   77106313        proc-vmstat.pgfree
>  1.983e+10            +1.3%   2.01e+10        perf-stat.i.branch-instructions
>   66023349            +5.6%   69728143        perf-stat.i.cache-misses
>  2.023e+08            +4.7%  2.117e+08        perf-stat.i.cache-references
>       7.22            -1.9%       7.08        perf-stat.i.cpi
>       9738            -5.6%       9196        perf-stat.i.cycles-between-cache-misses
>  8.799e+10            +1.6%  8.939e+10        perf-stat.i.instructions
>       0.14            +1.6%       0.14        perf-stat.i.ipc
>       8.71            +5.1%       9.16        perf-stat.i.metric.K/sec
>     983533            +4.7%    1029816        perf-stat.i.minor-faults
>     983533            +4.7%    1029816        perf-stat.i.page-faults
>       7.30           -18.4%       5.96 ± 44%  perf-stat.overall.cpi
>       9735           -21.3%       7658 ± 44%  perf-stat.overall.cycles-between-cache-misses
>       0.52            +0.1        0.62 ±  7%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.ftruncate64
>       0.56            +0.1        0.67 ±  7%  perf-profile.calltrace.cycles-pp.ftruncate64
>       0.34 ± 70%      +0.3        0.60 ±  7%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.ftruncate64
>      48.29            +0.6       48.86        perf-profile.calltrace.cycles-pp.__close
>      48.27            +0.6       48.84        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
>      48.27            +0.6       48.84        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__close
>      48.26            +0.6       48.83        perf-profile.calltrace.cycles-pp.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
>       0.00            +0.6        0.58 ±  7%  perf-profile.calltrace.cycles-pp.__x64_sys_ftruncate.do_syscall_64.entry_SYSCALL_64_after_hwframe.ftruncate64
>      48.21            +0.6       48.80        perf-profile.calltrace.cycles-pp.__fput.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
>      48.03            +0.6       48.68        perf-profile.calltrace.cycles-pp.dput.__fput.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe
>      48.02            +0.6       48.66        perf-profile.calltrace.cycles-pp.__dentry_kill.dput.__fput.__x64_sys_close.do_syscall_64
>      47.76            +0.7       48.47        perf-profile.calltrace.cycles-pp.evict.__dentry_kill.dput.__fput.__x64_sys_close
>      47.19            +0.7       47.92        perf-profile.calltrace.cycles-pp._raw_spin_lock.evict.__dentry_kill.dput.__fput
>      47.11            +0.8       47.88        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.evict.__dentry_kill.dput
>       0.74            -0.3        0.48 ±  8%  perf-profile.children.cycles-pp.__munmap
>       0.69            -0.2        0.44 ±  9%  perf-profile.children.cycles-pp.__x64_sys_munmap
>       0.68            -0.2        0.44 ±  9%  perf-profile.children.cycles-pp.__vm_munmap
>       0.68            -0.2        0.45 ±  9%  perf-profile.children.cycles-pp.do_vmi_munmap
>       0.65            -0.2        0.42 ±  8%  perf-profile.children.cycles-pp.do_vmi_align_munmap
>       0.44            -0.2        0.28 ±  7%  perf-profile.children.cycles-pp.unmap_region
>       0.48            -0.1        0.36 ±  7%  perf-profile.children.cycles-pp.asm_exc_page_fault
>       0.42            -0.1        0.32 ±  7%  perf-profile.children.cycles-pp.do_user_addr_fault
>       0.42 ±  2%      -0.1        0.32 ±  7%  perf-profile.children.cycles-pp.exc_page_fault
>       0.38 ±  2%      -0.1        0.29 ±  7%  perf-profile.children.cycles-pp.handle_mm_fault
>       0.35 ±  2%      -0.1        0.27 ±  7%  perf-profile.children.cycles-pp.__handle_mm_fault
>       0.33 ±  2%      -0.1        0.26 ±  6%  perf-profile.children.cycles-pp.do_fault
>       0.21 ±  2%      -0.1        0.14 ±  8%  perf-profile.children.cycles-pp.lru_add_drain
>       0.22            -0.1        0.15 ± 11%  perf-profile.children.cycles-pp.alloc_inode
>       0.21 ±  2%      -0.1        0.15 ±  9%  perf-profile.children.cycles-pp.lru_add_drain_cpu
>       0.18 ±  2%      -0.1        0.12 ±  8%  perf-profile.children.cycles-pp.unmap_vmas
>       0.21 ±  2%      -0.1        0.14 ±  7%  perf-profile.children.cycles-pp.folio_batch_move_lru
>       0.17            -0.1        0.11 ±  8%  perf-profile.children.cycles-pp.unmap_page_range
>       0.16 ±  2%      -0.1        0.10 ±  9%  perf-profile.children.cycles-pp.zap_pte_range
>       0.16 ±  2%      -0.1        0.10 ±  9%  perf-profile.children.cycles-pp.zap_pmd_range
>       0.26 ±  2%      -0.1        0.20 ±  7%  perf-profile.children.cycles-pp.shmem_fault
>       0.50            -0.1        0.45 ±  8%  perf-profile.children.cycles-pp.mmap_region
>       0.26 ±  2%      -0.1        0.20 ±  7%  perf-profile.children.cycles-pp.__do_fault
>       0.26            -0.1        0.21 ±  6%  perf-profile.children.cycles-pp.shmem_get_folio_gfp
>       0.19 ±  2%      -0.1        0.14 ± 14%  perf-profile.children.cycles-pp.write
>       0.22 ±  3%      -0.0        0.18 ±  5%  perf-profile.children.cycles-pp.shmem_alloc_and_add_folio
>       0.11 ±  4%      -0.0        0.07 ± 10%  perf-profile.children.cycles-pp.mas_store_gfp
>       0.16 ±  2%      -0.0        0.12 ± 11%  perf-profile.children.cycles-pp.mas_wr_store_entry
>       0.14            -0.0        0.10 ± 10%  perf-profile.children.cycles-pp.mas_wr_node_store
>       0.08            -0.0        0.04 ± 45%  perf-profile.children.cycles-pp.msync
>       0.06            -0.0        0.02 ± 99%  perf-profile.children.cycles-pp.mas_find
>       0.12 ±  4%      -0.0        0.08 ± 11%  perf-profile.children.cycles-pp.inode_init_always
>       0.10 ±  3%      -0.0        0.07 ± 11%  perf-profile.children.cycles-pp.shmem_alloc_inode
>       0.16            -0.0        0.13 ±  9%  perf-profile.children.cycles-pp.__x64_sys_fcntl
>       0.11 ±  4%      -0.0        0.08 ± 11%  perf-profile.children.cycles-pp.shmem_file_write_iter
>       0.10 ±  4%      -0.0        0.08 ±  8%  perf-profile.children.cycles-pp.do_fcntl
>       0.15            -0.0        0.13 ±  8%  perf-profile.children.cycles-pp.destroy_inode
>       0.16 ±  3%      -0.0        0.14 ±  7%  perf-profile.children.cycles-pp.folio_lruvec_lock_irqsave
>       0.22 ±  3%      -0.0        0.20 ±  5%  perf-profile.children.cycles-pp._raw_spin_lock_irqsave
>       0.08            -0.0        0.06 ± 11%  perf-profile.children.cycles-pp.___slab_alloc
>       0.15 ±  3%      -0.0        0.12 ±  8%  perf-profile.children.cycles-pp.__destroy_inode
>       0.07 ±  7%      -0.0        0.04 ± 45%  perf-profile.children.cycles-pp.__call_rcu_common
>       0.13 ±  2%      -0.0        0.11 ±  8%  perf-profile.children.cycles-pp.perf_event_mmap
>       0.09            -0.0        0.07 ±  9%  perf-profile.children.cycles-pp.memfd_fcntl
>       0.06            -0.0        0.04 ± 44%  perf-profile.children.cycles-pp.native_irq_return_iret
>       0.08 ±  6%      -0.0        0.06 ±  8%  perf-profile.children.cycles-pp.shmem_add_to_page_cache
>       0.12            -0.0        0.10 ±  6%  perf-profile.children.cycles-pp.perf_event_mmap_event
>       0.11 ±  3%      -0.0        0.09 ±  7%  perf-profile.children.cycles-pp.__lruvec_stat_mod_folio
>       0.10            -0.0        0.08 ±  8%  perf-profile.children.cycles-pp.uncharge_batch
>       0.12 ±  4%      -0.0        0.10 ±  6%  perf-profile.children.cycles-pp.entry_SYSCALL_64
>       0.05            +0.0        0.07 ±  5%  perf-profile.children.cycles-pp.__d_alloc
>       0.05            +0.0        0.07 ± 10%  perf-profile.children.cycles-pp.d_alloc_pseudo
>       0.07            +0.0        0.09 ±  7%  perf-profile.children.cycles-pp.file_init_path
>       0.06 ±  6%      +0.0        0.08 ±  8%  perf-profile.children.cycles-pp.security_file_alloc
>       0.07 ±  7%      +0.0        0.09 ±  7%  perf-profile.children.cycles-pp.errseq_sample
>       0.04 ± 44%      +0.0        0.07 ± 10%  perf-profile.children.cycles-pp.apparmor_file_alloc_security
>       0.09            +0.0        0.12 ±  5%  perf-profile.children.cycles-pp.init_file
>       0.15            +0.0        0.18 ±  7%  perf-profile.children.cycles-pp.common_perm_cond
>       0.15 ±  3%      +0.0        0.19 ±  8%  perf-profile.children.cycles-pp.security_file_truncate
>       0.20            +0.0        0.24 ±  7%  perf-profile.children.cycles-pp.notify_change
>       0.06            +0.0        0.10 ±  6%  perf-profile.children.cycles-pp.inode_init_owner
>       0.13            +0.0        0.18 ±  5%  perf-profile.children.cycles-pp.alloc_empty_file
>       0.10            +0.1        0.16 ±  7%  perf-profile.children.cycles-pp.clear_nlink
>       0.47            +0.1        0.56 ±  7%  perf-profile.children.cycles-pp.do_ftruncate
>       0.49            +0.1        0.59 ±  7%  perf-profile.children.cycles-pp.__x64_sys_ftruncate
>       0.59            +0.1        0.70 ±  7%  perf-profile.children.cycles-pp.ftruncate64
>       0.28            +0.1        0.40 ±  6%  perf-profile.children.cycles-pp.alloc_file_pseudo
>      98.62            +0.2       98.77        perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
>      98.58            +0.2       98.74        perf-profile.children.cycles-pp.do_syscall_64
>      48.30            +0.6       48.86        perf-profile.children.cycles-pp.__close
>      48.26            +0.6       48.83        perf-profile.children.cycles-pp.__x64_sys_close
>      48.21            +0.6       48.80        perf-profile.children.cycles-pp.__fput
>      48.04            +0.6       48.68        perf-profile.children.cycles-pp.dput
>      48.02            +0.6       48.67        perf-profile.children.cycles-pp.__dentry_kill
>      47.77            +0.7       48.47        perf-profile.children.cycles-pp.evict
>       0.30            -0.1        0.23 ±  7%  perf-profile.self.cycles-pp._raw_spin_lock
>       0.10 ±  4%      -0.0        0.06 ±  7%  perf-profile.self.cycles-pp.__fput
>       0.08 ±  6%      -0.0        0.05 ±  8%  perf-profile.self.cycles-pp.inode_init_always
>       0.06            -0.0        0.04 ± 44%  perf-profile.self.cycles-pp.native_irq_return_iret
>       0.08            -0.0        0.06 ±  7%  perf-profile.self.cycles-pp._raw_spin_lock_irqsave
>       0.09            -0.0        0.08 ±  4%  perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
>       0.07            +0.0        0.09 ±  7%  perf-profile.self.cycles-pp.__shmem_get_inode
>       0.06 ±  7%      +0.0        0.09 ±  9%  perf-profile.self.cycles-pp.errseq_sample
>       0.15 ±  2%      +0.0        0.18 ±  7%  perf-profile.self.cycles-pp.common_perm_cond
>       0.03 ± 70%      +0.0        0.06 ±  7%  perf-profile.self.cycles-pp.apparmor_file_alloc_security
>       0.06            +0.0        0.10 ±  7%  perf-profile.self.cycles-pp.inode_init_owner
>       0.10            +0.1        0.16 ±  6%  perf-profile.self.cycles-pp.clear_nlink
> 
> 
> 
> 
> Disclaimer:
> Results have been estimated based on internal Intel analysis and are provided
> for informational purposes only. Any difference in system hardware or software
> design or configuration may affect actual performance.
> 
> 




Thread overview: 2+ messages
2024-07-25  8:04 kernel test robot
2024-07-25 10:11 ` Vlastimil Babka [this message]
