From: Vlastimil Babka <vbabka@suse.cz>
To: kernel test robot <oliver.sang@intel.com>,
Hyunmin Lee <hyunminlr@gmail.com>
Cc: oe-lkp@lists.linux.dev, lkp@intel.com,
linux-kernel@vger.kernel.org, Jeungwoo Yoo <casionwoo@gmail.com>,
Sangyun Kim <sangyun.kim@snu.ac.kr>,
Hyeonggon Yoo <42.hyeyoo@gmail.com>,
Gwan-gyeong Mun <gwan-gyeong.mun@intel.com>,
Christoph Lameter <cl@linux.com>,
David Rientjes <rientjes@google.com>,
linux-mm@kvack.org, ying.huang@intel.com, feng.tang@intel.com,
fengwei.yin@intel.com
Subject: Re: [linus:master] [mm/slub] 306c4ac989: stress-ng.seal.ops_per_sec 5.2% improvement
Date: Thu, 25 Jul 2024 12:11:45 +0200 [thread overview]
Message-ID: <a939565a-cdab-4d8b-938e-38e3d837d653@suse.cz> (raw)
In-Reply-To: <202407251553.12f35198-oliver.sang@intel.com>
On 7/25/24 10:04 AM, kernel test robot wrote:
>
>
> Hello,
>
> kernel test robot noticed a 5.2% improvement of stress-ng.seal.ops_per_sec on:
>
>
> commit: 306c4ac9896b07b8872293eb224058ff83f81fac ("mm/slub: create kmalloc 96 and 192 caches regardless cache size order")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
Well that's great news, but it's also highly unlikely that the commit would
cause such an improvement, as it only optimizes a once-per-boot operation in
create_kmalloc_caches(). Maybe there are secondary effects from a different
order of slab cache creation resulting in a different cpu cache layout, but
such an improvement would be machine- and compiler-specific and overall fragile.
> testcase: stress-ng
> test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 256G memory
> parameters:
>
> nr_threads: 100%
> testtime: 60s
> test: seal
> cpufreq_governor: performance
>
>
> Details are as below:
> -------------------------------------------------------------------------------------------------->
>
>
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20240725/202407251553.12f35198-oliver.sang@intel.com
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
> gcc-13/performance/x86_64-rhel-8.3/100%/debian-12-x86_64-20240206.cgz/lkp-spr-r02/seal/stress-ng/60s
>
> commit:
> 844776cb65 ("mm/slub: mark racy access on slab->freelist")
> 306c4ac989 ("mm/slub: create kmalloc 96 and 192 caches regardless cache size order")
>
> 844776cb65a77ef2 306c4ac9896b07b8872293eb224
> ---------------- ---------------------------
> %stddev %change %stddev
> \ | \
> 2.51 ± 27% +1.9 4.44 ± 35% mpstat.cpu.all.idle%
> 975100 ± 19% +29.5% 1262643 ± 16% numa-meminfo.node1.AnonPages.max
> 187.06 ± 4% -11.5% 165.63 ± 10% sched_debug.cfs_rq:/.runnable_avg.stddev
> 0.05 ± 18% -40.0% 0.03 ± 58% vmstat.procs.b
> 58973718 +5.2% 62024061 stress-ng.seal.ops
> 982893 +5.2% 1033732 stress-ng.seal.ops_per_sec
> 59045344 +5.2% 62095668 stress-ng.time.minor_page_faults
> 174957 +1.4% 177400 proc-vmstat.nr_slab_unreclaimable
> 63634761 +5.5% 67148443 proc-vmstat.numa_hit
> 63399995 +5.5% 66914221 proc-vmstat.numa_local
> 73601172 +6.1% 78073549 proc-vmstat.pgalloc_normal
> 59870250 +5.3% 63063514 proc-vmstat.pgfault
> 72718474 +6.0% 77106313 proc-vmstat.pgfree
> 1.983e+10 +1.3% 2.01e+10 perf-stat.i.branch-instructions
> 66023349 +5.6% 69728143 perf-stat.i.cache-misses
> 2.023e+08 +4.7% 2.117e+08 perf-stat.i.cache-references
> 7.22 -1.9% 7.08 perf-stat.i.cpi
> 9738 -5.6% 9196 perf-stat.i.cycles-between-cache-misses
> 8.799e+10 +1.6% 8.939e+10 perf-stat.i.instructions
> 0.14 +1.6% 0.14 perf-stat.i.ipc
> 8.71 +5.1% 9.16 perf-stat.i.metric.K/sec
> 983533 +4.7% 1029816 perf-stat.i.minor-faults
> 983533 +4.7% 1029816 perf-stat.i.page-faults
> 7.30 -18.4% 5.96 ± 44% perf-stat.overall.cpi
> 9735 -21.3% 7658 ± 44% perf-stat.overall.cycles-between-cache-misses
> 0.52 +0.1 0.62 ± 7% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.ftruncate64
> 0.56 +0.1 0.67 ± 7% perf-profile.calltrace.cycles-pp.ftruncate64
> 0.34 ± 70% +0.3 0.60 ± 7% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.ftruncate64
> 48.29 +0.6 48.86 perf-profile.calltrace.cycles-pp.__close
> 48.27 +0.6 48.84 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
> 48.27 +0.6 48.84 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__close
> 48.26 +0.6 48.83 perf-profile.calltrace.cycles-pp.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
> 0.00 +0.6 0.58 ± 7% perf-profile.calltrace.cycles-pp.__x64_sys_ftruncate.do_syscall_64.entry_SYSCALL_64_after_hwframe.ftruncate64
> 48.21 +0.6 48.80 perf-profile.calltrace.cycles-pp.__fput.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
> 48.03 +0.6 48.68 perf-profile.calltrace.cycles-pp.dput.__fput.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe
> 48.02 +0.6 48.66 perf-profile.calltrace.cycles-pp.__dentry_kill.dput.__fput.__x64_sys_close.do_syscall_64
> 47.76 +0.7 48.47 perf-profile.calltrace.cycles-pp.evict.__dentry_kill.dput.__fput.__x64_sys_close
> 47.19 +0.7 47.92 perf-profile.calltrace.cycles-pp._raw_spin_lock.evict.__dentry_kill.dput.__fput
> 47.11 +0.8 47.88 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.evict.__dentry_kill.dput
> 0.74 -0.3 0.48 ± 8% perf-profile.children.cycles-pp.__munmap
> 0.69 -0.2 0.44 ± 9% perf-profile.children.cycles-pp.__x64_sys_munmap
> 0.68 -0.2 0.44 ± 9% perf-profile.children.cycles-pp.__vm_munmap
> 0.68 -0.2 0.45 ± 9% perf-profile.children.cycles-pp.do_vmi_munmap
> 0.65 -0.2 0.42 ± 8% perf-profile.children.cycles-pp.do_vmi_align_munmap
> 0.44 -0.2 0.28 ± 7% perf-profile.children.cycles-pp.unmap_region
> 0.48 -0.1 0.36 ± 7% perf-profile.children.cycles-pp.asm_exc_page_fault
> 0.42 -0.1 0.32 ± 7% perf-profile.children.cycles-pp.do_user_addr_fault
> 0.42 ± 2% -0.1 0.32 ± 7% perf-profile.children.cycles-pp.exc_page_fault
> 0.38 ± 2% -0.1 0.29 ± 7% perf-profile.children.cycles-pp.handle_mm_fault
> 0.35 ± 2% -0.1 0.27 ± 7% perf-profile.children.cycles-pp.__handle_mm_fault
> 0.33 ± 2% -0.1 0.26 ± 6% perf-profile.children.cycles-pp.do_fault
> 0.21 ± 2% -0.1 0.14 ± 8% perf-profile.children.cycles-pp.lru_add_drain
> 0.22 -0.1 0.15 ± 11% perf-profile.children.cycles-pp.alloc_inode
> 0.21 ± 2% -0.1 0.15 ± 9% perf-profile.children.cycles-pp.lru_add_drain_cpu
> 0.18 ± 2% -0.1 0.12 ± 8% perf-profile.children.cycles-pp.unmap_vmas
> 0.21 ± 2% -0.1 0.14 ± 7% perf-profile.children.cycles-pp.folio_batch_move_lru
> 0.17 -0.1 0.11 ± 8% perf-profile.children.cycles-pp.unmap_page_range
> 0.16 ± 2% -0.1 0.10 ± 9% perf-profile.children.cycles-pp.zap_pte_range
> 0.16 ± 2% -0.1 0.10 ± 9% perf-profile.children.cycles-pp.zap_pmd_range
> 0.26 ± 2% -0.1 0.20 ± 7% perf-profile.children.cycles-pp.shmem_fault
> 0.50 -0.1 0.45 ± 8% perf-profile.children.cycles-pp.mmap_region
> 0.26 ± 2% -0.1 0.20 ± 7% perf-profile.children.cycles-pp.__do_fault
> 0.26 -0.1 0.21 ± 6% perf-profile.children.cycles-pp.shmem_get_folio_gfp
> 0.19 ± 2% -0.1 0.14 ± 14% perf-profile.children.cycles-pp.write
> 0.22 ± 3% -0.0 0.18 ± 5% perf-profile.children.cycles-pp.shmem_alloc_and_add_folio
> 0.11 ± 4% -0.0 0.07 ± 10% perf-profile.children.cycles-pp.mas_store_gfp
> 0.16 ± 2% -0.0 0.12 ± 11% perf-profile.children.cycles-pp.mas_wr_store_entry
> 0.14 -0.0 0.10 ± 10% perf-profile.children.cycles-pp.mas_wr_node_store
> 0.08 -0.0 0.04 ± 45% perf-profile.children.cycles-pp.msync
> 0.06 -0.0 0.02 ± 99% perf-profile.children.cycles-pp.mas_find
> 0.12 ± 4% -0.0 0.08 ± 11% perf-profile.children.cycles-pp.inode_init_always
> 0.10 ± 3% -0.0 0.07 ± 11% perf-profile.children.cycles-pp.shmem_alloc_inode
> 0.16 -0.0 0.13 ± 9% perf-profile.children.cycles-pp.__x64_sys_fcntl
> 0.11 ± 4% -0.0 0.08 ± 11% perf-profile.children.cycles-pp.shmem_file_write_iter
> 0.10 ± 4% -0.0 0.08 ± 8% perf-profile.children.cycles-pp.do_fcntl
> 0.15 -0.0 0.13 ± 8% perf-profile.children.cycles-pp.destroy_inode
> 0.16 ± 3% -0.0 0.14 ± 7% perf-profile.children.cycles-pp.folio_lruvec_lock_irqsave
> 0.22 ± 3% -0.0 0.20 ± 5% perf-profile.children.cycles-pp._raw_spin_lock_irqsave
> 0.08 -0.0 0.06 ± 11% perf-profile.children.cycles-pp.___slab_alloc
> 0.15 ± 3% -0.0 0.12 ± 8% perf-profile.children.cycles-pp.__destroy_inode
> 0.07 ± 7% -0.0 0.04 ± 45% perf-profile.children.cycles-pp.__call_rcu_common
> 0.13 ± 2% -0.0 0.11 ± 8% perf-profile.children.cycles-pp.perf_event_mmap
> 0.09 -0.0 0.07 ± 9% perf-profile.children.cycles-pp.memfd_fcntl
> 0.06 -0.0 0.04 ± 44% perf-profile.children.cycles-pp.native_irq_return_iret
> 0.08 ± 6% -0.0 0.06 ± 8% perf-profile.children.cycles-pp.shmem_add_to_page_cache
> 0.12 -0.0 0.10 ± 6% perf-profile.children.cycles-pp.perf_event_mmap_event
> 0.11 ± 3% -0.0 0.09 ± 7% perf-profile.children.cycles-pp.__lruvec_stat_mod_folio
> 0.10 -0.0 0.08 ± 8% perf-profile.children.cycles-pp.uncharge_batch
> 0.12 ± 4% -0.0 0.10 ± 6% perf-profile.children.cycles-pp.entry_SYSCALL_64
> 0.05 +0.0 0.07 ± 5% perf-profile.children.cycles-pp.__d_alloc
> 0.05 +0.0 0.07 ± 10% perf-profile.children.cycles-pp.d_alloc_pseudo
> 0.07 +0.0 0.09 ± 7% perf-profile.children.cycles-pp.file_init_path
> 0.06 ± 6% +0.0 0.08 ± 8% perf-profile.children.cycles-pp.security_file_alloc
> 0.07 ± 7% +0.0 0.09 ± 7% perf-profile.children.cycles-pp.errseq_sample
> 0.04 ± 44% +0.0 0.07 ± 10% perf-profile.children.cycles-pp.apparmor_file_alloc_security
> 0.09 +0.0 0.12 ± 5% perf-profile.children.cycles-pp.init_file
> 0.15 +0.0 0.18 ± 7% perf-profile.children.cycles-pp.common_perm_cond
> 0.15 ± 3% +0.0 0.19 ± 8% perf-profile.children.cycles-pp.security_file_truncate
> 0.20 +0.0 0.24 ± 7% perf-profile.children.cycles-pp.notify_change
> 0.06 +0.0 0.10 ± 6% perf-profile.children.cycles-pp.inode_init_owner
> 0.13 +0.0 0.18 ± 5% perf-profile.children.cycles-pp.alloc_empty_file
> 0.10 +0.1 0.16 ± 7% perf-profile.children.cycles-pp.clear_nlink
> 0.47 +0.1 0.56 ± 7% perf-profile.children.cycles-pp.do_ftruncate
> 0.49 +0.1 0.59 ± 7% perf-profile.children.cycles-pp.__x64_sys_ftruncate
> 0.59 +0.1 0.70 ± 7% perf-profile.children.cycles-pp.ftruncate64
> 0.28 +0.1 0.40 ± 6% perf-profile.children.cycles-pp.alloc_file_pseudo
> 98.62 +0.2 98.77 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
> 98.58 +0.2 98.74 perf-profile.children.cycles-pp.do_syscall_64
> 48.30 +0.6 48.86 perf-profile.children.cycles-pp.__close
> 48.26 +0.6 48.83 perf-profile.children.cycles-pp.__x64_sys_close
> 48.21 +0.6 48.80 perf-profile.children.cycles-pp.__fput
> 48.04 +0.6 48.68 perf-profile.children.cycles-pp.dput
> 48.02 +0.6 48.67 perf-profile.children.cycles-pp.__dentry_kill
> 47.77 +0.7 48.47 perf-profile.children.cycles-pp.evict
> 0.30 -0.1 0.23 ± 7% perf-profile.self.cycles-pp._raw_spin_lock
> 0.10 ± 4% -0.0 0.06 ± 7% perf-profile.self.cycles-pp.__fput
> 0.08 ± 6% -0.0 0.05 ± 8% perf-profile.self.cycles-pp.inode_init_always
> 0.06 -0.0 0.04 ± 44% perf-profile.self.cycles-pp.native_irq_return_iret
> 0.08 -0.0 0.06 ± 7% perf-profile.self.cycles-pp._raw_spin_lock_irqsave
> 0.09 -0.0 0.08 ± 4% perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
> 0.07 +0.0 0.09 ± 7% perf-profile.self.cycles-pp.__shmem_get_inode
> 0.06 ± 7% +0.0 0.09 ± 9% perf-profile.self.cycles-pp.errseq_sample
> 0.15 ± 2% +0.0 0.18 ± 7% perf-profile.self.cycles-pp.common_perm_cond
> 0.03 ± 70% +0.0 0.06 ± 7% perf-profile.self.cycles-pp.apparmor_file_alloc_security
> 0.06 +0.0 0.10 ± 7% perf-profile.self.cycles-pp.inode_init_owner
> 0.10 +0.1 0.16 ± 6% perf-profile.self.cycles-pp.clear_nlink
>
>
> Disclaimer:
> Results have been estimated based on internal Intel analysis and are provided
> for informational purposes only. Any difference in system hardware or software
> design or configuration may affect actual performance.
>
>