From: Yang Shi <shy828301@gmail.com>
To: Yin Fengwei <fengwei.yin@intel.com>
Cc: kernel test robot <oliver.sang@intel.com>,
Rik van Riel <riel@surriel.com>,
oe-lkp@lists.linux.dev, lkp@intel.com,
Linux Memory Management List <linux-mm@kvack.org>,
Andrew Morton <akpm@linux-foundation.org>,
Matthew Wilcox <willy@infradead.org>,
Christopher Lameter <cl@linux.com>,
ying.huang@intel.com, feng.tang@intel.com
Subject: Re: [linux-next:master] [mm] 1111d46b5c: stress-ng.pthread.ops_per_sec -84.3% regression
Date: Wed, 20 Dec 2023 16:26:10 -0800
Message-ID: <CAHbLzkrd-y2=KHS-zreH8FUmQtOaf_FGuyj78zNNH3FbpCQNnA@mail.gmail.com>
In-Reply-To: <CAHbLzkrAdZ4GrtnH0NUhwPm=gZzkaGT96xVbiyOQaJ3uCFRDnw@mail.gmail.com>
On Wed, Dec 20, 2023 at 12:09 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Wed, Dec 20, 2023 at 12:34 AM Yin Fengwei <fengwei.yin@intel.com> wrote:
> >
> >
> >
> > On 2023/12/20 13:27, Yang Shi wrote:
> > > On Tue, Dec 19, 2023 at 7:41 AM kernel test robot <oliver.sang@intel.com> wrote:
> > >>
> > >>
> > >>
> > >> Hello,
> > >>
> > >> for this commit, we reported
> > >> "[mm] 96db82a66d: will-it-scale.per_process_ops -95.3% regression"
> > >> in Aug 2022, when it was in linux-next/master:
> > >> https://lore.kernel.org/all/YwIoiIYo4qsYBcgd@xsang-OptiPlex-9020/
> > >>
> > >> later, we reported
> > >> "[mm] f35b5d7d67: will-it-scale.per_process_ops -95.5% regression"
> > >> in Oct 2022, when it was in linus/master:
> > >> https://lore.kernel.org/all/202210181535.7144dd15-yujie.liu@intel.com/
> > >>
> > >> and the commit was reverted finally by
> > >> commit 0ba09b1733878afe838fe35c310715fda3d46428
> > >> Author: Linus Torvalds <torvalds@linux-foundation.org>
> > >> Date: Sun Dec 4 12:51:59 2022 -0800
> > >>
> > >> now we have noticed it going into linux-next/master again.
> > >>
> > >> we are not sure if there is an agreement that the benefit of this commit
> > >> already outweighs the performance drop in some micro benchmarks.
> > >>
> > >> we also noticed from https://lore.kernel.org/all/20231214223423.1133074-1-yang@os.amperecomputing.com/
> > >> that
> > >> "This patch was applied to v6.1, but was reverted due to a regression
> > >> report. However it turned out the regression was not due to this patch.
> > >> I ping'ed Andrew to reapply this patch, Andrew may forget it. This
> > >> patch helps promote THP, so I rebased it onto the latest mm-unstable."
> > >
> > > IIRC, Huang Ying's analysis showed the regression in the will-it-scale
> > > micro benchmark was acceptable; the patch was actually reverted due to
> > > a kernel build regression with LLVM reported by Nathan Chancellor.
> > > That regression was later resolved by commit
> > > 81e506bec9be1eceaf5a2c654e28ba5176ef48d8 ("mm/thp: check and bail out
> > > if page in deferred queue already"). And this patch did improve the
> > > kernel build with GCC by ~3%, if I remember correctly.
> > >
> > >>
> > >> however, unfortunately, in our latest tests we still observed the below
> > >> regression with this commit. just FYI.
> > >>
> > >>
> > >>
> > >> kernel test robot noticed a -84.3% regression of stress-ng.pthread.ops_per_sec on:
> > >
> > > Interesting, wasn't the same regression seen last time? And I'm a
> > > little bit confused about how pthread got regressed. I don't see the
> > > pthread benchmark doing any intensive memory alloc/free operations. Do
> > > the pthread APIs do any intensive memory operations? I see the
> > > benchmark does allocate memory for the thread stacks, but it should be
> > > just 8K per thread, so it should not trigger what this patch does. With
> > > 1024 threads the thread stacks may get merged into one single VMA (8M
> > > total), but that could happen even without this patch applied.
> > The stress-ng.pthread test code is odd here:
> >
> > https://github.com/ColinIanKing/stress-ng/blob/master/stress-pthread.c#L573
> >
> > Even though it allocates its own stack, that attr is not passed to
> > pthread_create(). So it is still glibc that allocates the stack for
> > the pthread, and that stack is 8M in size. This is why this patch can
> > impact the stress-ng.pthread test.
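> >
> > A minimal sketch of that pattern (illustrative only, not the actual
> > stress-ng code; the 64K stack size and the names are made up):
> >
> >   #include <pthread.h>
> >   #include <stdlib.h>
> >
> >   static void *thread_fn(void *arg) { return NULL; }
> >
> >   int main(void)
> >   {
> >       pthread_attr_t attr;
> >       pthread_t tid;
> >       void *stack = malloc(64 * 1024);
> >
> >       pthread_attr_init(&attr);
> >       /* the custom stack is configured on the attr... */
> >       pthread_attr_setstack(&attr, stack, 64 * 1024);
> >       /* ...but NULL is passed instead of &attr, so glibc allocates
> >          its own default stack: an anonymous 8M mmap per thread */
> >       pthread_create(&tid, NULL, thread_fn, NULL);
> >       pthread_join(tid, NULL);
> >       free(stack);
> >       return 0;
> >   }
> >
> > With an 8M anonymous mapping per thread, the THP alignment change
> > kicks in for every thread stack.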
>
> Aha, nice catch, I overlooked that.
>
> >
> >
> > My understanding is that this is a different regression (if it's a
> > valid regression). The previous hotspot was in:
> > deferred_split_huge_page
> > deferred_split_huge_page
> > deferred_split_huge_page
> > spin_lock
> >
> > while this time the hotspot is in (pmd_lock from do_madvise, I suppose):
> > - 55.02% zap_pmd_range.isra.0
> > - 53.42% __split_huge_pmd
> > - 51.74% _raw_spin_lock
> > - 51.73% native_queued_spin_lock_slowpath
> > + 3.03% asm_sysvec_call_function
> > - 1.67% __split_huge_pmd_locked
> > - 0.87% pmdp_invalidate
> > + 0.86% flush_tlb_mm_range
> > - 1.60% zap_pte_range
> > - 1.04% page_remove_rmap
> > 0.55% __mod_lruvec_page_state
> >
> >
> > >
> > >>
> > >>
> > >> commit: 1111d46b5cbad57486e7a3fab75888accac2f072 ("mm: align larger anonymous mappings on THP boundaries")
> > >> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> > >>
> > >> testcase: stress-ng
> > >> test machine: 36 threads 1 sockets Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz (Cascade Lake) with 128G memory
> > >> parameters:
> > >>
> > >> nr_threads: 1
> > >> disk: 1HDD
> > >> testtime: 60s
> > >> fs: ext4
> > >> class: os
> > >> test: pthread
> > >> cpufreq_governor: performance
> > >>
> > >>
> > >> In addition to that, the commit also has significant impact on the following tests:
> > >>
> > >> +------------------+-----------------------------------------------------------------------------------------------+
> > >> | testcase: change | stream: stream.triad_bandwidth_MBps -12.1% regression |
> > >> | test machine | 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 512G memory |
> > >> | test parameters | array_size=50000000 |
> > >> | | cpufreq_governor=performance |
> > >> | | iterations=10x |
> > >> | | loop=100 |
> > >> | | nr_threads=25% |
> > >> | | omp=true |
> > >> +------------------+-----------------------------------------------------------------------------------------------+
> > >> | testcase: change | phoronix-test-suite: phoronix-test-suite.ramspeed.Average.Integer.mb_s -3.5% regression |
> > >> | test machine | 12 threads 1 sockets Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz (Coffee Lake) with 16G memory |
> > >> | test parameters | cpufreq_governor=performance |
> > >> | | option_a=Average |
> > >> | | option_b=Integer |
> > >> | | test=ramspeed-1.4.3 |
> > >> +------------------+-----------------------------------------------------------------------------------------------+
> > >> | testcase: change | phoronix-test-suite: phoronix-test-suite.ramspeed.Average.FloatingPoint.mb_s -3.0% regression |
> > >> | test machine | 12 threads 1 sockets Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz (Coffee Lake) with 16G memory |
> > >> | test parameters | cpufreq_governor=performance |
> > >> | | option_a=Average |
> > >> | | option_b=Floating Point |
> > >> | | test=ramspeed-1.4.3 |
> > >> +------------------+-----------------------------------------------------------------------------------------------+
> > >>
> > >>
> > >> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > >> the same patch/commit), kindly add following tags
> > >> | Reported-by: kernel test robot <oliver.sang@intel.com>
> > >> | Closes: https://lore.kernel.org/oe-lkp/202312192310.56367035-oliver.sang@intel.com
> > >>
> > >>
> > >> Details are as below:
> > >> -------------------------------------------------------------------------------------------------->
> > >>
> > >>
> > >> The kernel config and materials to reproduce are available at:
> > >> https://download.01.org/0day-ci/archive/20231219/202312192310.56367035-oliver.sang@intel.com
> > >>
> > >> =========================================================================================
> > >> class/compiler/cpufreq_governor/disk/fs/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
> > >> os/gcc-12/performance/1HDD/ext4/x86_64-rhel-8.3/1/debian-11.1-x86_64-20220510.cgz/lkp-csl-d02/pthread/stress-ng/60s
> > >>
> > >> commit:
> > >> 30749e6fbb ("mm/memory: replace kmap() with kmap_local_page()")
> > >> 1111d46b5c ("mm: align larger anonymous mappings on THP boundaries")
> > >>
> > >> 30749e6fbb3d391a 1111d46b5cbad57486e7a3fab75
> > >> ---------------- ---------------------------
> > >> %stddev %change %stddev
> > >> \ | \
> > >> 13405796 -65.5% 4620124 cpuidle..usage
> > >> 8.00 +8.2% 8.66 ± 2% iostat.cpu.system
> > >> 1.61 -60.6% 0.63 iostat.cpu.user
> > >> 597.50 ± 14% -64.3% 213.50 ± 14% perf-c2c.DRAM.local
> > >> 1882 ± 14% -74.7% 476.83 ± 7% perf-c2c.HITM.local
> > >> 3768436 -12.9% 3283395 vmstat.memory.cache
> > >> 355105 -75.7% 86344 ± 3% vmstat.system.cs
> > >> 385435 -20.7% 305714 ± 3% vmstat.system.in
> > >> 1.13 -0.2 0.88 mpstat.cpu.all.irq%
> > >> 0.29 -0.2 0.10 ± 2% mpstat.cpu.all.soft%
> > >> 6.76 ± 2% +1.1 7.88 ± 2% mpstat.cpu.all.sys%
> > >> 1.62 -1.0 0.62 ± 2% mpstat.cpu.all.usr%
> > >> 2234397 -84.3% 350161 ± 5% stress-ng.pthread.ops
> > >> 37237 -84.3% 5834 ± 5% stress-ng.pthread.ops_per_sec
> > >> 294706 ± 2% -68.0% 94191 ± 6% stress-ng.time.involuntary_context_switches
> > >> 41442 ± 2% +5023.4% 2123284 stress-ng.time.maximum_resident_set_size
> > >> 4466457 -83.9% 717053 ± 5% stress-ng.time.minor_page_faults
> > >
> > > The larger RSS and fewer page faults are expected.
> > >
> > >> 243.33 +13.5% 276.17 ± 3% stress-ng.time.percent_of_cpu_this_job_got
> > >> 131.64 +27.7% 168.11 ± 3% stress-ng.time.system_time
> > >> 19.73 -82.1% 3.53 ± 4% stress-ng.time.user_time
> > >
> > > Much less user time. And it seems to match the drop in the pthread metric.
> > >
> > >> 7715609 -80.2% 1530125 ± 4% stress-ng.time.voluntary_context_switches
> > >> 76728 -80.8% 14724 ± 4% perf-stat.i.minor-faults
> > >> 5600408 -61.4% 2160997 ± 5% perf-stat.i.node-loads
> > >> 8873996 +52.1% 13499744 ± 5% perf-stat.i.node-stores
> > >> 112409 -81.9% 20305 ± 4% perf-stat.i.page-faults
> > >> 2.55 +89.6% 4.83 perf-stat.overall.MPKI
> > >
> > > Many more TLB misses.
> > >
> > >> 1.51 -0.4 1.13 perf-stat.overall.branch-miss-rate%
> > >> 19.26 +24.5 43.71 perf-stat.overall.cache-miss-rate%
> > >> 1.70 +56.4% 2.65 perf-stat.overall.cpi
> > >> 665.84 -17.5% 549.51 ± 2% perf-stat.overall.cycles-between-cache-misses
> > >> 0.12 ± 4% -0.1 0.04 perf-stat.overall.dTLB-load-miss-rate%
> > >> 0.08 ± 2% -0.0 0.03 perf-stat.overall.dTLB-store-miss-rate%
> > >> 59.16 +0.9 60.04 perf-stat.overall.iTLB-load-miss-rate%
> > >> 1278 +86.1% 2379 ± 2% perf-stat.overall.instructions-per-iTLB-miss
> > >> 0.59 -36.1% 0.38 perf-stat.overall.ipc
> > >
> > > Worse IPC and CPI.
> > >
> > >> 2.078e+09 -48.3% 1.074e+09 ± 4% perf-stat.ps.branch-instructions
> > >> 31292687 -61.2% 12133349 ± 2% perf-stat.ps.branch-misses
> > >> 26057291 -5.9% 24512034 ± 4% perf-stat.ps.cache-misses
> > >> 1.353e+08 -58.6% 56072195 ± 4% perf-stat.ps.cache-references
> > >> 365254 -75.8% 88464 ± 3% perf-stat.ps.context-switches
> > >> 1.735e+10 -22.4% 1.346e+10 ± 2% perf-stat.ps.cpu-cycles
> > >> 60838 -79.1% 12727 ± 6% perf-stat.ps.cpu-migrations
> > >> 3056601 ± 4% -81.5% 565354 ± 4% perf-stat.ps.dTLB-load-misses
> > >> 2.636e+09 -50.7% 1.3e+09 ± 4% perf-stat.ps.dTLB-loads
> > >> 1155253 ± 2% -83.0% 196581 ± 5% perf-stat.ps.dTLB-store-misses
> > >> 1.473e+09 -57.4% 6.268e+08 ± 3% perf-stat.ps.dTLB-stores
> > >> 7997726 -73.3% 2131477 ± 3% perf-stat.ps.iTLB-load-misses
> > >> 5521346 -74.3% 1418623 ± 2% perf-stat.ps.iTLB-loads
> > >> 1.023e+10 -50.4% 5.073e+09 ± 4% perf-stat.ps.instructions
> > >> 75671 -80.9% 14479 ± 4% perf-stat.ps.minor-faults
> > >> 5549722 -61.4% 2141750 ± 4% perf-stat.ps.node-loads
> > >> 8769156 +51.6% 13296579 ± 5% perf-stat.ps.node-stores
> > >> 110795 -82.0% 19977 ± 4% perf-stat.ps.page-faults
> > >> 6.482e+11 -50.7% 3.197e+11 ± 4% perf-stat.total.instructions
> > >> 0.00 ± 37% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.__kmem_cache_alloc_node.__kmalloc_node.memcg_alloc_slab_cgroups.allocate_slab
> > >> 0.01 ± 18% +8373.1% 0.73 ± 49% perf-sched.sch_delay.avg.ms.__cond_resched.down_read.do_madvise.__x64_sys_madvise.do_syscall_64
> > >> 0.01 ± 16% +4600.0% 0.38 ± 24% perf-sched.sch_delay.avg.ms.__cond_resched.down_read.exit_mm.do_exit.__x64_sys_exit
> > >
> > > More time spent in madvise and munmap, but I'm not sure whether this
> > > is caused by tearing down the address space when the test exits. If
> > > so, it should not count toward the regression.
> > It's not the whole address space being torn down; it's the pthread
> > stack being torn down when a pthread exits (which can be treated as
> > address space teardown, I suppose).
> >
> > https://github.com/lattera/glibc/blob/master/nptl/allocatestack.c#L384
> > https://github.com/lattera/glibc/blob/master/nptl/pthread_create.c#L576
>
> That explains the problem. The madvise() does have some extra overhead
> for handling THP (splitting the pmd, the deferred split queue, etc.).
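>
> A user-space sketch of the path I mean (my assumptions: THP is enabled,
> and the advised range skips a guard page so it is not PMD-aligned,
> similar to glibc's stack reuse path):
>
>   #include <string.h>
>   #include <sys/mman.h>
>
>   int main(void)
>   {
>       size_t sz = 8UL << 20;  /* 8M, like a default pthread stack */
>       char *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
>                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>
>       memset(p, 0, sz);  /* fault in; may be backed by huge pages */
>       /* skipping the first 4K (the guard page) means the range does
>          not start on a PMD boundary, so zap_pmd_range() must take
>          the pmd lock and __split_huge_pmd() before it can zap */
>       madvise(p + 4096, sz - 4096, MADV_DONTNEED);
>       return 0;
>   }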
>
> >
> > Another thing is whether it's worthwhile to make stacks use THP? It
> > may be useful for some apps which need a large stack size?
>
> The kernel actually doesn't apply THP to stacks (see
> vma_is_temporary_stack()). But the kernel can only tell whether a VMA
> is a stack by checking the VM_GROWSDOWN | VM_GROWSUP flags. So if glibc
> doesn't set the proper flags to tell the kernel the area is a stack,
> the kernel just treats it as a normal anonymous area. So glibc should
> set up the stack properly, IMHO.
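>
> For reference, the check is roughly the following (quoted from memory,
> so details may be off):
>
>   static inline bool vma_is_temporary_stack(struct vm_area_struct *vma)
>   {
>       int maybe_stack = vma->vm_flags & (VM_GROWSDOWN | VM_GROWSUP);
>
>       if (!maybe_stack)
>           return false;
>
>       /* only set while the stack is being moved during exec */
>       if ((vma->vm_flags & VM_STACK_INCOMPLETE_SETUP) ==
>                           VM_STACK_INCOMPLETE_SETUP)
>           return true;
>
>       return false;
>   }
>
> A glibc-allocated thread stack sets neither VM_GROWSDOWN nor
> VM_GROWSUP, so it fails the first check and looks like plain anonymous
> memory to the THP code.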
If I read the code correctly, nptl allocates the stack with the code below:
mem = __mmap (NULL, size, (guardsize == 0) ? prot : PROT_NONE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
See https://github.com/lattera/glibc/blob/master/nptl/allocatestack.c#L563
MAP_STACK is used, but it is a no-op on Linux. So the alternative is to
make MAP_STACK meaningful on Linux instead of changing glibc, but the
blast radius of that seems much wider.
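
A hypothetical sketch of what that could look like (not actual kernel
code; routing MAP_STACK to VM_NOHUGEPAGE is just one option, and the
existing flag translations are quoted from memory):

  /* include/linux/mman.h: translate mmap() flags into VM_* flags */
  static inline unsigned long calc_vm_flag_bits(unsigned long flags)
  {
      return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
             _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
             _calc_vm_trans(flags, MAP_SYNC,       VM_SYNC      ) |
             _calc_vm_trans(flags, MAP_STACK,      VM_NOHUGEPAGE);
  }

With VM_NOHUGEPAGE set at mmap() time, the THP paths would skip such
VMAs without any glibc change, but it would change behavior for every
MAP_STACK user, not just glibc's thread stacks.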
>
> >
> >
> > Regards
> > Yin, Fengwei