From: Yin Fengwei <fengwei.yin@intel.com>
To: Yang Shi <shy828301@gmail.com>
Cc: kernel test robot <oliver.sang@intel.com>,
	Rik van Riel <riel@surriel.com>, <oe-lkp@lists.linux.dev>,
	<lkp@intel.com>,
	"Linux Memory Management List" <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Matthew Wilcox <willy@infradead.org>,
	Christopher Lameter <cl@linux.com>, <ying.huang@intel.com>,
	<feng.tang@intel.com>
Subject: Re: [linux-next:master] [mm] 1111d46b5c: stress-ng.pthread.ops_per_sec -84.3% regression
Date: Thu, 21 Dec 2023 08:58:42 +0800	[thread overview]
Message-ID: <b1974dc2-0e25-4075-a9ba-5e1face6c0cd@intel.com> (raw)
In-Reply-To: <CAHbLzkrd-y2=KHS-zreH8FUmQtOaf_FGuyj78zNNH3FbpCQNnA@mail.gmail.com>



On 2023/12/21 08:26, Yang Shi wrote:
> On Wed, Dec 20, 2023 at 12:09 PM Yang Shi <shy828301@gmail.com> wrote:
>>
>> On Wed, Dec 20, 2023 at 12:34 AM Yin Fengwei <fengwei.yin@intel.com> wrote:
>>>
>>>
>>>
>>> On 2023/12/20 13:27, Yang Shi wrote:
>>>> On Tue, Dec 19, 2023 at 7:41 AM kernel test robot <oliver.sang@intel.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> Hello,
>>>>>
>>>>> for this commit, we reported
>>>>> "[mm]  96db82a66d:  will-it-scale.per_process_ops -95.3% regression"
>>>>> in Aug 2022, when it was in linux-next/master
>>>>> https://lore.kernel.org/all/YwIoiIYo4qsYBcgd@xsang-OptiPlex-9020/
>>>>>
>>>>> later, we reported
>>>>> "[mm] f35b5d7d67: will-it-scale.per_process_ops -95.5% regression"
>>>>> in Oct 2022, when it was in linus/master
>>>>> https://lore.kernel.org/all/202210181535.7144dd15-yujie.liu@intel.com/
>>>>>
>>>>> and the commit was reverted finally by
>>>>> commit 0ba09b1733878afe838fe35c310715fda3d46428
>>>>> Author: Linus Torvalds <torvalds@linux-foundation.org>
>>>>> Date:   Sun Dec 4 12:51:59 2022 -0800
>>>>>
>>>>> now we noticed it has gone into linux-next/master again.
>>>>>
>>>>> we are not sure if there is an agreement that the benefit of this commit
>>>>> already outweighs the performance drop in some micro benchmarks.
>>>>>
>>>>> we also noticed from https://lore.kernel.org/all/20231214223423.1133074-1-yang@os.amperecomputing.com/
>>>>> that
>>>>> "This patch was applied to v6.1, but was reverted due to a regression
>>>>> report.  However it turned out the regression was not due to this patch.
>>>>> I ping'ed Andrew to reapply this patch, Andrew may forget it.  This
>>>>> patch helps promote THP, so I rebased it onto the latest mm-unstable."
>>>>
>>>> IIRC, Huang Ying's analysis showed the regression in the will-it-scale
>>>> micro benchmark was fine; the patch was actually reverted due to a kernel
>>>> build regression with LLVM reported by Nathan Chancellor. That
>>>> regression was then resolved by commit
>>>> 81e506bec9be1eceaf5a2c654e28ba5176ef48d8 ("mm/thp: check and bail out
>>>> if page in deferred queue already"). And this patch did improve kernel
>>>> build with GCC by ~3% if I remember correctly.
>>>>
>>>>>
>>>>> however, unfortunately, in our latest tests, we still observed the below
>>>>> regression with this commit. Just FYI.
>>>>>
>>>>>
>>>>>
>>>>> kernel test robot noticed a -84.3% regression of stress-ng.pthread.ops_per_sec on:
>>>>
>>>> Interesting, wasn't the same regression seen last time? And I'm a
>>>> little bit confused about how pthread got regressed. I didn't see the
>>>> pthread benchmark do any intensive memory alloc/free operations. Do
>>>> the pthread APIs do any intensive memory operations? I saw the
>>>> benchmark does allocate memory for the thread stacks, but it should be
>>>> just 8K per thread, so it should not trigger what this patch does. With
>>>> 1024 threads, the thread stacks may get merged into one single VMA (8M
>>>> total), but that may happen even when the patch is not applied.
>>> stress-ng.pthread test code is strange here:
>>>
>>> https://github.com/ColinIanKing/stress-ng/blob/master/stress-pthread.c#L573
>>>
>>> Even though it allocates its own stack, that attr is never passed
>>> to pthread_create. So it is still glibc that allocates the stack
>>> for the pthread, and that stack is 8M in size. This is why this
>>> patch can impact the stress-ng.pthread test.
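For illustration, here is a minimal sketch of that pattern (my own code,
not the actual stress-ng source): an attr requesting a small stack is
prepared but never handed to pthread_create(), so glibc still mmap()s
its default stack, typically 8M:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdio.h>

    static void *worker(void *arg)
    {
            pthread_attr_t cur;
            size_t size = 0;

            /* Report the stack size the thread actually got. */
            pthread_getattr_np(pthread_self(), &cur);
            pthread_attr_getstacksize(&cur, &size);
            printf("thread stack size: %zu\n", size);
            pthread_attr_destroy(&cur);
            return NULL;
    }

    int main(void)
    {
            pthread_t tid;
            pthread_attr_t attr;

            /* Prepare an attr asking for a small 64K stack... */
            pthread_attr_init(&attr);
            pthread_attr_setstacksize(&attr, 64 * 1024);

            /* ...but pass NULL instead of &attr, so glibc ignores the
             * prepared attr and mmap()s its default-sized stack. */
            pthread_create(&tid, NULL, worker, NULL);
            pthread_join(tid, NULL);

            pthread_attr_destroy(&attr);
            return 0;
    }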
>>
>> Aha, nice catch, I overlooked that.
>>
>>>
>>>
>>> My understanding is that this is a different regression (if it's a
>>> valid regression at all). The previous hotspot was in:
>>>      deferred_split_huge_page
>>>         deferred_split_huge_page
>>>            deferred_split_huge_page
>>>               spin_lock
>>>
>>> while this time the hotspot is here (the pmd_lock taken from do_madvise, I suppose):
>>>      - 55.02% zap_pmd_range.isra.0
>>>         - 53.42% __split_huge_pmd
>>>            - 51.74% _raw_spin_lock
>>>               - 51.73% native_queued_spin_lock_slowpath
>>>                  + 3.03% asm_sysvec_call_function
>>>            - 1.67% __split_huge_pmd_locked
>>>               - 0.87% pmdp_invalidate
>>>                  + 0.86% flush_tlb_mm_range
>>>         - 1.60% zap_pte_range
>>>            - 1.04% page_remove_rmap
>>>                 0.55% __mod_lruvec_page_state
>>>
>>>
>>>>
>>>>>
>>>>>
>>>>> commit: 1111d46b5cbad57486e7a3fab75888accac2f072 ("mm: align larger anonymous mappings on THP boundaries")
>>>>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>>>>>
>>>>> testcase: stress-ng
>>>>> test machine: 36 threads 1 sockets Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz (Cascade Lake) with 128G memory
>>>>> parameters:
>>>>>
>>>>>           nr_threads: 1
>>>>>           disk: 1HDD
>>>>>           testtime: 60s
>>>>>           fs: ext4
>>>>>           class: os
>>>>>           test: pthread
>>>>>           cpufreq_governor: performance
>>>>>
>>>>>
>>>>> In addition to that, the commit also has significant impact on the following tests:
>>>>>
>>>>> +------------------+-----------------------------------------------------------------------------------------------+
>>>>> | testcase: change | stream: stream.triad_bandwidth_MBps -12.1% regression                                         |
>>>>> | test machine     | 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 512G memory   |
>>>>> | test parameters  | array_size=50000000                                                                           |
>>>>> |                  | cpufreq_governor=performance                                                                  |
>>>>> |                  | iterations=10x                                                                                |
>>>>> |                  | loop=100                                                                                      |
>>>>> |                  | nr_threads=25%                                                                                |
>>>>> |                  | omp=true                                                                                      |
>>>>> +------------------+-----------------------------------------------------------------------------------------------+
>>>>> | testcase: change | phoronix-test-suite: phoronix-test-suite.ramspeed.Average.Integer.mb_s -3.5% regression       |
>>>>> | test machine     | 12 threads 1 sockets Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz (Coffee Lake) with 16G memory    |
>>>>> | test parameters  | cpufreq_governor=performance                                                                  |
>>>>> |                  | option_a=Average                                                                              |
>>>>> |                  | option_b=Integer                                                                              |
>>>>> |                  | test=ramspeed-1.4.3                                                                           |
>>>>> +------------------+-----------------------------------------------------------------------------------------------+
>>>>> | testcase: change | phoronix-test-suite: phoronix-test-suite.ramspeed.Average.FloatingPoint.mb_s -3.0% regression |
>>>>> | test machine     | 12 threads 1 sockets Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz (Coffee Lake) with 16G memory    |
>>>>> | test parameters  | cpufreq_governor=performance                                                                  |
>>>>> |                  | option_a=Average                                                                              |
>>>>> |                  | option_b=Floating Point                                                                       |
>>>>> |                  | test=ramspeed-1.4.3                                                                           |
>>>>> +------------------+-----------------------------------------------------------------------------------------------+
>>>>>
>>>>>
>>>>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>>>>> the same patch/commit), kindly add following tags
>>>>> | Reported-by: kernel test robot <oliver.sang@intel.com>
>>>>> | Closes: https://lore.kernel.org/oe-lkp/202312192310.56367035-oliver.sang@intel.com
>>>>>
>>>>>
>>>>> Details are as below:
>>>>> -------------------------------------------------------------------------------------------------->
>>>>>
>>>>>
>>>>> The kernel config and materials to reproduce are available at:
>>>>> https://download.01.org/0day-ci/archive/20231219/202312192310.56367035-oliver.sang@intel.com
>>>>>
>>>>> =========================================================================================
>>>>> class/compiler/cpufreq_governor/disk/fs/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
>>>>>     os/gcc-12/performance/1HDD/ext4/x86_64-rhel-8.3/1/debian-11.1-x86_64-20220510.cgz/lkp-csl-d02/pthread/stress-ng/60s
>>>>>
>>>>> commit:
>>>>>     30749e6fbb ("mm/memory: replace kmap() with kmap_local_page()")
>>>>>     1111d46b5c ("mm: align larger anonymous mappings on THP boundaries")
>>>>>
>>>>> 30749e6fbb3d391a 1111d46b5cbad57486e7a3fab75
>>>>> ---------------- ---------------------------
>>>>>            %stddev     %change         %stddev
>>>>>                \          |                \
>>>>>     13405796           -65.5%    4620124        cpuidle..usage
>>>>>         8.00            +8.2%       8.66 ±  2%  iostat.cpu.system
>>>>>         1.61           -60.6%       0.63        iostat.cpu.user
>>>>>       597.50 ± 14%     -64.3%     213.50 ± 14%  perf-c2c.DRAM.local
>>>>>         1882 ± 14%     -74.7%     476.83 ±  7%  perf-c2c.HITM.local
>>>>>      3768436           -12.9%    3283395        vmstat.memory.cache
>>>>>       355105           -75.7%      86344 ±  3%  vmstat.system.cs
>>>>>       385435           -20.7%     305714 ±  3%  vmstat.system.in
>>>>>         1.13            -0.2        0.88        mpstat.cpu.all.irq%
>>>>>         0.29            -0.2        0.10 ±  2%  mpstat.cpu.all.soft%
>>>>>         6.76 ±  2%      +1.1        7.88 ±  2%  mpstat.cpu.all.sys%
>>>>>         1.62            -1.0        0.62 ±  2%  mpstat.cpu.all.usr%
>>>>>      2234397           -84.3%     350161 ±  5%  stress-ng.pthread.ops
>>>>>        37237           -84.3%       5834 ±  5%  stress-ng.pthread.ops_per_sec
>>>>>       294706 ±  2%     -68.0%      94191 ±  6%  stress-ng.time.involuntary_context_switches
>>>>>        41442 ±  2%   +5023.4%    2123284        stress-ng.time.maximum_resident_set_size
>>>>>      4466457           -83.9%     717053 ±  5%  stress-ng.time.minor_page_faults
>>>>
>>>> The larger RSS and fewer page faults are expected.
>>>>
>>>>>       243.33           +13.5%     276.17 ±  3%  stress-ng.time.percent_of_cpu_this_job_got
>>>>>       131.64           +27.7%     168.11 ±  3%  stress-ng.time.system_time
>>>>>        19.73           -82.1%       3.53 ±  4%  stress-ng.time.user_time
>>>>
>>>> Much less user time. And it seems to match the drop of the pthread metric.
>>>>
>>>>>      7715609           -80.2%    1530125 ±  4%  stress-ng.time.voluntary_context_switches
>>>>>        76728           -80.8%      14724 ±  4%  perf-stat.i.minor-faults
>>>>>      5600408           -61.4%    2160997 ±  5%  perf-stat.i.node-loads
>>>>>      8873996           +52.1%   13499744 ±  5%  perf-stat.i.node-stores
>>>>>       112409           -81.9%      20305 ±  4%  perf-stat.i.page-faults
>>>>>         2.55           +89.6%       4.83        perf-stat.overall.MPKI
>>>>
>>>> Many more TLB misses.
>>>>
>>>>>         1.51            -0.4        1.13        perf-stat.overall.branch-miss-rate%
>>>>>        19.26           +24.5       43.71        perf-stat.overall.cache-miss-rate%
>>>>>         1.70           +56.4%       2.65        perf-stat.overall.cpi
>>>>>       665.84           -17.5%     549.51 ±  2%  perf-stat.overall.cycles-between-cache-misses
>>>>>         0.12 ±  4%      -0.1        0.04        perf-stat.overall.dTLB-load-miss-rate%
>>>>>         0.08 ±  2%      -0.0        0.03        perf-stat.overall.dTLB-store-miss-rate%
>>>>>        59.16            +0.9       60.04        perf-stat.overall.iTLB-load-miss-rate%
>>>>>         1278           +86.1%       2379 ±  2%  perf-stat.overall.instructions-per-iTLB-miss
>>>>>         0.59           -36.1%       0.38        perf-stat.overall.ipc
>>>>
>>>> Worse IPC and CPI.
>>>>
>>>>>    2.078e+09           -48.3%  1.074e+09 ±  4%  perf-stat.ps.branch-instructions
>>>>>     31292687           -61.2%   12133349 ±  2%  perf-stat.ps.branch-misses
>>>>>     26057291            -5.9%   24512034 ±  4%  perf-stat.ps.cache-misses
>>>>>    1.353e+08           -58.6%   56072195 ±  4%  perf-stat.ps.cache-references
>>>>>       365254           -75.8%      88464 ±  3%  perf-stat.ps.context-switches
>>>>>    1.735e+10           -22.4%  1.346e+10 ±  2%  perf-stat.ps.cpu-cycles
>>>>>        60838           -79.1%      12727 ±  6%  perf-stat.ps.cpu-migrations
>>>>>      3056601 ±  4%     -81.5%     565354 ±  4%  perf-stat.ps.dTLB-load-misses
>>>>>    2.636e+09           -50.7%    1.3e+09 ±  4%  perf-stat.ps.dTLB-loads
>>>>>      1155253 ±  2%     -83.0%     196581 ±  5%  perf-stat.ps.dTLB-store-misses
>>>>>    1.473e+09           -57.4%  6.268e+08 ±  3%  perf-stat.ps.dTLB-stores
>>>>>      7997726           -73.3%    2131477 ±  3%  perf-stat.ps.iTLB-load-misses
>>>>>      5521346           -74.3%    1418623 ±  2%  perf-stat.ps.iTLB-loads
>>>>>    1.023e+10           -50.4%  5.073e+09 ±  4%  perf-stat.ps.instructions
>>>>>        75671           -80.9%      14479 ±  4%  perf-stat.ps.minor-faults
>>>>>      5549722           -61.4%    2141750 ±  4%  perf-stat.ps.node-loads
>>>>>      8769156           +51.6%   13296579 ±  5%  perf-stat.ps.node-stores
>>>>>       110795           -82.0%      19977 ±  4%  perf-stat.ps.page-faults
>>>>>    6.482e+11           -50.7%  3.197e+11 ±  4%  perf-stat.total.instructions
>>>>>         0.00 ± 37%    -100.0%       0.00        perf-sched.sch_delay.avg.ms.__cond_resched.__kmem_cache_alloc_node.__kmalloc_node.memcg_alloc_slab_cgroups.allocate_slab
>>>>>         0.01 ± 18%   +8373.1%       0.73 ± 49%  perf-sched.sch_delay.avg.ms.__cond_resched.down_read.do_madvise.__x64_sys_madvise.do_syscall_64
>>>>>         0.01 ± 16%   +4600.0%       0.38 ± 24%  perf-sched.sch_delay.avg.ms.__cond_resched.down_read.exit_mm.do_exit.__x64_sys_exit
>>>>
>>>> More time spent in madvise and munmap, but I'm not sure whether this
>>>> is caused by tearing down the address space when the test exits. If
>>>> so, it should not count toward the regression.
>>> It's not the teardown of the whole address space. It's the teardown of
>>> the pthread stack when a pthread exits (which can be treated as address
>>> space teardown? I suppose so).
>>>
>>> https://github.com/lattera/glibc/blob/master/nptl/allocatestack.c#L384
>>> https://github.com/lattera/glibc/blob/master/nptl/pthread_create.c#L576
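To see that path in isolation (the sizes and the "keep for TCB" amount
below are rough assumptions, not measured glibc values), the following
sketch does roughly what the exiting thread does: mmap an 8M anonymous
region like a thread stack, fault it in, then MADV_DONTNEED most of it.
On a kernel that THP-aligns and huge-maps the region, the madvise can end
up in __split_huge_pmd, which is the hotspot in the profile above:

    #include <string.h>
    #include <sys/mman.h>

    #define STACK_SZ (8UL << 20)   /* glibc default thread stack size */
    #define KEEP_SZ  (64UL << 10)  /* roughly what glibc keeps around the TCB */

    int main(void)
    {
            /* Anonymous mapping comparable to a glibc thread stack. */
            char *stk = mmap(NULL, STACK_SZ, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
            if (stk == MAP_FAILED)
                    return 1;

            /* Fault the whole region in; with THP-aligned placement it
             * can be backed by PMD-mapped huge pages. */
            memset(stk, 0, STACK_SZ);

            /* Roughly what the exiting thread does in glibc's
             * start_thread(): hand most of the stack back to the kernel.
             * Splitting the huge PMDs here is the extra cost. */
            madvise(stk, STACK_SZ - KEEP_SZ, MADV_DONTNEED);

            munmap(stk, STACK_SZ);
            return 0;
    }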
>>
>> It explains the problem. The madvise() does have some extra overhead
>> for handling THP (splitting pmd, deferred split queue, etc).
>>
>>>
>>> Another question is whether it's worthwhile to let stacks use THP at
>>> all. It may be useful for some apps which need a large stack size?
>>
>> The kernel actually doesn't apply THP to stacks (see
>> vma_is_temporary_stack()). But the kernel has no way to know whether a
>> VMA is a stack other than checking the VM_GROWSDOWN | VM_GROWSUP flags.
>> So if glibc doesn't set the proper flags to tell the kernel the area is
>> a stack, the kernel just treats it as a normal anonymous area. So glibc
>> should set up the stack properly IMHO.
> 
> If I read the code correctly, nptl allocates stack by the below code:
> 
> mem = __mmap (NULL, size, (guardsize == 0) ? prot : PROT_NONE,
>                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
> 
> See https://github.com/lattera/glibc/blob/master/nptl/allocatestack.c#L563
> 
> The MAP_STACK is used, but it is a no-op on Linux. So the alternative
> is to make MAP_STACK useful on Linux instead of changing glibc. But
> the blast radius seems much wider.
Yes. MAP_STACK is also mentioned in the mmap manpage. I did a test that
filters MAP_STACK mappings out of the THP alignment on top of this patch,
and the regression in stress-ng.pthread was gone. I suppose this is
reasonably safe because the madvise call is only applied to the
glibc-allocated stack.

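A quick way to check the effect from userspace (the 2M alignment
expectation is an assumption about the patched kernel, not guaranteed
behaviour): map two 8M anonymous regions, one plain and one with
MAP_STACK, and look at the address mmap picked. With the alignment patch
alone both come back 2M aligned; with the MAP_STACK filter applied, only
the plain mapping should be:

    #include <stdio.h>
    #include <sys/mman.h>

    #define SZ       (8UL << 20)
    #define PMD_SIZE (2UL << 20)

    static void probe(const char *name, int extra_flags)
    {
            void *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | extra_flags, -1, 0);
            if (p == MAP_FAILED) {
                    perror(name);
                    return;
            }
            printf("%-10s %p %s\n", name, p,
                   ((unsigned long)p % PMD_SIZE) ? "not 2M aligned"
                                                 : "2M aligned");
            munmap(p, SZ);
    }

    int main(void)
    {
            probe("plain", 0);
            probe("MAP_STACK", MAP_STACK);
            return 0;
    }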

But what I am not sure about is whether it's worth making such a change,
since the regression is only clearly visible in a micro-benchmark. No
evidence shows that the other regressions in this report are related to
madvise, at least judging from the perf statistics. I need to look more
closely at stream/ramspeed.
Thanks.


Regards
Yin, Fengwei

> 
>>
>>>
>>>
>>> Regards
>>> Yin, Fengwei

