From: Qi Zheng <zhengqi.arch@bytedance.com>
To: Mike Rapoport <rppt@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: tkhai@ya.ru, hannes@cmpxchg.org, shakeelb@google.com,
	mhocko@kernel.org, roman.gushchin@linux.dev,
	muchun.song@linux.dev, david@redhat.com, shy828301@gmail.com,
	sultan@kerneltoast.com, dave@stgolabs.net,
	penguin-kernel@i-love.sakura.ne.jp, paulmck@kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3 0/8] make slab shrink lockless
Date: Tue, 28 Feb 2023 18:53:17 +0800	[thread overview]
Message-ID: <36c737e1-7e1c-7098-8bd5-1767869489d9@bytedance.com> (raw)
In-Reply-To: <63a16f0e-d6e9-29a1-069e-dc76bfd82319@bytedance.com>



On 2023/2/28 18:04, Qi Zheng wrote:
> 
> 
> On 2023/2/27 23:08, Mike Rapoport wrote:
>> Hi,
>>
>> On Mon, Feb 27, 2023 at 09:31:51PM +0800, Qi Zheng wrote:
>>>
>>>
>>> On 2023/2/27 03:51, Andrew Morton wrote:
>>>> On Sun, 26 Feb 2023 22:46:47 +0800 Qi Zheng 
>>>> <zhengqi.arch@bytedance.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> This patch series aims to make slab shrink lockless.
>>>>
>>>> What an awesome changelog.
>>>>
>>>>> 2. Survey
>>>>> =========
>>>>
>>>> Especially this part.
>>>>
>>>> Looking through all the prior efforts and at this patchset I am not
>>>> immediately seeing any statements about the overall effect upon
>>>> real-world workloads.  For a good example, does this patchset
>>>> measurably improve throughput or energy consumption on your servers?
>>>
>>> Hi Andrew,
>>>
>>> I re-tested on the following physical machine:
>>>
>>> Architecture:        x86_64
>>> CPU(s):              96
>>> On-line CPU(s) list: 0-95
>>> Model name:          Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
>>>
>>> I found that the reason for the hotspot I described in the cover
>>> letter was wrong: the down_read_trylock() hotspot is caused not by
>>> trylock failures, but simply by the atomic operation (cmpxchg)
>>> itself, which leads to a significant reduction in IPC (instructions
>>> per cycle).
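(For context, a minimal sketch of the rwsem read-trylock fast path.
This is not the exact kernel code; the function name below is made up,
and the field/macro names (sem->count, RWSEM_READ_FAILED_MASK,
RWSEM_READER_BIAS) are as I recall them from kernel/locking/rwsem.c.
The point is that even a successful trylock does an atomic cmpxchg on
the shared sem->count word, so the cacheline bounces between all CPUs
that call shrink_slab() concurrently:)

	/* Simplified sketch, not the real __down_read_trylock(). */
	static inline int rwsem_read_trylock_sketch(struct rw_semaphore *sem)
	{
		long cnt = atomic_long_read(&sem->count);

		while (!(cnt & RWSEM_READ_FAILED_MASK)) {
			/* atomic RMW on a cacheline shared by every caller */
			if (atomic_long_try_cmpxchg_acquire(&sem->count, &cnt,
							    cnt + RWSEM_READER_BIAS))
				return 1;	/* got the read lock */
		}
		return 0;			/* writer present or waiting */
	}
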
>>
>> ...
>>> Then we can use the following perf command to view hotspots:
>>>
>>> perf top -U -F 999
>>>
>>> 1) Before applying this patchset:
>>>
>>>    32.31%  [kernel]           [k] down_read_trylock
>>>    19.40%  [kernel]           [k] pv_native_safe_halt
>>>    16.24%  [kernel]           [k] up_read
>>>    15.70%  [kernel]           [k] shrink_slab
>>>     4.69%  [kernel]           [k] _find_next_bit
>>>     2.62%  [kernel]           [k] shrink_node
>>>     1.78%  [kernel]           [k] shrink_lruvec
>>>     0.76%  [kernel]           [k] do_shrink_slab
>>>
>>> 2) After applying this patchset:
>>>
>>>    27.83%  [kernel]           [k] _find_next_bit
>>>    16.97%  [kernel]           [k] shrink_slab
>>>    15.82%  [kernel]           [k] pv_native_safe_halt
>>>     9.58%  [kernel]           [k] shrink_node
>>>     8.31%  [kernel]           [k] shrink_lruvec
>>>     5.64%  [kernel]           [k] do_shrink_slab
>>>     3.88%  [kernel]           [k] mem_cgroup_iter
>>>
>>> 2. At the same time, we use the following perf command to capture IPC
>>> information:
>>>
>>> perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10
>>>
>>> 1) Before applying this patchset:
>>>
>>>   Performance counter stats for 'system wide' (5 runs):
>>>
>>>        454187219766      cycles                    test                                   ( +-  1.84% )
>>>         78896433101      instructions              test      #    0.17  insn per cycle    ( +-  0.44% )
>>>
>>>          10.0020430 +- 0.0000366 seconds time elapsed  ( +-  0.00% )
>>>
>>> 2) After applying this patchset:
>>>
>>>   Performance counter stats for 'system wide' (5 runs):
>>>
>>>        841954709443      cycles                    test                                   ( +- 15.80% )  (98.69%)
>>>        527258677936      instructions              test      #    0.63  insn per cycle    ( +- 15.11% )  (98.68%)
>>>
>>>            10.01064 +- 0.00831 seconds time elapsed  ( +-  0.08% )
>>>
>>> We can see that IPC drops severely when down_read_trylock() is
>>> called at high frequency. After switching to SRCU, IPC returns to a
>>> normal level.
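(To make the change concrete, a rough before/after sketch of the reader
side. It is simplified and not the exact diff from this series; the
function names below are made up, and shrinker_srcu is the SRCU
instance name as I recall it from the series:)

	/* before: every reclaim pass does two atomic RMWs on the rwsem */
	static unsigned long shrink_slab_sketch_before(struct shrink_control *sc,
						       int priority)
	{
		struct shrinker *shrinker;
		unsigned long freed = 0;

		if (!down_read_trylock(&shrinker_rwsem))
			return 0;
		list_for_each_entry(shrinker, &shrinker_list, list)
			freed += do_shrink_slab(sc, shrinker, priority);
		up_read(&shrinker_rwsem);
		return freed;
	}

	/* after: the SRCU read side only bumps per-CPU counters, so there
	 * is no shared cacheline for concurrent reclaimers to fight over */
	static unsigned long shrink_slab_sketch_after(struct shrink_control *sc,
						      int priority)
	{
		struct shrinker *shrinker;
		unsigned long freed = 0;
		int idx;

		idx = srcu_read_lock(&shrinker_srcu);
		list_for_each_entry_srcu(shrinker, &shrinker_list, list,
					 srcu_read_lock_held(&shrinker_srcu))
			freed += do_shrink_slab(sc, shrinker, priority);
		srcu_read_unlock(&shrinker_srcu, idx);
		return freed;
	}
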
>>
>> The results you present do show an improvement in IPC for an
>> artificial test script. But it would be more interesting to see how
>> real-world workloads benefit from your changes.
> 
> Hi Mike and Andrew,
> 
> I did encounter this problem under real workloads on our online
> servers. At the end of this email, I have posted another call stack
> and hotspot profile that I captured earlier.
> 
> I scanned the hotspots of all our online servers yesterday and today,
> but unfortunately did not catch the issue occurring live.
> 
> Some of our servers run a large number of containers, and each
> container mounts several file systems. This is likely to trigger the
> down_read_trylock() hotspot when memory pressure is high, either
> machine-wide or within a memcg.

And the servers where this hotspot has occurred (we keep hotspot alarm
records) generally have 96 cores, 128 cores, or even more.
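(Why the trylock frequency gets so high in that setup: under memcg
pressure, every reclaiming task walks all eligible memcgs for each
node, and each shrink_slab() call takes shrinker_rwsem for read. A
rough sketch, simplified from my reading of shrink_node_memcgs() and
not the exact code; target_memcg, pgdat and sc stand in for the real
reclaim context:)

	/* The number of down_read_trylock() calls scales roughly with
	 * reclaiming CPUs x online memcgs x NUMA nodes, and each pass
	 * then walks every registered shrinker under the lock. */
	memcg = mem_cgroup_iter(target_memcg, NULL, NULL);
	do {
		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);

		shrink_lruvec(lruvec, sc);
		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
	} while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));

With many containers (many memcgs and mounted file systems) and ~100
cores all reclaiming at once, that loop is what turns the two atomic
operations per call into the dominant hotspot.
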

> 
> So yesterday I found a physical server with a configuration similar
> to the online servers and ran a simulation test. The call stack and
> the hotspot in the simulation are almost exactly the same, so in
> theory, when such a hotspot appears on an online server, we should see
> the same IPC improvement. This will improve server performance in
> memory-exhaustion scenarios (at the memcg or global level).
> 
> And the above scenario is only one aspect; the other is the lock
> contention scenario mentioned by Kirill. After applying this patch
> set, slab shrink and register_shrinker() can run completely in
> parallel, which fixes that problem.
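(Writer-side sketch under the same assumptions, not the exact patch
code: the function names below are made up, and shrinker_mutex is how I
recall the lock being named after patch 8 converts the rwsem, so treat
the identifiers as illustrative:)

	void register_shrinker_sketch(struct shrinker *shrinker)
	{
		mutex_lock(&shrinker_mutex);	/* serializes writers only */
		list_add_tail_rcu(&shrinker->list, &shrinker_list);
		mutex_unlock(&shrinker_mutex);
	}

	void unregister_shrinker_sketch(struct shrinker *shrinker)
	{
		mutex_lock(&shrinker_mutex);
		list_del_rcu(&shrinker->list);
		mutex_unlock(&shrinker_mutex);
		/* wait for in-flight SRCU-protected walks to drain */
		synchronize_srcu(&shrinker_srcu);
	}

The point is that the mutex only serializes writers against each other;
readers in shrink_slab() never touch it, so a slow shrinker walk can no
longer stall registration, which is Kirill's contention case.
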
> 
> These are the two main benefits I see for real workloads.
> 
> Thanks,
> Qi
> 
> call stack
> ----------
> 
> @[
>      down_read_trylock+1
>      shrink_slab+128
>      shrink_node+371
>      do_try_to_free_pages+232
>      try_to_free_pages+243
>      __alloc_pages_slowpath+771
>      __alloc_pages_nodemask+702
>      pagecache_get_page+255
>      filemap_fault+1361
>      ext4_filemap_fault+44
>      __do_fault+76
>      handle_mm_fault+3543
>      do_user_addr_fault+442
>      do_page_fault+48
>      page_fault+62
> ]: 1161690
> @[
>      down_read_trylock+1
>      shrink_slab+128
>      shrink_node+371
>      balance_pgdat+690
>      kswapd+389
>      kthread+246
>      ret_from_fork+31
> ]: 8424884
> @[
>      down_read_trylock+1
>      shrink_slab+128
>      shrink_node+371
>      do_try_to_free_pages+232
>      try_to_free_pages+243
>      __alloc_pages_slowpath+771
>      __alloc_pages_nodemask+702
>      __do_page_cache_readahead+244
>      filemap_fault+1674
>      ext4_filemap_fault+44
>      __do_fault+76
>      handle_mm_fault+3543
>      do_user_addr_fault+442
>      do_page_fault+48
>      page_fault+62
> ]: 20917631
> 
> hotspot
> -------
> 
> 52.22%  [kernel]        [k] down_read_trylock
> 19.60%  [kernel]        [k] up_read
>  8.86%  [kernel]        [k] shrink_slab
>  2.44%  [kernel]        [k] idr_find
>  1.25%  [kernel]        [k] count_shadow_nodes
>  1.18%  [kernel]        [k] shrink_lruvec
>  0.71%  [kernel]        [k] mem_cgroup_iter
>  0.71%  [kernel]        [k] shrink_node
>  0.55%  [kernel]        [k] find_next_bit
> 
> 
>>> Thanks,
>>> Qi
>>
> 

-- 
Thanks,
Qi


Thread overview: 23+ messages
2023-02-26 14:46 Qi Zheng
2023-02-26 14:46 ` [PATCH v3 1/8] mm: vmscan: add a map_nr_max field to shrinker_info Qi Zheng
2023-02-26 14:54   ` Qi Zheng
2023-02-26 14:46 ` [PATCH v3 2/8] mm: vmscan: make global slab shrink lockless Qi Zheng
2023-02-26 14:46 ` [PATCH v3 3/8] mm: vmscan: make memcg " Qi Zheng
2023-02-26 14:46 ` [PATCH v3 4/8] mm: vmscan: add shrinker_srcu_generation Qi Zheng
2023-02-26 14:46 ` [PATCH v3 5/8] mm: shrinkers: make count and scan in shrinker debugfs lockless Qi Zheng
2023-02-26 14:46 ` [PATCH v3 6/8] mm: vmscan: hold write lock to reparent shrinker nr_deferred Qi Zheng
2023-02-26 14:46 ` [PATCH v3 7/8] mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers() Qi Zheng
2023-02-26 14:46 ` [PATCH v3 8/8] mm: shrinkers: convert shrinker_rwsem to mutex Qi Zheng
2023-02-26 19:51 ` [PATCH v3 0/8] make slab shrink lockless Andrew Morton
2023-02-27 13:31   ` Qi Zheng
2023-02-27 15:08     ` Mike Rapoport
2023-02-27 19:20       ` Kirill Tkhai
2023-02-27 19:32         ` Roman Gushchin
2023-02-27 19:47           ` Kirill Tkhai
2023-02-28 10:08         ` Qi Zheng
2023-02-28 10:04       ` Qi Zheng
2023-02-28 10:53         ` Qi Zheng [this message]
2023-02-28 18:40       ` Michal Hocko
2023-03-01  2:27         ` Qi Zheng
2023-02-27 19:02     ` Roman Gushchin
2023-02-28 10:11       ` Qi Zheng
