From: Qi Zheng <zhengqi.arch@bytedance.com>
To: Mike Rapoport <rppt@kernel.org>,
Andrew Morton <akpm@linux-foundation.org>
Cc: tkhai@ya.ru, hannes@cmpxchg.org, shakeelb@google.com,
mhocko@kernel.org, roman.gushchin@linux.dev,
muchun.song@linux.dev, david@redhat.com, shy828301@gmail.com,
sultan@kerneltoast.com, dave@stgolabs.net,
penguin-kernel@i-love.sakura.ne.jp, paulmck@kernel.org,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3 0/8] make slab shrink lockless
Date: Tue, 28 Feb 2023 18:04:53 +0800
Message-ID: <63a16f0e-d6e9-29a1-069e-dc76bfd82319@bytedance.com>
In-Reply-To: <Y/zHbhxnQ2YsP+wX@kernel.org>
On 2023/2/27 23:08, Mike Rapoport wrote:
> Hi,
>
> On Mon, Feb 27, 2023 at 09:31:51PM +0800, Qi Zheng wrote:
>>
>>
>> On 2023/2/27 03:51, Andrew Morton wrote:
>>> On Sun, 26 Feb 2023 22:46:47 +0800 Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> This patch series aims to make slab shrink lockless.
>>>
>>> What an awesome changelog.
>>>
>>>> 2. Survey
>>>> =========
>>>
>>> Especially this part.
>>>
>>> Looking through all the prior efforts and at this patchset I am not
>>> immediately seeing any statements about the overall effect upon
>>> real-world workloads. For a good example, does this patchset
>>> measurably improve throughput or energy consumption on your servers?
>>
>> Hi Andrew,
>>
>> I re-tested with the following physical machines:
>>
>> Architecture: x86_64
>> CPU(s): 96
>> On-line CPU(s) list: 0-95
>> Model name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
>>
>> I found that the reason for the hotspot I described in the cover
>> letter is wrong. The down_read_trylock() hotspot is not caused by
>> trylock failures, but simply by the atomic operation (cmpxchg)
>> itself. And this leads to a significant drop in IPC (instructions
>> per cycle).
>
> ...
>
>> Then we can use the following perf command to view hotspots:
>>
>> perf top -U -F 999
>>
>> 1) Before applying this patchset:
>>
>> 32.31% [kernel] [k] down_read_trylock
>> 19.40% [kernel] [k] pv_native_safe_halt
>> 16.24% [kernel] [k] up_read
>> 15.70% [kernel] [k] shrink_slab
>> 4.69% [kernel] [k] _find_next_bit
>> 2.62% [kernel] [k] shrink_node
>> 1.78% [kernel] [k] shrink_lruvec
>> 0.76% [kernel] [k] do_shrink_slab
>>
>> 2) After applying this patchset:
>>
>> 27.83% [kernel] [k] _find_next_bit
>> 16.97% [kernel] [k] shrink_slab
>> 15.82% [kernel] [k] pv_native_safe_halt
>> 9.58% [kernel] [k] shrink_node
>> 8.31% [kernel] [k] shrink_lruvec
>> 5.64% [kernel] [k] do_shrink_slab
>> 3.88% [kernel] [k] mem_cgroup_iter
>>
>> 2. At the same time, we use the following perf command to capture IPC
>> information:
>>
>> perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10
>>
>> 1) Before applying this patchset:
>>
>> Performance counter stats for 'system wide' (5 runs):
>>
>>     454187219766      cycles          test                              ( +-  1.84% )
>>      78896433101      instructions    test    #  0.17  insn per cycle   ( +-  0.44% )
>>
>>       10.0020430 +- 0.0000366 seconds time elapsed  ( +-  0.00% )
>>
>> 2) After applying this patchset:
>>
>> Performance counter stats for 'system wide' (5 runs):
>>
>>     841954709443      cycles          test                              ( +- 15.80% )  (98.69%)
>>     527258677936      instructions    test    #  0.63  insn per cycle   ( +- 15.11% )  (98.68%)
>>
>>         10.01064 +- 0.00831 seconds time elapsed  ( +-  0.08% )
>>
>> We can see that the IPC drops severely when down_read_trylock() is
>> called at high frequency. After switching to SRCU, the IPC returns
>> to a normal level.
>
> The results you present do show an improvement in IPC for an artificial
> test script. But it would be more interesting to see how real-world
> workloads benefit from your changes.
Hi Mike and Andrew,
I did encounter this problem under real workloads on our online
servers. At the end of this email, I have posted the call stacks and
the hotspot profile that I captured earlier.

I scanned the hotspots of all our online servers yesterday and today,
but unfortunately could not catch a live occurrence right now.
Some of our servers run a large number of containers, and each
container mounts several file systems. This setup is likely to trigger
the down_read_trylock() hotspot when memory pressure is high, either
machine-wide or within a memcg.
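For context, the contended call is the one at the top of shrink_slab(),
which every direct-reclaim and kswapd pass has to go through. Below is a
simplified sketch of the pre-patch read side (trimmed, not the exact
mainline source; the memcg branch and nr_deferred handling are omitted).
Its fast path is a cmpxchg on the shared rwsem count, which is the
cacheline all reclaimers bounce:

/* mm/vmscan.c before this series, heavily simplified */
DECLARE_RWSEM(shrinker_rwsem);
LIST_HEAD(shrinker_list);

static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
                                 struct mem_cgroup *memcg, int priority)
{
        unsigned long freed = 0;
        struct shrinker *shrinker;

        /*
         * Even though the semaphore is almost never write-held, the
         * trylock itself is an atomic cmpxchg on shrinker_rwsem.count,
         * so concurrent reclaimers keep bouncing that cacheline.
         */
        if (!down_read_trylock(&shrinker_rwsem))
                goto out;

        list_for_each_entry(shrinker, &shrinker_list, list) {
                struct shrink_control sc = {
                        .gfp_mask = gfp_mask,
                        .nid = nid,
                        .memcg = memcg,
                };

                freed += do_shrink_slab(&sc, shrinker, priority);
        }

        up_read(&shrinker_rwsem);
out:
        return freed;
}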
So yesterday I picked a physical server with a configuration similar to
the online servers and ran a simulation test. The call stacks and the
hotspot in the simulation are almost exactly the same as what I saw
online, so in theory, when such a hotspot appears on an online server,
we should see the same IPC improvement. This will improve server
performance in memory-exhaustion scenarios (at the memcg or global
level).
The above scenario is only one aspect; the other is the lock contention
scenario mentioned by Kirill. After applying this patch set, slab
shrink and register_shrinker() can run completely in parallel, which
fixes that problem (see the sketch below).
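To make the parallelism concrete, here is a rough illustration of the
direction the series takes (a simplified sketch of the SRCU approach,
not the exact patches): the read side replaces shrinker_rwsem with an
SRCU read lock, and unregistration waits for in-flight readers with
synchronize_srcu(), so shrink_slab() no longer serializes against
shrinker (un)registration:

/* illustration only; names follow the series but details are simplified */
DEFINE_STATIC_SRCU(shrinker_srcu);

static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
                                 struct mem_cgroup *memcg, int priority)
{
        unsigned long freed = 0;
        struct shrinker *shrinker;
        int srcu_idx;

        /* Per-CPU counter increment: no shared cacheline, cannot fail. */
        srcu_idx = srcu_read_lock(&shrinker_srcu);

        list_for_each_entry_srcu(shrinker, &shrinker_list, list,
                                 srcu_read_lock_held(&shrinker_srcu)) {
                struct shrink_control sc = {
                        .gfp_mask = gfp_mask,
                        .nid = nid,
                        .memcg = memcg,
                };

                freed += do_shrink_slab(&sc, shrinker, priority);
        }

        srcu_read_unlock(&shrinker_srcu, srcu_idx);
        return freed;
}

void unregister_shrinker(struct shrinker *shrinker)
{
        /* Writers only serialize with each other, never with readers. */
        mutex_lock(&shrinker_mutex);
        list_del_rcu(&shrinker->list);
        mutex_unlock(&shrinker_mutex);

        /* Wait for in-flight shrink_slab() readers before freeing. */
        synchronize_srcu(&shrinker_srcu);

        kfree(shrinker->nr_deferred);
        shrinker->nr_deferred = NULL;
}

With readers off the rwsem entirely, registration only needs to
serialize against other writers, and the IPC improvement above comes
from the read side no longer paying for a shared atomic on every
shrink_slab() call.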
These are the two main benefits I see for real workloads.
Thanks,
Qi
call stack
----------
@[
down_read_trylock+1
shrink_slab+128
shrink_node+371
do_try_to_free_pages+232
try_to_free_pages+243
__alloc_pages_slowpath+771
__alloc_pages_nodemask+702
pagecache_get_page+255
filemap_fault+1361
ext4_filemap_fault+44
__do_fault+76
handle_mm_fault+3543
do_user_addr_fault+442
do_page_fault+48
page_fault+62
]: 1161690
@[
down_read_trylock+1
shrink_slab+128
shrink_node+371
balance_pgdat+690
kswapd+389
kthread+246
ret_from_fork+31
]: 8424884
@[
down_read_trylock+1
shrink_slab+128
shrink_node+371
do_try_to_free_pages+232
try_to_free_pages+243
__alloc_pages_slowpath+771
__alloc_pages_nodemask+702
__do_page_cache_readahead+244
filemap_fault+1674
ext4_filemap_fault+44
__do_fault+76
handle_mm_fault+3543
do_user_addr_fault+442
do_page_fault+48
page_fault+62
]: 20917631
hotspot
-------
52.22% [kernel] [k] down_read_trylock
19.60% [kernel] [k] up_read
8.86% [kernel] [k] shrink_slab
2.44% [kernel] [k] idr_find
1.25% [kernel] [k] count_shadow_nodes
1.18% [kernel] [k] shrink_lruvec
0.71% [kernel] [k] mem_cgroup_iter
0.71% [kernel] [k] shrink_node
0.55% [kernel] [k] find_next_bit
>
>> Thanks,
>> Qi
>
--
Thanks,
Qi