From: Qi Zheng <zhengqi.arch@bytedance.com>
To: Vlastimil Babka <vbabka@suse.cz>,
	akpm@linux-foundation.org, tkhai@ya.ru, hannes@cmpxchg.org,
	shakeelb@google.com, mhocko@kernel.org, roman.gushchin@linux.dev,
	muchun.song@linux.dev, david@redhat.com, shy828301@gmail.com,
	rppt@kernel.org
Cc: sultan@kerneltoast.com, dave@stgolabs.net,
	penguin-kernel@I-love.SAKURA.ne.jp, paulmck@kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v4 3/8] mm: vmscan: make memcg slab shrink lockless
Date: Thu, 9 Mar 2023 14:47:06 +0800	[thread overview]
Message-ID: <bbcd23a0-1869-ec95-87a4-4499b50b9683@bytedance.com> (raw)
In-Reply-To: <a5a07356-048b-562b-6748-d6d5b99acddc@suse.cz>

Hi Vlastimil,

On 2023/3/9 06:46, Vlastimil Babka wrote:
> On 3/7/23 07:56, Qi Zheng wrote:
>> Like the global slab shrink, this commit also uses SRCU to make the
>> memcg slab shrink lockless.
>>
>> We can reproduce the down_read_trylock() hotspot through the
>> following script:
>>
>> ```
>>
>> DIR="/root/shrinker/memcg/mnt"
>>
>> do_create()
>> {
>>      mkdir -p /sys/fs/cgroup/memory/test
>>      mkdir -p /sys/fs/cgroup/perf_event/test
>>      echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
>>      for i in `seq 0 $1`;
>>      do
>>          mkdir -p /sys/fs/cgroup/memory/test/$i;
>>          echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
>>          echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
>>          mkdir -p $DIR/$i;
>>      done
>> }
>>
>> do_mount()
>> {
>>      for i in `seq $1 $2`;
>>      do
>>          mount -t tmpfs $i $DIR/$i;
>>      done
>> }
>>
>> do_touch()
>> {
>>      for i in `seq $1 $2`;
>>      do
>>          echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
>>          echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
>>          dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
>>      done
>> }
>>
>> case "$1" in
>>    touch)
>>      do_touch $2 $3
>>      ;;
>>    test)
>>      do_create 4000
>>      do_mount 0 4000
>>      do_touch 0 3000
>>      ;;
>>    *)
>>      exit 1
>>      ;;
>> esac
>> ```
>>
>> Save the above script, then run the test command followed by the touch
>> command. Then we can use the following perf command to view the hotspots:
>>
>> perf top -U -F 999
>>
>> 1) Before applying this patchset:
>>
>>    32.31%  [kernel]           [k] down_read_trylock
>>    19.40%  [kernel]           [k] pv_native_safe_halt
>>    16.24%  [kernel]           [k] up_read
>>    15.70%  [kernel]           [k] shrink_slab
>>     4.69%  [kernel]           [k] _find_next_bit
>>     2.62%  [kernel]           [k] shrink_node
>>     1.78%  [kernel]           [k] shrink_lruvec
>>     0.76%  [kernel]           [k] do_shrink_slab
>>
>> 2) After applying this patchset:
>>
>>    27.83%  [kernel]           [k] _find_next_bit
>>    16.97%  [kernel]           [k] shrink_slab
>>    15.82%  [kernel]           [k] pv_native_safe_halt
>>     9.58%  [kernel]           [k] shrink_node
>>     8.31%  [kernel]           [k] shrink_lruvec
>>     5.64%  [kernel]           [k] do_shrink_slab
>>     3.88%  [kernel]           [k] mem_cgroup_iter
>>
>> At the same time, we use the following perf command to capture
>> IPC information:
>>
>> perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10
>>
>> 1) Before applying this patchset:
>>
>>   Performance counter stats for 'system wide' (5 runs):
>>
>>        454187219766      cycles                    test                    ( +-  1.84% )
>>         78896433101      instructions              test #    0.17  insn per cycle           ( +-  0.44% )
>>
>>          10.0020430 +- 0.0000366 seconds time elapsed  ( +-  0.00% )
>>
>> 2) After applying this patchset:
>>
>>   Performance counter stats for 'system wide' (5 runs):
>>
>>        841954709443      cycles                    test                    ( +- 15.80% )  (98.69%)
>>        527258677936      instructions              test #    0.63  insn per cycle           ( +- 15.11% )  (98.68%)
>>
>>            10.01064 +- 0.00831 seconds time elapsed  ( +-  0.08% )
>>
>> We can see that the IPC drops severely when down_read_trylock() is
>> called at high frequency. After switching to SRCU, the IPC returns to
>> a normal level.
> 
> The interpretation looks somewhat weird to me. I'd say the workload is
> stalled a lot as it fails the trylock (there might be some optimistic
> spinning, perhaps) and then goes to sleep. See how "pv_native_safe_halt" is
> also more prominent in the "before" profile. And because of that sleeping,
> fewer instructions are executed in the same number of cycles (as it's a
> system-wide collection, otherwise it wouldn't be counting the sleeping
> processes).

But in my tests the trylock basically never failed, so I think the IPC
drop is caused by the high-frequency atomic operations themselves rather
than by sleeping after failed trylocks:

bpftrace -e 'kr:down_read_trylock {@[kstack, retval]=count();} interval:s:1 {exit();}'

Attaching 2 probes...

<...>
@[
     shrink_slab+288
     shrink_node+640
     do_try_to_free_pages+203
     try_to_free_mem_cgroup_pages+266
     try_charge_memcg+412
     charge_memcg+51
     __mem_cgroup_charge+44
     __handle_mm_fault+2119
     handle_mm_fault+272
     do_user_addr_fault+712
     exc_page_fault+124
     asm_exc_page_fault+38
     clear_user_erms+14
     read_zero+86
     vfs_read+173
     ksys_read+93
     do_syscall_64+56
     entry_SYSCALL_64_after_hwframe+99
, 1]: 617019
@[
     shrink_slab+288
     shrink_node+640
     do_try_to_free_pages+203
     try_to_free_mem_cgroup_pages+266
     try_charge_memcg+412
     charge_memcg+51
     __mem_cgroup_charge+44
     shmem_add_to_page_cache+545
     shmem_get_folio_gfp+621
     shmem_write_begin+95
     generic_perform_write+257
     __generic_file_write_iter+202
     generic_file_write_iter+97
     vfs_write+704
     ksys_write+93
     do_syscall_64+56
     entry_SYSCALL_64_after_hwframe+99
, 1]: 617065
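
To illustrate the point, here is a minimal sketch of the read-side change
(illustrative only, not the actual patch; walk_shrinkers(), the
shrinker_srcu name and the exact list walk are assumptions based on this
series, while shrinker_list and do_shrink_slab() refer to the existing
ones in mm/vmscan.c):

#include <linux/shrinker.h>
#include <linux/rculist.h>
#include <linux/srcu.h>

/* Illustrative sketch, simplified from the shrink_slab() paths. */
DEFINE_SRCU(shrinker_srcu);

static unsigned long walk_shrinkers(struct shrink_control *sc, int priority)
{
        struct shrinker *shrinker;
        unsigned long freed = 0;
        int idx;

        /*
         * Old scheme: down_read_trylock(&shrinker_rwsem) almost always
         * succeeds here, but every reclaimer still performs an atomic
         * RMW on the shared rwsem counter (and another in up_read()),
         * so the cache line bounces between CPUs and IPC collapses.
         *
         * New scheme: the SRCU read side only updates per-CPU state,
         * so concurrent reclaimers write no shared cache line.
         */
        idx = srcu_read_lock(&shrinker_srcu);
        list_for_each_entry_srcu(shrinker, &shrinker_list, list,
                                 srcu_read_lock_held(&shrinker_srcu))
                freed += do_shrink_slab(sc, shrinker, priority);
        srcu_read_unlock(&shrinker_srcu, idx);

        return freed;
}

The write side must then wait for an SRCU grace period
(synchronize_srcu()) after list_del_rcu() before a shrinker can be freed.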

> 
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> Other than that:
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks.

> 
> A small thing below:
> 
>> ---
>>   mm/vmscan.c | 46 +++++++++++++++++++++++++++-------------------
>>   1 file changed, 27 insertions(+), 19 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 8515ac40bcaf..1de9bc3e5aa2 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -57,6 +57,7 @@
>>   #include <linux/khugepaged.h>
>>   #include <linux/rculist_nulls.h>
>>   #include <linux/random.h>
>> +#include <linux/srcu.h>
> 
> I guess this should have been in patch 2/8 already? It may work accidentaly
> because some other header pulls it transitively...

Yeah, in fact patch 3/8 can also compile successfully without srcu.h,
but it is better to include this header explicitly, so I will add the
include in patch 2/8.
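
For clarity, the end state would presumably be the same one-line change,
just carried by patch 2/8 instead of 3/8 (illustrative only; the context
lines are copied from the hunk quoted above):

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ ... @@
 #include <linux/khugepaged.h>
 #include <linux/rculist_nulls.h>
 #include <linux/random.h>
+#include <linux/srcu.h>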

Thanks,
Qi

> 
> 



Thread overview: 35+ messages
2023-03-07  6:55 [PATCH v4 0/8] make slab shrink lockless Qi Zheng
2023-03-07  6:55 ` [PATCH v4 1/8] mm: vmscan: add a map_nr_max field to shrinker_info Qi Zheng
2023-03-08 14:40   ` Vlastimil Babka
2023-03-08 22:13   ` Kirill Tkhai
2023-03-09  6:33     ` Qi Zheng
2023-03-07  6:55 ` [PATCH v4 2/8] mm: vmscan: make global slab shrink lockless Qi Zheng
2023-03-08 15:02   ` Vlastimil Babka
2023-03-08 22:18   ` Kirill Tkhai
2023-03-07  6:56 ` [PATCH v4 3/8] mm: vmscan: make memcg slab shrink lockless Qi Zheng
2023-03-08 22:23   ` Kirill Tkhai
2023-03-08 22:46   ` Vlastimil Babka
2023-03-09  6:47     ` Qi Zheng [this message]
2023-03-07  6:56 ` [PATCH v4 4/8] mm: vmscan: add shrinker_srcu_generation Qi Zheng
2023-03-09  9:23   ` Vlastimil Babka
2023-03-09 10:12     ` Qi Zheng
2023-03-07  6:56 ` [PATCH v4 5/8] mm: shrinkers: make count and scan in shrinker debugfs lockless Qi Zheng
2023-03-09  9:36   ` Vlastimil Babka
2023-03-09  9:39   ` Vlastimil Babka
2023-03-09 10:14     ` Qi Zheng
2023-03-09 19:30   ` Kirill Tkhai
2023-03-07  6:56 ` [PATCH v4 6/8] mm: vmscan: hold write lock to reparent shrinker nr_deferred Qi Zheng
2023-03-09  9:36   ` Vlastimil Babka
2023-03-07  6:56 ` [PATCH v4 7/8] mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers() Qi Zheng
2023-03-08 22:39   ` Kirill Tkhai
2023-03-09  7:06     ` Qi Zheng
2023-03-09  8:11       ` Christian König
2023-03-09  8:32         ` Qi Zheng
2023-03-09 19:34           ` Kirill Tkhai
2023-03-09  9:40   ` Vlastimil Babka
2023-03-09 19:34   ` Kirill Tkhai
2023-03-07  6:56 ` [PATCH v4 8/8] mm: shrinkers: convert shrinker_rwsem to mutex Qi Zheng
2023-03-09  9:42   ` Vlastimil Babka
2023-03-09 19:49   ` Kirill Tkhai
2023-03-07 22:20 ` [PATCH v4 0/8] make slab shrink lockless Andrew Morton
2023-03-08 11:59   ` Qi Zheng
