linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Qi Zheng <zhengqi.arch@bytedance.com>
To: paulmck@kernel.org
Cc: akpm@linux-foundation.org, tkhai@ya.ru, hannes@cmpxchg.org,
	shakeelb@google.com, mhocko@kernel.org, roman.gushchin@linux.dev,
	muchun.song@linux.dev, david@redhat.com, shy828301@gmail.com,
	sultan@kerneltoast.com, dave@stgolabs.net,
	penguin-kernel@I-love.SAKURA.ne.jp, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 0/7] make slab shrink lockless
Date: Fri, 24 Feb 2023 12:08:59 +0800	[thread overview]
Message-ID: <f9d0836e-3673-380c-29e2-21b31447207f@bytedance.com> (raw)
In-Reply-To: <20230223181937.GD2948950@paulmck-ThinkPad-P17-Gen-1>



On 2023/2/24 02:19, Paul E. McKenney wrote:
> On Thu, Feb 23, 2023 at 09:27:18PM +0800, Qi Zheng wrote:
>> Hi all,
>>
>> This patch series aims to make slab shrink lockless.
>>
>> 1. Background
>> =============
>>
>> On our servers, we often find the following system cpu hotspots:
>>
>>    44.16%  [kernel]  [k] down_read_trylock
>>    14.12%  [kernel]  [k] up_read
>>    13.43%  [kernel]  [k] shrink_slab
>>     5.25%  [kernel]  [k] count_shadow_nodes
>>     3.42%  [kernel]  [k] idr_find
>>
>> Then we used bpftrace to capture its calltrace as follows:
>>
>> @[
>>      down_read_trylock+5
>>      shrink_slab+292
>>      shrink_node+640
>>      do_try_to_free_pages+211
>>      try_to_free_mem_cgroup_pages+266
>>      try_charge_memcg+386
>>      charge_memcg+51
>>      __mem_cgroup_charge+44
>>      __handle_mm_fault+1416
>>      handle_mm_fault+260
>>      do_user_addr_fault+459
>>      exc_page_fault+104
>>      asm_exc_page_fault+38
>>      clear_user_rep_good+18
>>      read_zero+100
>>      vfs_read+176
>>      ksys_read+93
>>      do_syscall_64+62
>>      entry_SYSCALL_64_after_hwframe+114
>> ]: 1868979
>>
>> It is easy to see that this is caused by the frequent failure to obtain the
>> read lock of shrinker_rwsem when reclaiming slab memory.
>>
>> Currently, the shrinker_rwsem is a global lock. And the following cases may
>> cause the above system cpu hotspots:
>>
>> a. the write lock of shrinker_rwsem was held for too long. For example, there
>>     are many memcgs in the system, which causes some paths to hold locks and
>>     traverse it for too long. (e.g. expand_shrinker_info())
>> b. the read lock of shrinker_rwsem was held for too long, and a writer came at
>>     this time. Then this writer will be forced to wait and block all subsequent
>>     readers.
>>     For example:
>>     - be scheduled when the read lock of shrinker_rwsem is held in
>>       do_shrink_slab()
>>     - some shrinker are blocked for too long. Like the case mentioned in the
>>       patchset[1].
>>
>> [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
>>
>> And all the down_read_trylock() hotspots caused by the above cases can be
>> solved by replacing the shrinker_rwsem trylocks with SRCU.

Hi Paul,

> 
> Glad to see that making SRCU unconditional was helpful!  And I do very
> much like the idea of the shrinker running better!

+1 :)

> 
> The main thing that enabled unconditional SRCU was the code added in
> v5.19 to dynamically allocate SRCU's srcu_node combining tree.  This is
> important for a number of Linux distributions that have NR_CPUS up in the
> thousands, for which this combining tree is quite large.  In v5.19 and
> later, srcu_struct structures without frequent call_srcu() invocations
> never allocate that combining tree.  Even srcu_struct structures that
> have enough call_srcu() activity to cause the lock contention that in
> turn forces the combining tree to be allocated, that combining tree
> is sized for the actual number of CPUs present, which is usually way
> smaller than NR_CPUS.

Thank you very much for such a detailed background introduction. :)

> 
> So if you are going to backport this back past v5.19, you might also
> need those SRCU changes.  Or not, depending on how much memory your
> systems are equipped with.  ;-)

Got it.

Thanks,
Qi

> 
> 							Thanx, Paul
> 
>> 2. Survey
>> =========
>>
>> Before doing the code implementation, I found that there were many similar
>> submissions in the community:
>>
>> a. Davidlohr Bueso submitted a patch in 2015.
>>     Subject: [PATCH -next v2] mm: srcu-ify shrinkers
>>     Link: https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
>>     Result: It was finally merged into the linux-next branch, but failed on arm
>>             allnoconfig (without CONFIG_SRCU)
>>
>> b. Tetsuo Handa submitted a patchset in 2017.
>>     Subject: [PATCH 1/2] mm,vmscan: Kill global shrinker lock.
>>     Link: https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
>>     Result: Finally chose to use the current simple way (break when
>>             rwsem_is_contended()). And Christoph Hellwig suggested to using SRCU,
>>             but SRCU was not unconditionally enabled at the time.
>>
>> c. Kirill Tkhai submitted a patchset in 2018.
>>     Subject: [PATCH RFC 00/10] Introduce lockless shrink_slab()
>>     Link: https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
>>     Result: At that time, SRCU was not unconditionally enabled, and there were
>>             some objections to enabling SRCU. Later, because Kirill's focus was
>>             moved to other things, this patchset was not continued to be updated.
>>
>> d. Sultan Alsawaf submitted a patch in 2021.
>>     Subject: [PATCH] mm: vmscan: Replace shrinker_rwsem trylocks with SRCU protection
>>     Link: https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
>>     Result: Rejected because SRCU was not unconditionally enabled.
>>
>> We can find that almost all these historical commits were abandoned because SRCU
>> was not unconditionally enabled. But now SRCU has been unconditionally enable
>> by Paul E. McKenney in 2023 [2], so it's time to replace shrinker_rwsem trylocks
>> with SRCU.
>>
>> [2] https://lore.kernel.org/lkml/20230105003759.GA1769545@paulmck-ThinkPad-P17-Gen-1/
>>
>> 3. Reproduction and testing
>> ===========================
>>
>> We can reproduce the down_read_trylock() hotspot through the following script:
>>
>> ```
>> #!/bin/bash
>> DIR="/root/shrinker/memcg/mnt"
>>
>> do_create()
>> {
>>          mkdir /sys/fs/cgroup/memory/test
>>          echo 200M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
>>          for i in `seq 0 $1`;
>>          do
>>                  mkdir /sys/fs/cgroup/memory/test/$i;
>>                  echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
>>                  mkdir -p $DIR/$i;
>>          done
>> }
>>
>> do_mount()
>> {
>>          for i in `seq $1 $2`;
>>          do
>>                  mount -t tmpfs $i $DIR/$i;
>>          done
>> }
>>
>> do_touch()
>> {
>>          for i in `seq $1 $2`;
>>          do
>>                  echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
>>                  dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
>>          done
>> }
>>
>> do_create 2000
>> do_mount 0 2000
>> do_touch 0 1000
>> ```
>>
>> Save the above script and execute it, we can get the following perf hotspots:
>>
>>    46.60%  [kernel]  [k] down_read_trylock
>>    18.70%  [kernel]  [k] up_read
>>    15.44%  [kernel]  [k] shrink_slab
>>     4.37%  [kernel]  [k] _find_next_bit
>>     2.75%  [kernel]  [k] xa_load
>>     2.07%  [kernel]  [k] idr_find
>>     1.73%  [kernel]  [k] do_shrink_slab
>>     1.42%  [kernel]  [k] shrink_lruvec
>>     0.74%  [kernel]  [k] shrink_node
>>     0.60%  [kernel]  [k] list_lru_count_one
>>
>> After applying this patchset, the hotspot becomes as follows:
>>
>>    19.53%  [kernel]  [k] _find_next_bit
>>    14.63%  [kernel]  [k] do_shrink_slab
>>    14.58%  [kernel]  [k] shrink_slab
>>    11.83%  [kernel]  [k] shrink_lruvec
>>     9.33%  [kernel]  [k] __blk_flush_plug
>>     6.67%  [kernel]  [k] mem_cgroup_iter
>>     3.73%  [kernel]  [k] list_lru_count_one
>>     2.43%  [kernel]  [k] shrink_node
>>     1.96%  [kernel]  [k] super_cache_count
>>     1.78%  [kernel]  [k] __rcu_read_unlock
>>     1.38%  [kernel]  [k] __srcu_read_lock
>>     1.30%  [kernel]  [k] xas_descend
>>
>> We can see that the slab reclaim is no longer blocked by shinker_rwsem trylock,
>> which realizes the lockless slab reclaim.
>>
>> This series is based on next-20230217.
>>
>> Comments and suggestions are welcome.
>>
>> Thanks,
>> Qi.
>>
>> Changelog in v1 -> v2:
>>   - add a map_nr_max field to shrinker_info (suggested by Kirill)
>>   - use shrinker_mutex in reparent_shrinker_deferred() (pointed by Kirill)
>>
>> Qi Zheng (7):
>>    mm: vmscan: add a map_nr_max field to shrinker_info
>>    mm: vmscan: make global slab shrink lockless
>>    mm: vmscan: make memcg slab shrink lockless
>>    mm: shrinkers: make count and scan in shrinker debugfs lockless
>>    mm: vmscan: hold write lock to reparent shrinker nr_deferred
>>    mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()
>>    mm: shrinkers: convert shrinker_rwsem to mutex
>>
>>   drivers/md/dm-cache-metadata.c |   2 +-
>>   drivers/md/dm-thin-metadata.c  |   2 +-
>>   fs/super.c                     |   2 +-
>>   include/linux/memcontrol.h     |   1 +
>>   mm/shrinker_debug.c            |  38 ++++-----
>>   mm/vmscan.c                    | 142 +++++++++++++++++----------------
>>   6 files changed, 92 insertions(+), 95 deletions(-)
>>
>> -- 
>> 2.20.1
>>

-- 
Thanks,
Qi


      reply	other threads:[~2023-02-24  4:09 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-23 13:27 Qi Zheng
2023-02-23 13:27 ` [PATCH v2 1/7] mm: vmscan: add a map_nr_max field to shrinker_info Qi Zheng
2023-02-25  8:18   ` Qi Zheng
2023-02-25 15:14     ` Kirill Tkhai
2023-02-25 15:52       ` Qi Zheng
2023-02-26 13:54       ` Qi Zheng
2023-02-23 13:27 ` [PATCH v2 2/7] mm: vmscan: make global slab shrink lockless Qi Zheng
2023-02-23 15:26   ` Rafael Aquini
2023-02-23 15:37     ` Rafael Aquini
2023-02-24  4:09       ` Qi Zheng
2023-02-23 18:24   ` Sultan Alsawaf
2023-02-23 18:39     ` Paul E. McKenney
2023-02-23 19:18       ` Sultan Alsawaf
2023-02-24  4:00     ` Qi Zheng
2023-02-24  4:16       ` Qi Zheng
2023-02-24  8:20       ` Sultan Alsawaf
2023-02-24 10:12         ` Qi Zheng
2023-02-24 21:02       ` Kirill Tkhai
2023-02-24 21:14         ` Kirill Tkhai
2023-02-25  8:08           ` Qi Zheng
2023-02-25 15:30             ` Kirill Tkhai
2023-02-25 15:57               ` Qi Zheng
2023-02-25 16:17                 ` Kirill Tkhai
2023-02-25 16:37                   ` Qi Zheng
2023-02-25 21:28                     ` Kirill Tkhai
2023-02-26 13:56                       ` Qi Zheng
2023-02-23 13:27 ` [PATCH v2 3/7] mm: vmscan: make memcg " Qi Zheng
2023-02-23 13:27 ` [PATCH v2 4/7] mm: shrinkers: make count and scan in shrinker debugfs lockless Qi Zheng
2023-02-23 13:27 ` [PATCH v2 5/7] mm: vmscan: hold write lock to reparent shrinker nr_deferred Qi Zheng
2023-02-23 13:27 ` [PATCH v2 6/7] mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers() Qi Zheng
2023-02-23 13:27 ` [PATCH v2 7/7] mm: shrinkers: convert shrinker_rwsem to mutex Qi Zheng
2023-02-23 18:19 ` [PATCH v2 0/7] make slab shrink lockless Paul E. McKenney
2023-02-24  4:08   ` Qi Zheng [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=f9d0836e-3673-380c-29e2-21b31447207f@bytedance.com \
    --to=zhengqi.arch@bytedance.com \
    --cc=akpm@linux-foundation.org \
    --cc=dave@stgolabs.net \
    --cc=david@redhat.com \
    --cc=hannes@cmpxchg.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=muchun.song@linux.dev \
    --cc=paulmck@kernel.org \
    --cc=penguin-kernel@I-love.SAKURA.ne.jp \
    --cc=roman.gushchin@linux.dev \
    --cc=shakeelb@google.com \
    --cc=shy828301@gmail.com \
    --cc=sultan@kerneltoast.com \
    --cc=tkhai@ya.ru \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox