From: Vlastimil Babka <vbabka@suse.cz>
To: Qi Zheng <zhengqi.arch@bytedance.com>,
akpm@linux-foundation.org, david@fromorbit.com, tkhai@ya.ru,
roman.gushchin@linux.dev, djwong@kernel.org, brauner@kernel.org,
paulmck@kernel.org, tytso@mit.edu
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
linux-arm-msm@vger.kernel.org, dm-devel@redhat.com,
linux-raid@vger.kernel.org, linux-bcache@vger.kernel.org,
virtualization@lists.linux-foundation.org,
linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org,
linux-nfs@vger.kernel.org, linux-xfs@vger.kernel.org,
linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless
Date: Thu, 22 Jun 2023 17:12:02 +0200 [thread overview]
Message-ID: <cf0d9b12-6491-bf23-b464-9d01e5781203@suse.cz> (raw)
In-Reply-To: <20230622085335.77010-25-zhengqi.arch@bytedance.com>
On 6/22/23 10:53, Qi Zheng wrote:
> The shrinker_rwsem is a global read-write lock in
> shrinkers subsystem, which protects most operations
> such as slab shrink, registration and unregistration
> of shrinkers, etc. This can easily cause problems in
> the following cases.
>
> 1) When the memory pressure is high and there are many
> filesystems mounted or unmounted at the same time,
> slab shrink will be affected (down_read_trylock()
> failed).
>
> Such as the real workload mentioned by Kirill Tkhai:
>
> ```
> One of the real workloads from my experience is start
> of an overcommitted node containing many starting
> containers after node crash (or many resuming containers
> after reboot for kernel update). In these cases memory
> pressure is huge, and the node goes round in long reclaim.
> ```
>
> 2) If a shrinker is blocked (such as the case mentioned
> in [1]) and a writer comes in (such as mount a fs),
> then this writer will be blocked and cause all
> subsequent shrinker-related operations to be blocked.
>
> Even if there is no competitor when shrinking slab, there
> may still be a problem. If we have a long shrinker list
> and we do not reclaim enough memory with each shrinker,
> then the down_read_trylock() may be called with high
> frequency. Because of the poor multicore scalability of
> atomic operations, this can lead to a significant drop
> in IPC (instructions per cycle).
>
> We used to implement the lockless slab shrink with
> SRCU [1], but then kernel test robot reported -88.8%
> regression in stress-ng.ramfs.ops_per_sec test case [2],
> so we reverted it [3].
>
> This commit uses the refcount+RCU method [4] proposed by
> by Dave Chinner to re-implement the lockless global slab
> shrink. The memcg slab shrink is handled in the subsequent
> patch.
>
> Currently, the shrinker instances can be divided into
> the following three types:
>
> a) global shrinker instance statically defined in the kernel,
> such as workingset_shadow_shrinker.
>
> b) global shrinker instance statically defined in the kernel
> modules, such as mmu_shrinker in x86.
>
> c) shrinker instance embedded in other structures.
>
> For case a, the memory of shrinker instance is never freed.
> For case b, the memory of shrinker instance will be freed
> after the module is unloaded. But we will call synchronize_rcu()
> in free_module() to wait for RCU read-side critical section to
> exit. For case c, the memory of shrinker instance will be
> dynamically freed by calling kfree_rcu(). So we can use
> rcu_read_{lock,unlock}() to ensure that the shrinker instance
> is valid.
>
> The shrinker::refcount mechanism ensures that the shrinker
> instance will not be run again after unregistration. So the
> structure that records the pointer of shrinker instance can be
> safely freed without waiting for the RCU read-side critical
> section.
>
> In this way, while we implement the lockless slab shrink, we
> don't need to be blocked in unregister_shrinker() to wait
> RCU read-side critical section.
>
> The following are the test results:
>
> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
>
> 1) Before applying this patchset:
>
> setting to a 60 second run per stressor
> dispatching hogs: 9 ramfs
> stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
> (secs) (secs) (secs) (real time) (usr+sys time)
> ramfs 880623 60.02 7.71 226.93 14671.45 3753.09
> ramfs:
> 1 System Management Interrupt
> for a 60.03s run time:
> 5762.40s available CPU time
> 7.71s user time ( 0.13%)
> 226.93s system time ( 3.94%)
> 234.64s total time ( 4.07%)
> load average: 8.54 3.06 2.11
> passed: 9: ramfs (9)
> failed: 0
> skipped: 0
> successful run completed in 60.03s (1 min, 0.03 secs)
>
> 2) After applying this patchset:
>
> setting to a 60 second run per stressor
> dispatching hogs: 9 ramfs
> stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
> (secs) (secs) (secs) (real time) (usr+sys time)
> ramfs 847562 60.02 7.44 230.22 14120.66 3566.23
> ramfs:
> 4 System Management Interrupts
> for a 60.12s run time:
> 5771.95s available CPU time
> 7.44s user time ( 0.13%)
> 230.22s system time ( 3.99%)
> 237.66s total time ( 4.12%)
> load average: 8.18 2.43 0.84
> passed: 9: ramfs (9)
> failed: 0
> skipped: 0
> successful run completed in 60.12s (1 min, 0.12 secs)
>
> We can see that the ops/s has hardly changed.
>
> [1]. https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.arch@bytedance.com/
> [2]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/
> [3]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zheng@linux.dev/
> [4]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
> include/linux/shrinker.h | 6 ++++++
> mm/vmscan.c | 33 ++++++++++++++-------------------
> 2 files changed, 20 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 7bfeb2f25246..b0c6c2df9db8 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -74,6 +74,7 @@ struct shrinker {
>
> refcount_t refcount;
> struct completion completion_wait;
> + struct rcu_head rcu;
>
> void *private_data;
>
> @@ -123,6 +124,11 @@ struct shrinker *shrinker_alloc_and_init(count_objects_cb count,
> void shrinker_free(struct shrinker *shrinker);
> void unregister_and_free_shrinker(struct shrinker *shrinker);
>
> +static inline bool shrinker_try_get(struct shrinker *shrinker)
> +{
> + return refcount_inc_not_zero(&shrinker->refcount);
> +}
> +
> static inline void shrinker_put(struct shrinker *shrinker)
> {
> if (refcount_dec_and_test(&shrinker->refcount))
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6f9c4750effa..767569698946 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -57,6 +57,7 @@
> #include <linux/khugepaged.h>
> #include <linux/rculist_nulls.h>
> #include <linux/random.h>
> +#include <linux/rculist.h>
>
> #include <asm/tlbflush.h>
> #include <asm/div64.h>
> @@ -742,7 +743,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
> down_write(&shrinker_rwsem);
> refcount_set(&shrinker->refcount, 1);
> init_completion(&shrinker->completion_wait);
> - list_add_tail(&shrinker->list, &shrinker_list);
> + list_add_tail_rcu(&shrinker->list, &shrinker_list);
> shrinker->flags |= SHRINKER_REGISTERED;
> shrinker_debugfs_add(shrinker);
> up_write(&shrinker_rwsem);
> @@ -800,7 +801,7 @@ void unregister_shrinker(struct shrinker *shrinker)
> wait_for_completion(&shrinker->completion_wait);
>
> down_write(&shrinker_rwsem);
> - list_del(&shrinker->list);
> + list_del_rcu(&shrinker->list);
> shrinker->flags &= ~SHRINKER_REGISTERED;
> if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> unregister_memcg_shrinker(shrinker);
> @@ -845,7 +846,7 @@ EXPORT_SYMBOL(shrinker_free);
> void unregister_and_free_shrinker(struct shrinker *shrinker)
> {
> unregister_shrinker(shrinker);
> - kfree(shrinker);
> + kfree_rcu(shrinker, rcu);
> }
> EXPORT_SYMBOL(unregister_and_free_shrinker);
>
> @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
> return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>
> - if (!down_read_trylock(&shrinker_rwsem))
> - goto out;
> -
> - list_for_each_entry(shrinker, &shrinker_list, list) {
> + rcu_read_lock();
> + list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
> struct shrink_control sc = {
> .gfp_mask = gfp_mask,
> .nid = nid,
> .memcg = memcg,
> };
>
> + if (!shrinker_try_get(shrinker))
> + continue;
> + rcu_read_unlock();
I don't think you can do this unlock?
> +
> ret = do_shrink_slab(&sc, shrinker, priority);
> if (ret == SHRINK_EMPTY)
> ret = 0;
> freed += ret;
> - /*
> - * Bail out if someone want to register a new shrinker to
> - * prevent the registration from being stalled for long periods
> - * by parallel ongoing shrinking.
> - */
> - if (rwsem_is_contended(&shrinker_rwsem)) {
> - freed = freed ? : 1;
> - break;
> - }
> - }
>
> - up_read(&shrinker_rwsem);
> -out:
> + rcu_read_lock();
That new rcu_read_lock() won't help AFAIK, the whole
list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be
safe.
IIUC this is why Dave in [4] suggests unifying shrink_slab() with
shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR.
> + shrinker_put(shrinker);
> + }
> + rcu_read_unlock();
> cond_resched();
> return freed;
> }
next prev parent reply other threads:[~2023-06-22 15:12 UTC|newest]
Thread overview: 56+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-06-22 8:53 [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
2023-06-22 8:53 ` [PATCH 01/29] mm: shrinker: add shrinker::private_data field Qi Zheng
2023-06-22 14:47 ` Vlastimil Babka
2023-06-23 12:50 ` [External] " Qi Zheng
2023-06-22 8:53 ` [PATCH 02/29] mm: vmscan: introduce some helpers for dynamically allocating shrinker Qi Zheng
2023-06-23 6:12 ` Dave Chinner
2023-06-23 12:49 ` Qi Zheng
2023-06-22 8:53 ` [PATCH 03/29] drm/i915: dynamically allocate the i915_gem_mm shrinker Qi Zheng
2023-06-22 8:53 ` [PATCH 04/29] drm/msm: dynamically allocate the drm-msm_gem shrinker Qi Zheng
2023-06-22 8:53 ` [PATCH 05/29] drm/panfrost: dynamically allocate the drm-panfrost shrinker Qi Zheng
2023-06-23 13:33 ` Qi Zheng
2023-06-23 14:18 ` Bobs_Email
2023-06-22 8:53 ` [PATCH 06/29] dm: dynamically allocate the dm-bufio shrinker Qi Zheng
2023-06-22 8:53 ` [PATCH 07/29] dm zoned: dynamically allocate the dm-zoned-meta shrinker Qi Zheng
2023-06-22 8:53 ` [PATCH 08/29] md/raid5: dynamically allocate the md-raid5 shrinker Qi Zheng
2023-06-22 8:53 ` [PATCH 09/29] bcache: dynamically allocate the md-bcache shrinker Qi Zheng
2023-06-22 8:53 ` [PATCH 10/29] vmw_balloon: dynamically allocate the vmw-balloon shrinker Qi Zheng
2023-06-22 8:53 ` [PATCH 11/29] virtio_balloon: dynamically allocate the virtio-balloon shrinker Qi Zheng
2023-06-22 8:53 ` [PATCH 12/29] mbcache: dynamically allocate the mbcache shrinker Qi Zheng
2023-06-22 8:53 ` [PATCH 13/29] ext4: dynamically allocate the ext4-es shrinker Qi Zheng
2023-06-22 8:53 ` [PATCH 14/29] jbd2,ext4: dynamically allocate the jbd2-journal shrinker Qi Zheng
2023-06-22 8:53 ` [PATCH 15/29] NFSD: dynamically allocate the nfsd-client shrinker Qi Zheng
2023-06-23 21:49 ` Chuck Lever
2023-06-24 11:17 ` Qi Zheng
2023-06-22 8:53 ` [PATCH 16/29] NFSD: dynamically allocate the nfsd-reply shrinker Qi Zheng
2023-06-22 8:53 ` [PATCH 17/29] xfs: dynamically allocate the xfs-buf shrinker Qi Zheng
2023-06-22 8:53 ` [PATCH 18/29] xfs: dynamically allocate the xfs-inodegc shrinker Qi Zheng
2023-06-22 8:53 ` [PATCH 19/29] xfs: dynamically allocate the xfs-qm shrinker Qi Zheng
2023-06-22 8:53 ` [PATCH 20/29] zsmalloc: dynamically allocate the mm-zspool shrinker Qi Zheng
2023-06-22 8:53 ` [PATCH 21/29] fs: super: dynamically allocate the s_shrink Qi Zheng
2023-06-22 8:53 ` [PATCH 22/29] drm/ttm: introduce pool_shrink_rwsem Qi Zheng
2023-06-22 8:53 ` [PATCH 23/29] mm: shrinker: add refcount and completion_wait fields Qi Zheng
2023-06-22 8:53 ` [PATCH 24/29] mm: vmscan: make global slab shrink lockless Qi Zheng
2023-06-22 15:12 ` Vlastimil Babka [this message]
2023-06-22 16:42 ` Qi Zheng
2023-06-22 17:41 ` Alan Huang
2023-06-22 18:18 ` Qi Zheng
2023-06-23 6:29 ` Dave Chinner
2023-06-23 13:10 ` Qi Zheng
2023-06-23 22:19 ` Dave Chinner
2023-06-24 11:08 ` Qi Zheng
2023-06-25 3:15 ` Qi Zheng
2023-07-04 4:20 ` Qi Zheng
2023-07-03 16:39 ` Paul E. McKenney
2023-07-04 3:45 ` Qi Zheng
2023-07-05 3:27 ` Qi Zheng
2023-06-22 8:53 ` [PATCH 25/29] mm: vmscan: make memcg " Qi Zheng
2023-06-22 8:53 ` [PATCH 26/29] mm: shrinker: make count and scan in shrinker debugfs lockless Qi Zheng
2023-06-22 8:53 ` [PATCH 27/29] mm: vmscan: hold write lock to reparent shrinker nr_deferred Qi Zheng
2023-06-22 8:53 ` [PATCH 28/29] mm: shrinkers: convert shrinker_rwsem to mutex Qi Zheng
2023-06-22 8:53 ` [PATCH 29/29] mm: shrinker: move shrinker-related code into a separate file Qi Zheng
2023-06-22 14:53 ` Vlastimil Babka
2023-06-23 13:12 ` Qi Zheng
2023-06-23 5:25 ` Sergey Senozhatsky
2023-06-23 13:24 ` Qi Zheng
2023-06-22 9:02 ` [PATCH 00/29] use refcount+RCU method to implement lockless slab shrink Qi Zheng
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=cf0d9b12-6491-bf23-b464-9d01e5781203@suse.cz \
--to=vbabka@suse.cz \
--cc=akpm@linux-foundation.org \
--cc=brauner@kernel.org \
--cc=david@fromorbit.com \
--cc=djwong@kernel.org \
--cc=dm-devel@redhat.com \
--cc=dri-devel@lists.freedesktop.org \
--cc=intel-gfx@lists.freedesktop.org \
--cc=linux-arm-msm@vger.kernel.org \
--cc=linux-bcache@vger.kernel.org \
--cc=linux-btrfs@vger.kernel.org \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-nfs@vger.kernel.org \
--cc=linux-raid@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
--cc=paulmck@kernel.org \
--cc=roman.gushchin@linux.dev \
--cc=tkhai@ya.ru \
--cc=tytso@mit.edu \
--cc=virtualization@lists.linux-foundation.org \
--cc=zhengqi.arch@bytedance.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox