From: "Paul E. McKenney" <paulmck@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Vlastimil Babka <vbabka@suse.cz>,
Qi Zheng <zhengqi.arch@bytedance.com>,
akpm@linux-foundation.org, tkhai@ya.ru, roman.gushchin@linux.dev,
djwong@kernel.org, brauner@kernel.org, tytso@mit.edu,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
linux-arm-msm@vger.kernel.org, dm-devel@redhat.com,
linux-raid@vger.kernel.org, linux-bcache@vger.kernel.org,
virtualization@lists.linux-foundation.org,
linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org,
linux-nfs@vger.kernel.org, linux-xfs@vger.kernel.org,
linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless
Date: Mon, 3 Jul 2023 09:39:23 -0700
Message-ID: <cc894c77-717a-4e9f-b649-48bab40e7c60@paulmck-laptop>
In-Reply-To: <ZJU708VIyJ/3StAX@dread.disaster.area>
On Fri, Jun 23, 2023 at 04:29:39PM +1000, Dave Chinner wrote:
> On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:
> > On 6/22/23 10:53, Qi Zheng wrote:
> > > @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
> > > if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
> > > return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
> > >
> > > - if (!down_read_trylock(&shrinker_rwsem))
> > > - goto out;
> > > -
> > > - list_for_each_entry(shrinker, &shrinker_list, list) {
> > > + rcu_read_lock();
> > > + list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
> > > struct shrink_control sc = {
> > > .gfp_mask = gfp_mask,
> > > .nid = nid,
> > > .memcg = memcg,
> > > };
> > >
> > > + if (!shrinker_try_get(shrinker))
> > > + continue;
> > > + rcu_read_unlock();
> >
> > I don't think you can do this unlock?
Sorry to be slow to respond here; this one fell through the cracks.
And thank you to Qi for reminding me!
If you do this unlock, you had jolly well better nail down the current
element (the one referenced by shrinker), for example, by acquiring an
explicit reference count on the object. And presumably this is exactly
what shrinker_try_get() is doing. And a look at your 24/29 confirms this,
at least assuming that shrinker->refcount is set to zero before the call
to synchronize_rcu() in free_module() *and* that synchronize_rcu() doesn't
start until *after* shrinker_put() calls complete(). Plus, as always,
the object must be removed from the list before the synchronize_rcu()
starts. (On these parts of the puzzle, I defer to those more familiar
with this code path. And I strongly suggest carefully commenting this
type of action-at-a-distance design pattern.)
Why is this important? Because otherwise that object might be freed
before you get to the call to rcu_read_lock() at the end of this loop.
And if that happens, list_for_each_entry_rcu() will be walking the
freelist, which is quite bad for the health and well-being of your kernel.
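A rough sketch of the pattern being discussed, pieced together from the
hunks quoted above (the exact placement of shrinker_put() at the bottom
of the loop is an assumption on my part, since the quoted hunk is cut
off):

	rcu_read_lock();
	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
		struct shrink_control sc = {
			.gfp_mask = gfp_mask,
			.nid = nid,
			.memcg = memcg,
		};

		if (!shrinker_try_get(shrinker))
			continue;
		/*
		 * Dropping rcu_read_lock() here is safe only because we
		 * now hold a reference, so the removal path cannot free
		 * the shrinker out from under us.
		 */
		rcu_read_unlock();

		freed += do_shrink_slab(&sc, shrinker, priority);

		rcu_read_lock();
		/*
		 * The loop still reads shrinker->list.next to advance,
		 * so the removal path must also wait for a grace period
		 * after the refcount reaches zero before freeing.
		 */
		shrinker_put(shrinker);
	}
	rcu_read_unlock();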
There are a few other ways to make this sort of thing work:
1. Defer the shrinker_put() to the beginning of the loop.
You would need a flag initially set to zero, and then set to
one just before (or just after) the rcu_read_lock() above.
You would also need another shrinker_old pointer to track the
old pointer. Then at the top of the loop, if the flag is set,
invoke shrinker_put() on shrinker_old. This ensures that the
previous shrinker structure stays around long enough to allow
the loop to find the next shrinker structure in the list.
This approach is attractive when the removal code path
can invoke shrinker_put() after the grace period ends.
2. Make shrinker_put() invoke call_rcu() when ->refcount reaches
zero, and have the callback function free the object. This of
course requires adding an rcu_head structure to the shrinker
structure, which might or might not be a reasonable course of
action. If adding that rcu_head is reasonable, this simplifies
the logic quite a bit.
3. For the shrinker-structure-removal code path, remove the shrinker
structure, then remove the initial count from ->refcount,
and then keep doing grace periods until ->refcount is zero,
then do one more. Of course, if the result of removing the
initial count was zero, then only a single additional grace
period is required.
This would need to be carefully commented, as it is a bit
unconventional.
There are probably many other ways, but these should give you an idea of
a few of them. Rough sketches of the first two follow.
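To make #1 above a bit more concrete (a NULL shrinker_old pointer stands
in for the flag; names and details are illustrative only, not a drop-in
patch):

	struct shrinker *shrinker_old = NULL;

	rcu_read_lock();
	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
		/*
		 * Only now, after the list cursor has advanced past it,
		 * is it safe to let go of the previous element.
		 */
		if (shrinker_old) {
			shrinker_put(shrinker_old);
			shrinker_old = NULL;
		}

		if (!shrinker_try_get(shrinker))
			continue;
		rcu_read_unlock();

		freed += do_shrink_slab(&sc, shrinker, priority);

		rcu_read_lock();
		shrinker_old = shrinker;	/* defer the put to the next pass */
	}
	rcu_read_unlock();

	if (shrinker_old)
		shrinker_put(shrinker_old);

And #2 is mostly a question of where the free happens.  Assuming
->refcount is a refcount_t and that an rcu_head has been added to
struct shrinker (the field and callback names below are made up):

	static void shrinker_free_rcu_cb(struct rcu_head *head)
	{
		kfree(container_of(head, struct shrinker, rcu));
	}

	void shrinker_put(struct shrinker *shrinker)
	{
		if (refcount_dec_and_test(&shrinker->refcount))
			call_rcu(&shrinker->rcu, shrinker_free_rcu_cb);
	}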
> > > +
> > > ret = do_shrink_slab(&sc, shrinker, priority);
> > > if (ret == SHRINK_EMPTY)
> > > ret = 0;
> > > freed += ret;
> > > - /*
> > > - * Bail out if someone want to register a new shrinker to
> > > - * prevent the registration from being stalled for long periods
> > > - * by parallel ongoing shrinking.
> > > - */
> > > - if (rwsem_is_contended(&shrinker_rwsem)) {
> > > - freed = freed ? : 1;
> > > - break;
> > > - }
> > > - }
> > >
> > > - up_read(&shrinker_rwsem);
> > > -out:
> > > + rcu_read_lock();
> >
> > That new rcu_read_lock() won't help AFAIK, the whole
> > list_for_each_entry_rcu() needs to be under the single rcu_read_lock() to be
> > safe.
>
> Yeah, that's the pattern we've been taught and the one we can look
> at and immediately say "this is safe".
>
> This is a different pattern, as has been explained by Qi, and I
> think it *might* be safe.
>
> *However.*
>
> Right now I don't have time to go through a novel RCU list iteration
> pattern one step at a time to determine the correctness of the
> algorithm. I'm mostly worried about list manipulations that can
> occur outside rcu_read_lock() section bleeding into the RCU
> critical section because rcu_read_lock() by itself is not a memory
> barrier.
>
> Maybe Paul has seen this pattern often enough he could simply tell
> us what conditions it is safe in. But for me to work that out from
> first principles? I just don't have the time to do that right now.
If the code does just the right sequence of things on the removal path
(remove, decrement reference, wait for reference to go to zero, wait for
grace period, free), then it would work. If this is what is happening,
I would argue for more comments. ;-)
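Something like the following on the removal side, in other words.  This
is only a sketch: the lock shown for the list manipulation, the
completion field, and the final kfree() are stand-ins for whatever the
series actually uses.

	void shrinker_unregister(struct shrinker *shrinker)	/* name is illustrative */
	{
		/* 1. Remove: no new lookups can find the shrinker. */
		down_write(&shrinker_rwsem);
		list_del_rcu(&shrinker->list);
		up_write(&shrinker_rwsem);

		/* 2. Decrement: drop the initial reference. */
		shrinker_put(shrinker);

		/*
		 * 3. Wait for the refcount to reach zero; shrinker_put()
		 *    is assumed to call complete() when it drops the
		 *    last reference.
		 */
		wait_for_completion(&shrinker->completion_wait);

		/*
		 * 4. Wait for a grace period, so any lockless walker
		 *    that found the shrinker before step 1 is done with it.
		 */
		synchronize_rcu();

		/* 5. Only now is it safe to free. */
		kfree(shrinker);
	}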
Thanx, Paul
> > IIUC this is why Dave in [4] suggests unifying shrink_slab() with
> > shrink_slab_memcg(), as the latter doesn't iterate the list but uses IDR.
>
> Yes, I suggested the IDR route because radix tree lookups under RCU
> with reference counted objects are a known safe pattern that we can
> easily confirm is correct or not. Hence I suggested the unification
> + IDR route because it makes the life of reviewers so, so much
> easier...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com