From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <733af312-fb2d-3ec4-54c8-f154447c2051@bytedance.com>
Date: Wed, 5 Jul 2023 11:27:28 +0800
Subject: Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless
From: Qi Zheng <zhengqi.arch@bytedance.com>
To: paulmck@kernel.org, Dave Chinner
Cc: Vlastimil Babka, akpm@linux-foundation.org, tkhai@ya.ru,
 roman.gushchin@linux.dev, djwong@kernel.org, brauner@kernel.org,
 tytso@mit.edu, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
 linux-arm-msm@vger.kernel.org, dm-devel@redhat.com,
 linux-raid@vger.kernel.org, linux-bcache@vger.kernel.org,
 virtualization@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
 linux-ext4@vger.kernel.org, linux-nfs@vger.kernel.org,
 linux-xfs@vger.kernel.org, linux-btrfs@vger.kernel.org
References: <20230622085335.77010-1-zhengqi.arch@bytedance.com>
 <20230622085335.77010-25-zhengqi.arch@bytedance.com>
 <3efa68e0-b04f-5c11-4fe2-2db0784064fc@bytedance.com>
In-Reply-To: <3efa68e0-b04f-5c11-4fe2-2db0784064fc@bytedance.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
On 2023/7/4 11:45, Qi Zheng wrote:
> 
> 
> On 2023/7/4 00:39, Paul E. McKenney wrote:
>> On Fri, Jun 23, 2023 at 04:29:39PM +1000, Dave Chinner wrote:
>>> On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:
>>>> On 6/22/23 10:53, Qi Zheng wrote:
>>>>> @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>       if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>>>>           return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>>> -    if (!down_read_trylock(&shrinker_rwsem))
>>>>> -        goto out;
>>>>> -
>>>>> -    list_for_each_entry(shrinker, &shrinker_list, list) {
>>>>> +    rcu_read_lock();
>>>>> +    list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>>>>>           struct shrink_control sc = {
>>>>>               .gfp_mask = gfp_mask,
>>>>>               .nid = nid,
>>>>>               .memcg = memcg,
>>>>>           };
>>>>> +        if (!shrinker_try_get(shrinker))
>>>>> +            continue;
>>>>> +        rcu_read_unlock();
>>>>
>>>> I don't think you can do this unlock?
>>
>> Sorry to be slow to respond here, this one fell through the cracks.
>> And thank you to Qi for reminding me!
>>
>> If you do this unlock, you had jolly well better nail down the current
>> element (the one referenced by shrinker), for example, by acquiring an
>> explicit reference count on the object.  And presumably this is exactly
>> what shrinker_try_get() is doing.  And a look at your 24/29 confirms this,
>> at least assuming that shrinker->refcount is set to zero before the call
>> to synchronize_rcu() in free_module() *and* that synchronize_rcu() doesn't
>> start until *after* shrinker_put() calls complete().  Plus, as always,
>> the object must be removed from the list before the synchronize_rcu()
>> starts.  (On these parts of the puzzle, I defer to those more familiar
>> with this code path.
>> And I strongly suggest carefully commenting this
>> type of action-at-a-distance design pattern.)
>
> Yeah, I think I've done it like above. A more detailed timing diagram is
> below.
>
>>
>> Why is this important?  Because otherwise that object might be freed
>> before you get to the call to rcu_read_lock() at the end of this loop.
>> And if that happens, list_for_each_entry_rcu() will be walking the
>> freelist, which is quite bad for the health and well-being of your
>> kernel.
>>
>> There are a few other ways to make this sort of thing work:
>>
>> 1.	Defer the shrinker_put() to the beginning of the loop.
>>	You would need a flag initially set to zero, and then set to
>>	one just before (or just after) the rcu_read_lock() above.
>>	You would also need another shrinker_old pointer to track the
>>	old pointer.  Then at the top of the loop, if the flag is set,
>>	invoke shrinker_put() on shrinker_old.  This ensures that the
>>	previous shrinker structure stays around long enough to allow
>>	the loop to find the next shrinker structure in the list.
>>
>>	This approach is attractive when the removal code path
>>	can invoke shrinker_put() after the grace period ends.
>>
>> 2.	Make shrinker_put() invoke call_rcu() when ->refcount reaches
>>	zero, and have the callback function free the object.  This of
>>	course requires adding an rcu_head structure to the shrinker
>>	structure, which might or might not be a reasonable course of
>>	action.  If adding that rcu_head is reasonable, this simplifies
>>	the logic quite a bit.
>>
>> 3.	For the shrinker-structure-removal code path, remove the shrinker
>>	structure, then remove the initial count from ->refcount,
>>	and then keep doing grace periods until ->refcount is zero,
>>	then do one more.  Of course, if the result of removing the
>>	initial count was zero, then only a single additional grace
>>	period is required.
>>
>>	This would need to be carefully commented, as it is a bit
>>	unconventional.
>
> Thanks for such a detailed addition!
>
>>
>> There are probably many other ways, but just to give an idea of a few
>> other ways to do this.
>>
>>>>> +
>>>>>           ret = do_shrink_slab(&sc, shrinker, priority);
>>>>>           if (ret == SHRINK_EMPTY)
>>>>>               ret = 0;
>>>>>           freed += ret;
>>>>> -        /*
>>>>> -         * Bail out if someone want to register a new shrinker to
>>>>> -         * prevent the registration from being stalled for long periods
>>>>> -         * by parallel ongoing shrinking.
>>>>> -         */
>>>>> -        if (rwsem_is_contended(&shrinker_rwsem)) {
>>>>> -            freed = freed ? : 1;
>>>>> -            break;
>>>>> -        }
>>>>> -    }
>>>>> -    up_read(&shrinker_rwsem);
>>>>> -out:
>>>>> +        rcu_read_lock();
>>>>
>>>> That new rcu_read_lock() won't help AFAIK, the whole
>>>> list_for_each_entry_rcu() needs to be under the single
>>>> rcu_read_lock() to be safe.
>>>
>>> Yeah, that's the pattern we've been taught and the one we can look
>>> at and immediately say "this is safe".
>>>
>>> This is a different pattern, as has been explained by Qi, and I
>>> think it *might* be safe.
>>>
>>> *However.*
>>>
>>> Right now I don't have time to go through a novel RCU list iteration
>>> pattern one step at a time to determine the correctness of the
>>> algorithm. I'm mostly worried about list manipulations that can
>>> occur outside the rcu_read_lock() section bleeding into the RCU
>>> critical section, because rcu_read_lock() by itself is not a memory
>>> barrier.
>>>
>>> Maybe Paul has seen this pattern often enough he could simply tell
>>> us what conditions it is safe in. But for me to work that out from
>>> first principles? I just don't have the time to do that right now.
>>
>> If the code does just the right sequence of things on the removal path
>> (remove, decrement reference, wait for reference to go to zero, wait for
>> grace period, free), then it would work.  If this is what is happening,
>> I would argue for more comments.  ;-)
>
> The order of the removal path is slightly different from this:
>
>     shrink_slab                 unregister_shrinker
>     ===========                 ===================
>
>     shrinker_try_get()
>     rcu_read_unlock()
>                                 1. decrement initial reference
>                                    shrinker_put()
>                                 2. wait for reference to go to zero
>                                    wait_for_completion()
>     rcu_read_lock()
>
>     shrinker_put()
>                                 3. remove the shrinker from list
>                                    list_del_rcu()
>                                 4. wait for grace period
>                                    kfree_rcu()/synchronize_rcu()
>
>     list_for_each_entry()
>
>     shrinker_try_get()
>     rcu_read_unlock()
>                                 5. free the shrinker
>
> So the order is: decrement reference, wait for reference to go to zero,
> remove, wait for grace period, free.
>
> I think this can work. And we can only do the *step 3* after we hold the
> RCU read lock again, right? Please let me know if I missed something.

Oh, you are right. It would be better to move step 3 before step 1. We
should first remove the shrinker from the shrinker_list to prevent other
traversers from finding it again, otherwise the following situation may
occur theoretically:

    CPU 0                       CPU 1

    shrinker_try_get()

                                shrinker_try_get()

    shrinker_put()
    shrinker_try_get()
                                shrinker_put()

Thanks,
Qi

>
> Thanks,
> Qi
>
>>
>>                             Thanx, Paul
>>
>>>> IIUC this is why Dave in [4] suggests unifying shrink_slab() with
>>>> shrink_slab_memcg(), as the latter doesn't iterate the list but uses
>>>> IDR.
>>>
>>> Yes, I suggested the IDR route because radix tree lookups under RCU
>>> with reference counted objects are a known safe pattern that we can
>>> easily confirm is correct or not.  Hence I suggested the unification
>>> + IDR route because it makes the life of reviewers so, so much
>>> easier...
>>>
>>> Cheers,
>>>
>>> Dave.
>>> --
>>> Dave Chinner
>>> david@fromorbit.com