From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <733af312-fb2d-3ec4-54c8-f154447c2051@bytedance.com>
Date: Wed, 5 Jul 2023 11:27:28 +0800
Subject: Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless
From: Qi Zheng <zhengqi.arch@bytedance.com>
To: paulmck@kernel.org, Dave Chinner
Cc: Vlastimil Babka, akpm@linux-foundation.org, tkhai@ya.ru,
 roman.gushchin@linux.dev, djwong@kernel.org, brauner@kernel.org,
 tytso@mit.edu, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
 linux-arm-msm@vger.kernel.org, dm-devel@redhat.com,
 linux-raid@vger.kernel.org, linux-bcache@vger.kernel.org,
 virtualization@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
 linux-ext4@vger.kernel.org, linux-nfs@vger.kernel.org,
 linux-xfs@vger.kernel.org, linux-btrfs@vger.kernel.org
References: <20230622085335.77010-1-zhengqi.arch@bytedance.com>
 <20230622085335.77010-25-zhengqi.arch@bytedance.com>
 <3efa68e0-b04f-5c11-4fe2-2db0784064fc@bytedance.com>
In-Reply-To: <3efa68e0-b04f-5c11-4fe2-2db0784064fc@bytedance.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
On 2023/7/4 11:45, Qi Zheng wrote:
> 
> 
> On 2023/7/4 00:39, Paul E. McKenney wrote:
>> On Fri, Jun 23, 2023 at 04:29:39PM +1000, Dave Chinner wrote:
>>> On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:
>>>> On 6/22/23 10:53, Qi Zheng wrote:
>>>>> @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>       if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>>>>           return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>>> -    if (!down_read_trylock(&shrinker_rwsem))
>>>>> -        goto out;
>>>>> -
>>>>> -    list_for_each_entry(shrinker, &shrinker_list, list) {
>>>>> +    rcu_read_lock();
>>>>> +    list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>>>>>           struct shrink_control sc = {
>>>>>               .gfp_mask = gfp_mask,
>>>>>               .nid = nid,
>>>>>               .memcg = memcg,
>>>>>           };
>>>>> +        if (!shrinker_try_get(shrinker))
>>>>> +            continue;
>>>>> +        rcu_read_unlock();
>>>>
>>>> I don't think you can do this unlock?
>>
>> Sorry to be slow to respond here, this one fell through the cracks.
>> And thank you to Qi for reminding me!
>>
>> If you do this unlock, you had jolly well better nail down the current
>> element (the one referenced by shrinker), for example, by acquiring an
>> explicit reference count on the object.  And presumably this is exactly
>> what shrinker_try_get() is doing.  And a look at your 24/29 confirms this,
>> at least assuming that shrinker->refcount is set to zero before the call
>> to synchronize_rcu() in free_module() *and* that synchronize_rcu() doesn't
>> start until *after* shrinker_put() calls complete().  Plus, as always,
>> the object must be removed from the list before the synchronize_rcu()
>> starts.  (On these parts of the puzzle, I defer to those more familiar
>> with this code path.
>> And I strongly suggest carefully commenting this
>> type of action-at-a-distance design pattern.)
>
> Yeah, I think I've done it like above. A more detailed timing diagram is
> below.
>
>>
>> Why is this important?  Because otherwise that object might be freed
>> before you get to the call to rcu_read_lock() at the end of this loop.
>> And if that happens, list_for_each_entry_rcu() will be walking the
>> freelist, which is quite bad for the health and well-being of your
>> kernel.
>>
>> There are a few other ways to make this sort of thing work:
>>
>> 1.	Defer the shrinker_put() to the beginning of the loop.
>>	You would need a flag initially set to zero, and then set to
>>	one just before (or just after) the rcu_read_lock() above.
>>	You would also need another shrinker_old pointer to track the
>>	old pointer.  Then at the top of the loop, if the flag is set,
>>	invoke shrinker_put() on shrinker_old.  This ensures that the
>>	previous shrinker structure stays around long enough to allow
>>	the loop to find the next shrinker structure in the list.
>>
>>	This approach is attractive when the removal code path
>>	can invoke shrinker_put() after the grace period ends.
>>
>> 2.	Make shrinker_put() invoke call_rcu() when ->refcount reaches
>>	zero, and have the callback function free the object.  This of
>>	course requires adding an rcu_head structure to the shrinker
>>	structure, which might or might not be a reasonable course of
>>	action.  If adding that rcu_head is reasonable, this simplifies
>>	the logic quite a bit.
>>
>> 3.	For the shrinker-structure-removal code path, remove the shrinker
>>	structure, then remove the initial count from ->refcount,
>>	and then keep doing grace periods until ->refcount is zero,
>>	then do one more.  Of course, if the result of removing the
>>	initial count was zero, then only a single additional grace
>>	period is required.
>>
>>	This would need to be carefully commented, as it is a bit
>>	unconventional.
>
> Thanks for such a detailed addition!
>
>>
>> There are probably many other ways, but just to give an idea of a few
>> other ways to do this.
>>
>>>>> +
>>>>>           ret = do_shrink_slab(&sc, shrinker, priority);
>>>>>           if (ret == SHRINK_EMPTY)
>>>>>               ret = 0;
>>>>>           freed += ret;
>>>>> -        /*
>>>>> -         * Bail out if someone want to register a new shrinker to
>>>>> -         * prevent the registration from being stalled for long periods
>>>>> -         * by parallel ongoing shrinking.
>>>>> -         */
>>>>> -        if (rwsem_is_contended(&shrinker_rwsem)) {
>>>>> -            freed = freed ? : 1;
>>>>> -            break;
>>>>> -        }
>>>>> -    }
>>>>> -    up_read(&shrinker_rwsem);
>>>>> -out:
>>>>> +        rcu_read_lock();
>>>>
>>>> That new rcu_read_lock() won't help AFAIK, the whole
>>>> list_for_each_entry_rcu() needs to be under the single
>>>> rcu_read_lock() to be safe.
>>>
>>> Yeah, that's the pattern we've been taught and the one we can look
>>> at and immediately say "this is safe".
>>>
>>> This is a different pattern, as has been explained by Qi, and I
>>> think it *might* be safe.
>>>
>>> *However.*
>>>
>>> Right now I don't have time to go through a novel RCU list iteration
>>> pattern one step at a time to determine the correctness of the
>>> algorithm. I'm mostly worried about list manipulations that can
>>> occur outside the rcu_read_lock() section bleeding into the RCU
>>> critical section, because rcu_read_lock() by itself is not a memory
>>> barrier.
>>>
>>> Maybe Paul has seen this pattern often enough he could simply tell
>>> us what conditions it is safe in. But for me to work that out from
>>> first principles? I just don't have the time to do that right now.
>>
>> If the code does just the right sequence of things on the removal path
>> (remove, decrement reference, wait for reference to go to zero, wait for
>> grace period, free), then it would work.  If this is what is happening,
>> I would argue for more comments.  ;-)
>
> The order of the removal path is slightly different from this:
>
>     shrink_slab                 unregister_shrinker
>     ===========                 ===================
>
>     shrinker_try_get()
>     rcu_read_unlock()
>                                 1. decrement initial reference
>                                    shrinker_put()
>                                 2. wait for reference to go to zero
>                                    wait_for_completion()
>     rcu_read_lock()
>
>     shrinker_put()
>                                 3. remove the shrinker from list
>                                    list_del_rcu()
>                                 4. wait for grace period
>                                    kfree_rcu()/synchronize_rcu()
>
>     list_for_each_entry()
>
>     shrinker_try_get()
>     rcu_read_unlock()
>                                 5. free the shrinker
>
> So the order is: decrement reference, wait for reference to go to zero,
> remove, wait for grace period, free.
>
> I think this can work. And we can only do the *step 3* after we hold the
> RCU read lock again, right? Please let me know if I missed something.

Oh, you are right. It would be better to move step 3 before step 1. We
should first remove the shrinker from the shrinker_list to prevent other
traversers from finding it again, otherwise the following situation may
occur theoretically:

    CPU 0                       CPU 1

    shrinker_try_get()

                                shrinker_try_get()

    shrinker_put()
    shrinker_try_get()
                                shrinker_put()

Thanks,
Qi

>
> Thanks,
> Qi
>
>>
>>                             Thanx, Paul
>>
>>>> IIUC this is why Dave in [4] suggests unifying shrink_slab() with
>>>> shrink_slab_memcg(), as the latter doesn't iterate the list but uses
>>>> IDR.
>>>
>>> Yes, I suggested the IDR route because radix tree lookups under RCU
>>> with reference counted objects are a known safe pattern that we can
>>> easily confirm is correct or not.  Hence I suggested the unification
>>> + IDR route because it makes the life of reviewers so, so much
>>> easier...
>>>
>>> Cheers,
>>>
>>> Dave.
>>> --
>>> Dave Chinner
>>> david@fromorbit.com