Date: Thu, 27 Jul 2023 11:34:30 +0800
Subject: Re: [PATCH v2 44/47] mm: shrinker: make global slab shrink lockless
To: Dave Chinner
Cc: akpm@linux-foundation.org, tkhai@ya.ru, vbabka@suse.cz,
 roman.gushchin@linux.dev, djwong@kernel.org, brauner@kernel.org,
 paulmck@kernel.org, tytso@mit.edu, steven.price@arm.com, cel@kernel.org,
 senozhatsky@chromium.org, yujie.liu@intel.com, gregkh@linuxfoundation.org,
 muchun.song@linux.dev, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 x86@kernel.org, kvm@vger.kernel.org, xen-devel@lists.xenproject.org,
 linux-erofs@lists.ozlabs.org, linux-f2fs-devel@lists.sourceforge.net,
 cluster-devel@redhat.com, linux-nfs@vger.kernel.org,
 linux-mtd@lists.infradead.org, rcu@vger.kernel.org, netdev@vger.kernel.org,
 dri-devel@lists.freedesktop.org, linux-arm-msm@vger.kernel.org,
 dm-devel@redhat.com, linux-raid@vger.kernel.org,
 linux-bcache@vger.kernel.org, virtualization@lists.linux-foundation.org,
 linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org,
 linux-xfs@vger.kernel.org, linux-btrfs@vger.kernel.org
References: <20230724094354.90817-1-zhengqi.arch@bytedance.com>
 <20230724094354.90817-45-zhengqi.arch@bytedance.com>
 <19ad6d06-8a14-6102-5eae-2134dc2c5061@bytedance.com>
From: Qi Zheng

Hi Dave,

On 2023/7/27 07:09, Dave Chinner wrote:
> On Wed, Jul 26, 2023 at 05:14:09PM +0800, Qi Zheng wrote:
>> On 2023/7/26 16:08, Dave Chinner wrote:
>>> On Mon, Jul 24, 2023 at 05:43:51PM +0800, Qi Zheng wrote:
>>>> @@ -122,6 +126,13 @@ void shrinker_free_non_registered(struct shrinker *shrinker);
>>>>  void shrinker_register(struct shrinker *shrinker);
>>>>  void shrinker_unregister(struct shrinker *shrinker);
>>>> +static inline bool shrinker_try_get(struct shrinker *shrinker)
>>>> +{
>>>> +    return READ_ONCE(shrinker->registered) &&
>>>> +           refcount_inc_not_zero(&shrinker->refcount);
>>>> +}
>>>
>>> Why do we care about shrinker->registered here? If we don't set
>>> the refcount to 1 until we have fully initialised everything, then
>>> the shrinker code can key entirely off the reference count and
>>> none of the lookup code needs to care about whether the shrinker is
>>> registered or not.
>>
>> The purpose of checking shrinker->registered here is to stop running
>> shrinker after calling shrinker_free(), which can prevent the following
>> situations from happening:
>>
>> CPU 0                 CPU 1
>>
>> shrinker_try_get()
>>
>>                       shrinker_try_get()
>>
>> shrinker_put()
>> shrinker_try_get()
>>                       shrinker_put()
>
> I don't see any race here? What is wrong with having multiple active
> users at once?

Maybe I'm overthinking.
What I mean is that with multiple users at once, the interleaving above
may livelock, making shrinker_free() wait for a long time. But the
probability of this should be very low.

>
>>>
>>> This should use a completion, then it is always safe under
>>> rcu_read_lock(). This also gets rid of the shrinker_lock spin lock,
>>> which only exists because we can't take a blocking lock under
>>> rcu_read_lock(). i.e:
>>>
>>>
>>> void shrinker_put(struct shrinker *shrinker)
>>> {
>>>     if (refcount_dec_and_test(&shrinker->refcount))
>>>         complete(&shrinker->done);
>>> }
>>>
>>> void shrinker_free()
>>> {
>>>     .....
>>>     refcount_dec(&shrinker->refcount);
>>
>> I guess what you mean is shrinker_put(), because here may be the last
>> refcount.
>
> Yes, I did.
>
>>>     wait_for_completion(&shrinker->done);
>>>     /*
>>>      * lookups on the shrinker will now all fail as refcount has
>>>      * fallen to zero. We can now remove it from the lists and
>>>      * free it.
>>>      */
>>>     down_write(shrinker_rwsem);
>>>     list_del_rcu(&shrinker->list);
>>>     up_write(&shrinker_rwsem);
>>>     call_rcu(shrinker->rcu_head, shrinker_free_rcu_cb);
>>> }
>>>
>>> ....
>>>
>>>> @@ -686,11 +711,14 @@ EXPORT_SYMBOL(shrinker_free_non_registered);
>>>>  void shrinker_register(struct shrinker *shrinker)
>>>>  {
>>>> -    down_write(&shrinker_rwsem);
>>>> -    list_add_tail(&shrinker->list, &shrinker_list);
>>>> -    shrinker->flags |= SHRINKER_REGISTERED;
>>>> +    refcount_set(&shrinker->refcount, 1);
>>>> +
>>>> +    spin_lock(&shrinker_lock);
>>>> +    list_add_tail_rcu(&shrinker->list, &shrinker_list);
>>>> +    spin_unlock(&shrinker_lock);
>>>> +
>>>>      shrinker_debugfs_add(shrinker);
>>>> -    up_write(&shrinker_rwsem);
>>>> +    WRITE_ONCE(shrinker->registered, true);
>>>>  }
>>>>  EXPORT_SYMBOL(shrinker_register);
>>>
>>> This just looks wrong - you are trying to use WRITE_ONCE() as a
>>> release barrier to indicate that the shrinker is now set up fully.
>>> That's not necessary - the refcount is an atomic and along with the
>>> rcu locks they should provide all the barriers we need. i.e.
>>
>> The reason I used WRITE_ONCE() here is because the shrinker->registered
>> will be read and written concurrently (read in shrinker_try_get() and
>> written in shrinker_free()), which is why I added shrinker::registered
>> field instead of using SHRINKER_REGISTERED flag (this can reduce the
>> addition of WRITE_ONCE()/READ_ONCE()).
>
> Using WRITE_ONCE/READ_ONCE doesn't provide memory barriers needed to
> use the field like this. You need release/acquire memory ordering
> here. i.e. smp_store_release()/smp_load_acquire().
>
> As it is, the refcount_inc_not_zero() provides a control dependency,
> as documented in include/linux/refcount.h, refcount_dec_and_test()
> provides release memory ordering. The only thing I think we may need
> is a write barrier before refcount_set(), such that if
> refcount_inc_not_zero() sees a non-zero value, it is guaranteed to
> see an initialised structure...
>
> i.e. refcounts provide all the existence and initialisation
> guarantees. Hence I don't see the need to use shrinker->registered
> like this and it can remain a bit flag protected by the
> shrinker_rwsem().

Ah, I didn't consider memory ordering with the refcount when I added
WRITE_ONCE/READ_ONCE to shrinker->registered; I just didn't want KCSAN
to complain (there are multiple concurrent accessors, one of which is
a writer).

And the livelock case mentioned above is indeed unlikely to happen, so
I will delete shrinker->registered in the next version.
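
For reference, the "initialise first, publish the refcount last" rule
discussed above can be modelled in userspace with C11 atomics. This is
only a rough sketch, not kernel code: struct obj and the obj_*() helpers
are hypothetical names, and C11 acquire/release ordering is assumed as a
stand-in for the write barrier before refcount_set() and the control
dependency that refcount_inc_not_zero() provides in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct shrinker. */
struct obj {
    int payload;           /* state that must be initialised before lookup */
    atomic_uint refcount;  /* 0 means "not published yet" or "going away" */
};

/* Initialise everything, then publish by setting the first reference. */
void obj_register(struct obj *o, int payload)
{
    o->payload = payload;
    /* release: the payload write is ordered before refcount becomes 1 */
    atomic_store_explicit(&o->refcount, 1, memory_order_release);
}

/* Take a reference only if the count is currently non-zero. */
bool obj_try_get(struct obj *o)
{
    unsigned int old = atomic_load_explicit(&o->refcount,
                                            memory_order_relaxed);

    while (old != 0) {
        /* acquire on success pairs with the release in obj_register() */
        if (atomic_compare_exchange_weak_explicit(&o->refcount, &old,
                                                  old + 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return true;
        /* on failure 'old' was reloaded; retry unless it dropped to 0 */
    }
    return false;
}

/* Drop a reference; returns true if this was the last one. */
bool obj_put(struct obj *o)
{
    unsigned int old = atomic_fetch_sub_explicit(&o->refcount, 1,
                                                 memory_order_acq_rel);

    assert(old > 0);  /* dropping below zero would be a bug */
    return old == 1;
}
```

With this shape a lookup needs no separate "registered" flag: a
successful obj_try_get() already implies the object was fully set up.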
>
>
>>> void shrinker_register(struct shrinker *shrinker)
>>> {
>>>     down_write(&shrinker_rwsem);
>>>     list_add_tail_rcu(&shrinker->list, &shrinker_list);
>>>     shrinker->flags |= SHRINKER_REGISTERED;
>>>     shrinker_debugfs_add(shrinker);
>>>     up_write(&shrinker_rwsem);
>>>
>>>     /*
>>>      * now the shrinker is fully set up, take the first
>>>      * reference to it to indicate that lookup operations are
>>>      * now allowed to use it via shrinker_try_get().
>>>      */
>>>     refcount_set(&shrinker->refcount, 1);
>>> }
>>>
>>>> diff --git a/mm/shrinker_debug.c b/mm/shrinker_debug.c
>>>> index f1becfd45853..c5573066adbf 100644
>>>> --- a/mm/shrinker_debug.c
>>>> +++ b/mm/shrinker_debug.c
>>>> @@ -5,6 +5,7 @@
>>>>  #include
>>>>  #include
>>>>  #include
>>>> +#include
>>>>  /* defined in vmscan.c */
>>>>  extern struct rw_semaphore shrinker_rwsem;
>>>> @@ -161,17 +162,21 @@ int shrinker_debugfs_add(struct shrinker *shrinker)
>>>>  {
>>>>      struct dentry *entry;
>>>>      char buf[128];
>>>> -    int id;
>>>> -
>>>> -    lockdep_assert_held(&shrinker_rwsem);
>>>> +    int id, ret = 0;
>>>>      /* debugfs isn't initialized yet, add debugfs entries later. */
>>>>      if (!shrinker_debugfs_root)
>>>>          return 0;
>>>> +    down_write(&shrinker_rwsem);
>>>> +    if (shrinker->debugfs_entry)
>>>> +        goto fail;
>>>> +
>>>>      id = ida_alloc(&shrinker_debugfs_ida, GFP_KERNEL);
>>>> -    if (id < 0)
>>>> -        return id;
>>>> +    if (id < 0) {
>>>> +        ret = id;
>>>> +        goto fail;
>>>> +    }
>>>>      shrinker->debugfs_id = id;
>>>>      snprintf(buf, sizeof(buf), "%s-%d", shrinker->name, id);
>>>> @@ -180,7 +185,8 @@ int shrinker_debugfs_add(struct shrinker *shrinker)
>>>>      entry = debugfs_create_dir(buf, shrinker_debugfs_root);
>>>>      if (IS_ERR(entry)) {
>>>>          ida_free(&shrinker_debugfs_ida, id);
>>>> -        return PTR_ERR(entry);
>>>> +        ret = PTR_ERR(entry);
>>>> +        goto fail;
>>>>      }
>>>>      shrinker->debugfs_entry = entry;
>>>> @@ -188,7 +194,10 @@ int shrinker_debugfs_add(struct shrinker *shrinker)
>>>>                  &shrinker_debugfs_count_fops);
>>>>      debugfs_create_file("scan", 0220, entry, shrinker,
>>>>                  &shrinker_debugfs_scan_fops);
>>>> -    return 0;
>>>> +
>>>> +fail:
>>>> +    up_write(&shrinker_rwsem);
>>>> +    return ret;
>>>>  }
>>>>  int shrinker_debugfs_rename(struct shrinker *shrinker, const char *fmt, ...)
>>>> @@ -243,6 +252,11 @@ struct dentry *shrinker_debugfs_detach(struct shrinker *shrinker,
>>>>      shrinker->name = NULL;
>>>>      *debugfs_id = entry ? shrinker->debugfs_id : -1;
>>>> +    /*
>>>> +     * Ensure that shrinker->registered has been set to false before
>>>> +     * shrinker->debugfs_entry is set to NULL.
>>>> +     */
>>>> +    smp_wmb();
>>>>      shrinker->debugfs_entry = NULL;
>>>>      return entry;
>>>> @@ -266,14 +280,26 @@ static int __init shrinker_debugfs_init(void)
>>>>      shrinker_debugfs_root = dentry;
>>>>      /* Create debugfs entries for shrinkers registered at boot */
>>>> -    down_write(&shrinker_rwsem);
>>>> -    list_for_each_entry(shrinker, &shrinker_list, list)
>>>> +    rcu_read_lock();
>>>> +    list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>>>> +        if (!shrinker_try_get(shrinker))
>>>> +            continue;
>>>> +        rcu_read_unlock();
>>>> +
>>>>          if (!shrinker->debugfs_entry) {
>>>> -            ret = shrinker_debugfs_add(shrinker);
>>>> -            if (ret)
>>>> -                break;
>>>> +            /* Paired with smp_wmb() in shrinker_debugfs_detach() */
>>>> +            smp_rmb();
>>>> +            if (READ_ONCE(shrinker->registered))
>>>> +                ret = shrinker_debugfs_add(shrinker);
>>>>          }
>>>> -    up_write(&shrinker_rwsem);
>>>> +
>>>> +        rcu_read_lock();
>>>> +        shrinker_put(shrinker);
>>>> +
>>>> +        if (ret)
>>>> +            break;
>>>> +    }
>>>> +    rcu_read_unlock();
>>>>      return ret;
>>>>  }
>>>
>>> And all this churn and complexity can go away because the
>>> shrinker_rwsem is still used to protect shrinker_register()
>>> entirely....
>>
>> My consideration is that during this process, there may be a
>> driver probe failure and then shrinker_free() is called (the
>> shrinker_debugfs_init() is called in late_initcall stage). In
>> this case, we need to use RCU+refcount to ensure that the shrinker
>> is not freed.
>
> Yeah, you're trying to work around the lack of a
> wait_for_completion() call in shrinker_free().
>
> With that, this doesn't need RCU at all, and the iteration can be
> done fully under the shrinker_rwsem() safely and so none of this
> code needs to change.

Oh, indeed, this code does not need to be changed.

Thanks,
Qi

>
> Cheers,
>
> Dave.
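
For reference, the completion-based teardown quoted earlier in this
thread (shrinker_put() signals when the count reaches zero and
shrinker_free() waits for that before unlinking) can be modelled in
userspace roughly as below. This is a sketch under stated assumptions,
not kernel code: struct model_shrinker and the model_*() helpers are
hypothetical names, an atomic flag with a spin loop stands in for
struct completion and wait_for_completion(), and a plain bool stands in
for membership of shrinker_list.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of the fields used in the quoted sketch. */
struct model_shrinker {
    atomic_uint refcount;  /* the initial reference is set at registration */
    atomic_bool done;      /* stands in for struct completion */
    bool on_list;          /* stands in for membership of shrinker_list */
};

void model_register(struct model_shrinker *s)
{
    atomic_store(&s->done, false);
    s->on_list = true;
    /* last step: lookups can succeed only from this point on */
    atomic_store_explicit(&s->refcount, 1, memory_order_release);
}

bool model_try_get(struct model_shrinker *s)
{
    unsigned int old = atomic_load(&s->refcount);

    while (old != 0) {
        if (atomic_compare_exchange_weak(&s->refcount, &old, old + 1))
            return true;
    }
    return false;  /* count is zero: teardown has started or finished */
}

void model_put(struct model_shrinker *s)
{
    unsigned int old = atomic_fetch_sub(&s->refcount, 1);

    assert(old > 0);
    /* whoever drops the count to zero signals the waiter */
    if (old == 1)
        atomic_store(&s->done, true);
}

void model_free(struct model_shrinker *s)
{
    model_put(s);  /* drop the initial reference, possibly the last one */
    while (!atomic_load(&s->done))
        ;          /* spin here; the kernel sketch sleeps instead */
    /* no lookup can succeed any more, so unlinking is safe */
    s->on_list = false;
}
```

Once model_free() has dropped the initial reference, new lookups can
still succeed only while some other user holds a reference, which is
the interleaving from the CPU 0 / CPU 1 diagram; when the last user
puts its reference the waiter proceeds.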