Date: Fri, 9 Jun 2023 09:17:54 +1000
From: Dave Chinner <david@fromorbit.com>
To: Theodore Ts'o
Cc: Roman Gushchin, Kirill Tkhai, akpm@linux-foundation.org,
	vbabka@suse.cz, viro@zeniv.linux.org.uk, brauner@kernel.org,
	djwong@kernel.org, hughd@google.com, paulmck@kernel.org,
	muchun.song@linux.dev, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org,
	linux-kernel@vger.kernel.org, zhengqi.arch@bytedance.com
Subject: Re: [PATCH v2 3/3] fs: Use delayed shrinker unregistration
References: <168599103578.70911.9402374667983518835.stgit@pro.pro>
	<168599180526.70911.14606767590861123431.stgit@pro.pro>
	<20230608163622.GA1435580@mit.edu>
In-Reply-To: <20230608163622.GA1435580@mit.edu>
On Thu, Jun 08, 2023 at 12:36:22PM -0400, Theodore Ts'o wrote:
> On Tue, Jun 06, 2023 at 11:24:32AM +1000, Dave Chinner wrote:
> > On Mon, Jun 05, 2023 at 05:38:27PM -0700, Roman Gushchin wrote:
> > > Hm, it makes the API more complex and easier to mess with.
> > > Like what will happen if the second part is never called? Or it's
> > > called without the first part being called first?
> >
> > Bad things.
> >
> > Also, it doesn't fix the three other unregister_shrinker() calls in
> > the XFS unmount path, nor the three in the ext4/mbcache/jbd2 unmount
> > path.
> >
> > Those are just some of the unregister_shrinker() calls that have
> > dynamic contexts that would also need this same fix; I haven't
> > audited the 3 dozen other unregister_shrinker() calls around the
> > kernel to determine if any of them need similar treatment, too.
> >
> > IOWs, this patchset is purely a band-aid to fix the reported
> > regression, not an actual fix for the underlying problems caused by
> > moving the shrinker infrastructure to SRCU protection. This is why
> > I really want the SRCU changeover reverted.
>
> There's been so much traffic on linux-fsdevel that I missed this
> thread until Darrick pointed it out to me today. (Thanks, Darrick!)
>
> From his description, and my quick read-through of this thread....
> I'm worried.
>
> Given that we're at -rc5 now, and the file system folks didn't get
> consulted until fairly late in the process, and the fact that this
> may cause use-after-free problems that could lead to security issues,
> perhaps we should consider reverting the SRCU changeover now, and try
> again for the next merge window?

Yes, please, because I think we can fix this in a much better way and
make things a whole lot simpler at the same time.

The root cause of the SRCU usage is that the mm/memcg developers still
haven't unified the non-memcg and memcg-based shrinker
implementations. shrink_slab_memcg() doesn't require SRCU protection
because it does not iterate the shrinker list at all; what it requires
is *shrinker instance* lifetime protection.
The problem is shrink_slab() doing the root memcg/global shrinker
work: it iterates the shrinker list directly, and this is the list
walk that SRCU is needed to protect in order to "make shrinkers
lockless".

Going back to shrink_slab_memcg(): it looks up the shrinker instance
with idr_find(). idr_find() is a wrapper around radix_tree_lookup(),
which means we can use RCU protection and reference counting to both
validate the shrinker instance *and* guarantee that it isn't freed
from under us while we execute the shrinker. This requires, as I
mentioned elsewhere in this thread, that the shrinker instance be
dynamically allocated, not embedded in other structures.

Dynamically allocated shrinker instances and reference counting them
means we can do this in shrink_slab_memcg() to ensure shrinker
instance validity and lifetime rules are followed:

	rcu_read_lock();
	shrinker = idr_find(&shrinker_idr, i);
	if (!shrinker || !refcount_inc_not_zero(&shrinker->refcount)) {
		/* shrinker is being torn down */
		clear_bit(i, info->map);
		rcu_read_unlock();
		continue;
	}
	rcu_read_unlock();

	/* do shrinker stuff */

	if (refcount_dec_and_test(&shrinker->refcount)) {
		/* shrinker is being torn down, waiting for us */
		wake_up(&shrinker->completion_wait);
	}
	/* unsafe to reference shrinker now */

And we enable the shrinker to run simply by:

	shrinker_register()
	{
		.....
		/* allow the shrinker to run now */
		refcount_set(&shrinker->refcount, 1);
		return 0;
	}

We then shut down the shrinker so we can tear it down like so:

	shrinker_unregister()
	{
		DECLARE_WAITQUEUE(wait, current);

		add_wait_queue_exclusive(&shrinker->completion_wait, &wait);
		if (!refcount_dec_and_test(&shrinker->refcount)) {
			/* Wait for running instances to exit */
			__set_current_state(TASK_UNINTERRUPTIBLE);
			schedule();
		}
		remove_wait_queue(&shrinker->completion_wait, &wait);

		/* We own the entire shrinker instance now, tear it down */
		.....
		/* Free the shrinker after an RCU grace period expires */
		kfree_rcu(shrinker, rcu_head);
	}

So, essentially, we don't need SRCU at all to do lockless shrinker
lookup on the memcg shrinker side of the fence, nor do we need
synchronize_srcu() to wait for shrinker instances to finish running
before we can tear stuff down. There is no global state in this at
all; everything is per-shrinker instance.

But SRCU is needed to protect the global shrinker list walk, because
it hasn't been converted to always use the root memcg and be iterated
as if it is just another memcg. IOWs, using SRCU to protect the
global shrinker list walk is effectively slapping a band-aid over a
more fundamental problem that we've known about for a long time.

So the first thing that has to be done here is to unify the shrinker
infrastructure under the memcg implementation. The global shrinker
state should be tracked in the root memcg, just like any other memcg
shrinker is tracked. If memcgs are not enabled, then we should create
a dummy global memcg - returned by a get_root_memcg() helper when
memcgs are disabled - that tracks all the global shrinker state. Then
shrink_slab() doesn't care what memcg is passed to it; it just does
the same thing. IOWs, shrink_slab() gets entirely replaced by
shrink_slab_memcg(), and the need for SRCU goes away entirely because
shrinker instance lookups are all RCU+refcount protected.

Yes, this requires changing how shrinker instances are instantiated
by subsystems, but that is mostly a simple transformation of existing
code. It doesn't require changing shrinker implementations, it
doesn't require a new teardown API, and it doesn't change the context
under which shrinkers might run. All the existing RCU-protected stuff
in the shrinker maps and memcgs can remain that way.
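[For readers following along: the refcount_inc_not_zero() /
refcount_dec_and_test() pairing is the load-bearing part of the
pattern sketched above. A minimal userspace model of those two
primitives, using C11 atomics - the names mirror the kernel's
refcount_t API, but this is only an illustrative sketch, not kernel
code - might look like:]

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Userspace model of the kernel's refcount_inc_not_zero(): take a
 * reference only if the count is still non-zero, i.e. teardown has
 * not already dropped the last reference.
 */
static bool refcount_inc_not_zero(atomic_int *ref)
{
	int old = atomic_load(ref);

	do {
		if (old == 0)
			return false;	/* object is being torn down */
	} while (!atomic_compare_exchange_weak(ref, &old, old + 1));
	return true;
}

/*
 * Model of refcount_dec_and_test(): returns true iff we just dropped
 * the last reference and therefore own the teardown.
 */
static bool refcount_dec_and_test(atomic_int *ref)
{
	return atomic_fetch_sub(ref, 1) == 1;
}
```

[With these semantics, a lookup that loses the race with unregister
sees the count already at zero and simply skips the shrinker - which
is the `!refcount_inc_not_zero()` branch in shrink_slab_memcg().]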
We can also keep the shrinker rwsem/mutex for all the little internal
bits of setup/teardown/non-RCU coordination that are needed; once the
list iteration is lockless, there will be almost no contention on
that lock at all...

This isn't all that hard - it's just replicating a well-known design
pattern (i.e. refcount+RCU) that we use all over the kernel, combined
with taking advantage of IDR being backed by RCU-safe infrastructure.
If I had been cc'd on the original SRCU patches, this is exactly what
I would have suggested as a better solution to the problem. We end up
with cleaner, more robust and better performing infrastructure.

This is far better than layering more complexity on top of what is
really a poor foundation....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com