Date: Mon, 2 Oct 2023 09:38:00 +1100
From: Dave Chinner <david@fromorbit.com>
To: Matthew Wilcox
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    linux-mm@kvack.org
Subject: Re: Removal of KM_NOFS

On Fri, Sep 29, 2023 at 11:18:19PM +0100, Matthew Wilcox wrote:
> I had a long plane ride yesterday, and I started on "Removing GFP_NOFS".
> TLDR: I don't know enough about XFS to do this first step. There are
> various options and none of them are "obviously" the right thing to do.
>
> The overall intent is to get rid of the __GFP_FS flag entirely; make
> GFP_NOFS the same as GFP_KERNEL (and a later patch could rename all
> the uses of GFP_NOFS to GFP_KERNEL). That is, the only way to prevent
> the memory allocator from entering fs reclaim would be by calling
> memalloc_nofs_save().
>
> XFS already calls memalloc_nofs_save() when starting a transaction.
> This is almost certainly the right thing to do; many things which
> could be freed through fs reclaim would need to start a transaction,
> and we don't want to do a nested transaction. But it turns out there
> are several places that can't enter fs reclaim for other reasons.
>
> Having boldly just removed __GFP_FS, I encountered problems (ie
> lockdep got chatty) in XFS and now I don't think I know enough to
> take on the prerequisite project of removing KM_NOFS. While this is
> obviously _possible_ (simply add calls to memalloc_nofs_save() and
> memalloc_nofs_restore() around calls currently marked as KM_NOFS),
> that's not really how the scoped API is supposed to be used. Instead,
> one is supposed to improve the understandability of the code by marking
> the sections where, say, a lock is taken as now being unsafe to enter
> fs reclaim because that lock is held.

Here's the problem: we have used KM_NOFS in the past just to shut up
lockdep false positives. In many cases it is perfectly safe to take
the lock in both task context and in reclaim context, and it will not
deadlock, because the task context has an active reference on the
inode and so the inode cannot be locked in both task and reclaim
context at the same time. i.e. this path:

> That potentially deadlocks against
>
> -> #0 (&xfs_nondir_ilock_class#3){++++}-{3:3}:
>        __lock_acquire+0x148e/0x26d0
>        lock_acquire+0xb8/0x280
>        down_write_nested+0x3f/0xe0
>        xfs_ilock+0xe3/0x260
>        xfs_icwalk_ag+0x68c/0xa50
>        xfs_icwalk+0x3e/0xa0
>        xfs_reclaim_inodes_nr+0x7c/0xa0
>        xfs_fs_free_cached_objects+0x14/0x20
>        super_cache_scan+0x17d/0x1c0
>        do_shrink_slab+0x16a/0x680
>        shrink_slab+0x52a/0x8a0
>        shrink_node+0x308/0x7a0
>        balance_pgdat+0x28d/0x710

can only be reached when the inode has been removed from the VFS cache
and has been marked XFS_IRECLAIM, so all new lookups on that inode
will fail until the inode has been RCU freed. Hence anything that does
a GFP_KERNEL allocation whilst holding an inode lock can have lockdep
complain about deadlocks against locking the inode in the reclaim
path.

More recently, __GFP_NOLOCKDEP was added to allow us to annotate these
allocation points that historically have used KM_NOFS to shut up
lockdep false positives. However, no effort has been made to go back
and find all the KM_NOFS sites that should be converted to
__GFP_NOLOCKDEP. I suspect that most of the remaining uses of KM_NOFS
are going to fall into this category; certainly anything to do with
reading inodes into memory and populating extent lists (most of the
KM_NOFS uses) from non-modifying lookup paths (e.g. open(O_RDONLY),
read(), etc) can trigger this lockdep false positive if they don't use
KM_NOFS or __GFP_NOLOCKDEP...
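To make the shape of that conversion concrete, here's a minimal sketch
of one such site. The call site is invented for illustration (ptr and
size are placeholders), but kmem_alloc()/KM_NOFS (the old XFS
allocation wrappers), kmalloc() and __GFP_NOLOCKDEP are the real
interfaces:

	xfs_ilock(ip, XFS_ILOCK_SHARED);

	/*
	 * Historical workaround: KM_NOFS keeps fs reclaim out
	 * entirely, so lockdep never sees "ILOCK held -> direct
	 * reclaim -> shrinker takes the same ILOCK class on another
	 * inode".
	 */
	ptr = kmem_alloc(size, KM_NOFS);

	xfs_iunlock(ip, XFS_ILOCK_SHARED);

becomes:

	xfs_ilock(ip, XFS_ILOCK_SHARED);

	/*
	 * Annotated replacement: fs reclaim is allowed again, but
	 * lockdep is told not to track this allocation for reclaim
	 * recursion, so the false positive above cannot fire.
	 */
	ptr = kmalloc(size, GFP_KERNEL | __GFP_NOLOCKDEP);

	xfs_iunlock(ip, XFS_ILOCK_SHARED);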
> We could pop the nofs setting anywhere in this call chain, but _really_
> what we should be doing is calling memalloc_nofs_save() when we take
> the xfs_nondir_ilock_class#3. But ... there are a lot of places we
> take the ilock, and it's kind of a big deal to add memalloc_nofs_save()
> calls to all of them. And then I looked at _why_ we take the lock, and
> it's kind of stupid; we're just waiting for other callers to free it.

Its purpose has been to serialise against racing lockless inode
lookups from the inode writeback and inode freeing (i.e. unlink) code
that walks all the inodes in an on-disk cluster; those lookups might
have grabbed the inode between the flush checks and the deletion from
the index. This is old code (it dates back to before we had lockless
RCU lookups), but we've reworked how inode clusters do writeback and
how those lockless lookups work a couple of times since it was
introduced. In general, though, we've treated this reclaim code as "if
it ain't broke, don't fix it", so we haven't removed that final ILOCK
in case we've missed some subtle interaction that still requires it. I
think we may have been overly cautious, but I'll need to look at the
lockless lookup code again in a bit more detail before I form a solid
opinion on that...

> ie xfs_reclaim_inode() does:
>
> 	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
> 		goto out;
> 	...
> 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> 	...
> 	if (!radix_tree_delete(&pag->pag_ici_root,
> 				XFS_INO_TO_AGINO(ip->i_mount, ino)))
> 	...
> 	xfs_ilock(ip, XFS_ILOCK_EXCL);
>
> ie we did the trylock, and it succeeded. We know we don't have the
> lock in process context. It feels like we could legitimately use
> xfs_lock_inumorder() to use a different locking class to do this wait.

Unfortunately for us, this used to be only one of many lock points
that caused these lockdep issues - anywhere we took an inode lock in
the reclaim path (i.e. superblock shrinker context) could fire a
lockdep false positive. Lockdep just doesn't understand
reference-counted life cycle contexts.

Yes, we do have subclasses for inode locking, and we used to even
re-initialise the entire inode lockdep map context when the inode
entered reclaim to avoid false positives. However, given the scope of
all the different places in inode reclaim context that could lock the
inode, the only way to reliably avoid these reclaim false positives
was to sprinkle KM_NOFS around allocation sites that could be run from
both GFP_KERNEL and GFP_NOFS contexts with an inode locked.

That said, most of the inode work needed in memory reclaim context has
now been moved completely out of the shrinker context (i.e. to the
per-cpu background inodegc workqueues). Hence that final ILOCK in
xfs_reclaim_inode() is likely the only place that remains where this
ILOCK false positive can trigger, so maybe it's a simpler problem to
fix than it once was....
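If it really is down to that one site, the xfs_lock_inumorder()
suggestion quoted above would be roughly this much code. This is a
sketch only, not a tested patch: XFS_RECLAIM_DEP is a hypothetical
subclass number, and it assumes xfs_lock_inumorder() can be made
reachable from the icache code:

	/*
	 * Sketch: take the final "wait for the last holder to drop
	 * the lock" ILOCK in xfs_reclaim_inode() in its own lockdep
	 * subclass, so lockdep no longer conflates it with
	 * task-context ILOCK usage.
	 */
	xfs_ilock(ip, xfs_lock_inumorder(XFS_ILOCK_EXCL, XFS_RECLAIM_DEP));
	xfs_iunlock(ip, XFS_ILOCK_EXCL);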
-Dave.
-- 
Dave Chinner
david@fromorbit.com