Date: Mon, 2 Oct 2023 09:38:00 +1100
From: Dave Chinner <david@fromorbit.com>
To: Matthew Wilcox
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    linux-mm@kvack.org
Subject: Re: Removal of KM_NOFS

On Fri, Sep 29, 2023 at 11:18:19PM +0100, Matthew Wilcox wrote:
> I had a long plane ride yesterday, and I started on "Removing GFP_NOFS".
> TLDR: I don't know enough about XFS to do this first step. There are
> various options and none of them are "obviously" the right thing to do.
>
> The overall intent is to get rid of the __GFP_FS flag entirely; make
> GFP_NOFS the same as GFP_KERNEL (and a later patch could rename all
> the uses of GFP_NOFS to GFP_KERNEL). That is, the only way to prevent
> the memory allocator from entering fs reclaim would be by calling
> memalloc_nofs_save().
>
> XFS already calls memalloc_nofs_save() when starting a transaction.
> This is almost certainly the right thing to do; many things which
> could be freed through fs reclaim would need to start a transaction,
> and we don't want to do a nested transaction. But it turns out there
> are several places that can't enter fs reclaim for other reasons.
>
> Having boldly just removed __GFP_FS, I encountered problems (ie
> lockdep got chatty) in XFS and now I don't think I know enough to
> take on the prerequisite project of removing KM_NOFS. While this is
> obviously _possible_ (simply add calls to memalloc_nofs_save() and
> memalloc_nofs_restore() around calls currently marked as KM_NOFS),
> that's not really how the scoped API is supposed to be used. Instead,
> one is supposed to improve the understandability of the code by marking
> the sections where, say, a lock is taken as now being unsafe to enter
> fs reclaim because that lock is held.

Here's the problem: we have used KM_NOFS in the past just to shut up
lockdep false positives. In many cases it is perfectly safe to take
the lock in both task context and in reclaim context, and it will not
deadlock, because the task context has an active reference on the
inode and so the inode cannot be locked in both task and reclaim
context at the same time. i.e. this path:

> That potentially deadlocks against
>
> -> #0 (&xfs_nondir_ilock_class#3){++++}-{3:3}:
>        __lock_acquire+0x148e/0x26d0
>        lock_acquire+0xb8/0x280
>        down_write_nested+0x3f/0xe0
>        xfs_ilock+0xe3/0x260
>        xfs_icwalk_ag+0x68c/0xa50
>        xfs_icwalk+0x3e/0xa0
>        xfs_reclaim_inodes_nr+0x7c/0xa0
>        xfs_fs_free_cached_objects+0x14/0x20
>        super_cache_scan+0x17d/0x1c0
>        do_shrink_slab+0x16a/0x680
>        shrink_slab+0x52a/0x8a0
>        shrink_node+0x308/0x7a0
>        balance_pgdat+0x28d/0x710

can only be reached when the inode has been removed from the VFS cache
and has been marked XFS_IRECLAIM, so all new lookups on that inode
will fail until the inode has been RCU freed. Hence anything that does
a GFP_KERNEL allocation whilst holding an inode lock can have lockdep
complain about deadlocks against locking the inode in the reclaim
path.

More recently, __GFP_NOLOCKDEP was added to allow us to annotate these
allocation points that historically have used KM_NOFS to shut up
lockdep false positives. However, no effort has been made to go back
and find all the KM_NOFS sites that should be converted to
__GFP_NOLOCKDEP. I suspect that most of the remaining uses of KM_NOFS
are going to fall into this category; certainly anything to do with
reading inodes into memory and populating extent lists (most of the
KM_NOFS uses) from non-modifying lookup paths (e.g. open(O_RDONLY),
read(), etc) can trigger this lockdep false positive if they don't use
KM_NOFS or __GFP_NOLOCKDEP...
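To make the shape of that conversion concrete, here's a minimal sketch
of one such site. The call site is invented for illustration (ptr and
size are placeholders), but kmem_alloc()/KM_NOFS (the old XFS
allocation wrappers), kmalloc() and __GFP_NOLOCKDEP are the real
interfaces:

	xfs_ilock(ip, XFS_ILOCK_SHARED);

	/*
	 * Historical workaround: KM_NOFS keeps fs reclaim out
	 * entirely, so lockdep never sees "ILOCK held -> direct
	 * reclaim -> shrinker takes the same ILOCK class on another
	 * inode".
	 */
	ptr = kmem_alloc(size, KM_NOFS);

	xfs_iunlock(ip, XFS_ILOCK_SHARED);

becomes:

	xfs_ilock(ip, XFS_ILOCK_SHARED);

	/*
	 * Annotated replacement: fs reclaim is allowed again, but
	 * lockdep is told not to track this allocation for reclaim
	 * recursion, so the false positive above cannot fire.
	 */
	ptr = kmalloc(size, GFP_KERNEL | __GFP_NOLOCKDEP);

	xfs_iunlock(ip, XFS_ILOCK_SHARED);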
> We could pop the nofs setting anywhere in this call chain, but _really_
> what we should be doing is calling memalloc_nofs_save() when we take
> the xfs_nondir_ilock_class#3. But ... there are a lot of places we
> take the ilock, and it's kind of a big deal to add memalloc_nofs_save()
> calls to all of them. And then I looked at _why_ we take the lock, and
> it's kind of stupid; we're just waiting for other callers to free it.

Its purpose has been to serialise against racing lockless inode
lookups from the inode writeback and inode freeing (i.e. unlink) code
that walks all the inodes in an on-disk cluster; those lookups might
have grabbed the inode between the flush checks and the deletion from
the index. This is old code (it dates back to before we had lockless
RCU lookups), but we've reworked how inode clusters do writeback and
how those lockless lookups work a couple of times since it was
introduced. In general, though, we've treated this reclaim code as "if
it ain't broke, don't fix it", so we haven't removed that final ILOCK
in case we've missed some subtle interaction that still requires it. I
think we may have been overly cautious, but I'll need to look at the
lockless lookup code again in a bit more detail before I form a solid
opinion on that...

> ie xfs_reclaim_inode() does:
>
> 	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
> 		goto out;
> 	...
> 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> 	...
> 	if (!radix_tree_delete(&pag->pag_ici_root,
> 				XFS_INO_TO_AGINO(ip->i_mount, ino)))
> 	...
> 	xfs_ilock(ip, XFS_ILOCK_EXCL);
>
> ie we did the trylock, and it succeeded. We know we don't have the
> lock in process context. It feels like we could legitimately use
> xfs_lock_inumorder() to use a different locking class to do this wait.

Unfortunately for us, this used to be only one of many lock points
that caused these lockdep issues - anywhere we took an inode lock in
the reclaim path (i.e. superblock shrinker context) could fire a
lockdep false positive. Lockdep just doesn't understand
reference-counted life cycle contexts.

Yes, we do have subclasses for inode locking, and we used to even
re-initialise the entire inode lockdep map context when the inode
entered reclaim to avoid false positives. However, given the scope of
all the different places in inode reclaim context that could lock the
inode, the only way to reliably avoid these reclaim false positives
was to sprinkle KM_NOFS around allocation sites that could be run from
both GFP_KERNEL and GFP_NOFS contexts with an inode locked.

That said, most of the inode work needed in memory reclaim context has
now been moved completely out of the shrinker context (i.e. to the
per-cpu background inodegc workqueues). Hence that final ILOCK in
xfs_reclaim_inode() is likely the only place that remains where this
ILOCK false positive can trigger, so maybe it's a simpler problem to
fix than it once was....
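If it really is down to that one site, the xfs_lock_inumorder()
suggestion quoted above would be roughly this much code. This is a
sketch only, not a tested patch: XFS_RECLAIM_DEP is a hypothetical
subclass number, and it assumes xfs_lock_inumorder() can be made
reachable from the icache code:

	/*
	 * Sketch: take the final "wait for the last holder to drop
	 * the lock" ILOCK in xfs_reclaim_inode() in its own lockdep
	 * subclass, so lockdep no longer conflates it with
	 * task-context ILOCK usage.
	 */
	xfs_ilock(ip, xfs_lock_inumorder(XFS_ILOCK_EXCL, XFS_RECLAIM_DEP));
	xfs_iunlock(ip, XFS_ILOCK_EXCL);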
-Dave.
-- 
Dave Chinner
david@fromorbit.com