From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.2 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED, USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 16616C433DF for ; Mon, 15 Jun 2020 23:06:28 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id AC7A420644 for ; Mon, 15 Jun 2020 23:06:27 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org AC7A420644 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=fromorbit.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id EF54D6B0003; Mon, 15 Jun 2020 19:06:26 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id EA6386B0005; Mon, 15 Jun 2020 19:06:26 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id DBB166B0006; Mon, 15 Jun 2020 19:06:26 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0252.hostedemail.com [216.40.44.252]) by kanga.kvack.org (Postfix) with ESMTP id C12496B0003 for ; Mon, 15 Jun 2020 19:06:26 -0400 (EDT) Received: from smtpin20.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with ESMTP id 676931E4F5 for ; Mon, 15 Jun 2020 23:06:26 +0000 (UTC) X-FDA: 76932981972.20.car93_341125026dfa Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin20.hostedemail.com (Postfix) with ESMTP id 32F93180BF307 for ; Mon, 15 Jun 2020 23:06:26 +0000 (UTC) X-HE-Tag: car93_341125026dfa X-Filterd-Recvd-Size: 7107 Received: from mail109.syd.optusnet.com.au (mail109.syd.optusnet.com.au [211.29.132.80]) by imf10.hostedemail.com (Postfix) with ESMTP for ; Mon, 15 Jun 2020 23:06:25 +0000 (UTC) Received: from dread.disaster.area (pa49-180-124-177.pa.nsw.optusnet.com.au [49.180.124.177]) by mail109.syd.optusnet.com.au (Postfix) with ESMTPS id 69597D7BB12; Tue, 16 Jun 2020 09:06:16 +1000 (AEST) Received: from dave by dread.disaster.area with local (Exim 4.92.3) (envelope-from ) id 1jkyB3-0000zE-Vj; Tue, 16 Jun 2020 09:06:05 +1000 Date: Tue, 16 Jun 2020 09:06:05 +1000 From: Dave Chinner To: Michal Hocko Cc: Holger =?iso-8859-1?Q?Hoffst=E4tte?= , Yafang Shao , linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH v3] xfs: avoid deadlock when trigger memory reclaim in ->writepages Message-ID: <20200615230605.GV2040@dread.disaster.area> References: <1592222181-9832-1-git-send-email-laoar.shao@gmail.com> <20200615145331.GK25296@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline In-Reply-To: <20200615145331.GK25296@dhcp22.suse.cz> User-Agent: Mutt/1.10.1 (2018-07-13) X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=W5xGqiek c=1 sm=1 tr=0 a=k3aV/LVJup6ZGWgigO6cSA==:117 a=k3aV/LVJup6ZGWgigO6cSA==:17 a=8nJEP1OIZ-IA:10 a=nTHF0DUjJn0A:10 a=7-415B0cAAAA:8 a=sTEJVhqn8-_9IwXEd_sA:9 a=wPNLvfGTeEIA:10 a=biEYGPWJfzWAr4FL6Ov7:22 X-Rspamd-Queue-Id: 32F93180BF307 X-Spamd-Result: default: False [0.00 / 100.00] X-Rspamd-Server: rspam01 Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Mon, Jun 15, 2020 at 04:53:31PM +0200, Michal Hocko wrote: > On Mon 15-06-20 16:25:52, Holger Hoffst=E4tte wrote: > > On 2020-06-15 13:56, Yafang Shao wrote: > [...] > > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c > > > index b356118..1ccfbf2 100644 > > > --- a/fs/xfs/xfs_aops.c > > > +++ b/fs/xfs/xfs_aops.c > > > @@ -573,9 +573,21 @@ static inline bool xfs_ioend_needs_workqueue(s= truct iomap_ioend *ioend) > > > struct writeback_control *wbc) > > > { > > > struct xfs_writepage_ctx wpc =3D { }; > > > + unsigned int nofs_flag; > > > + int ret; > > > xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED); > > > - return iomap_writepages(mapping, wbc, &wpc.ctx, &xfs_writeback_op= s); > > > + > > > + /* > > > + * We can allocate memory here while doing writeback on behalf of > > > + * memory reclaim. To avoid memory allocation deadlocks set the > > > + * task-wide nofs context for the following operations. > > > + */ > > > + nofs_flag =3D memalloc_nofs_save(); > > > + ret =3D iomap_writepages(mapping, wbc, &wpc.ctx, &xfs_writeback_o= ps); > > > + memalloc_nofs_restore(nofs_flag); > > > + > > > + return ret; > > > } > > > STATIC int > > >=20 > >=20 > > Not sure if I did something wrong, but while the previous version of = this patch > > worked fine, this one gave me (with v2 removed obviously): > >=20 > > [ +0.000004] WARNING: CPU: 0 PID: 2811 at fs/iomap/buffered-io.c:154= 4 iomap_do_writepage+0x6b4/0x780 >=20 > This corresponds to > /* > * Given that we do not allow direct reclaim to call us, we sho= uld > * never be called in a recursive filesystem reclaim context. > */ > if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS)) > goto redirty; >=20 > which effectivelly says that memalloc_nofs_save/restore cannot be used > for that code path. No it doesn't. Everyone is ignoring the -context- of this code, most especially the previous 3 lines of code and it's comment: /* * Refuse to write the page out if we are called from reclaim con= text. * * This avoids stack overflows when called from deeply used stack= s in * random callers for direct reclaim or memcg reclaim. We explic= itly * allow reclaim from kswapd as the stack usage there is relative= ly low. * * This should never happen except in the case of a VM regression= so * warn about it. */ if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) =3D=3D PF_MEMALLOC)) goto redirty; You will see that we specifically avoid this path from reclaim context except for kswapd. And kswapd always runs with GFP_KERNEL context so we allow writeback from it, so it will pass both this check and the NOFS check above.=20 IOws, we can't trigger to the WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS)) check from a memory reclaim context: this PF_MEMALLOC_NOFS check here is not doing what people think it is. History lesson time, eh? The recursion protection here -used- to use PF_FSTRANS, not PF_MEMALLOC_NOFS. See commit 9070733b4efac ("xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS"). This hunk is most instructive when you look at the comment: * Given that we do not allow direct reclaim to call us, we shoul= d * never be called while in a filesystem transaction. */ - if (WARN_ON_ONCE(current->flags & PF_FSTRANS)) + if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS)) goto redirty; It wasn't for memory allocation recursion protection in XFS - it was for transaction reservation recursion protection by something trying to flush data pages while holding a transaction reservation. Doing this could deadlock the journal because the existing reservation could prevent the nested reservation for being able to reserve space in the journal and that is a self-deadlock vector. IOWs, this check is not protecting against memory reclaim recursion bugs at all (that's the previous check I quoted). This check is protecting against the filesystem calling writepages directly from a context where it can self-deadlock. So what we are seeing here is that the PF_FSTRANS -> PF_MEMALLOC_NOFS abstraction lost all the actual useful information about what type of error this check was protecting against. > Your stack trace doesn't point to a reclaim path > which shows that this path is shared and also underlines that this is > not really an intended use of the api. Actually, Michal, it was your PF_FSTRANS -> PF_MEMALLOC_NOFS abstraction of this code that turned this from "exactly what PF_FSTRANS was intended to catch" to what you call "unintended use of the API".... IOWs, putting the iomap_writepage path under NOFS context is the right thing to do from a "prevent memory reclaim" perspective, but now we are hitting against the problems of repurposing filesystem specific flags for subtlely different generic semantics... I suspect we need to re-introduce PF_FSTRANS, set/clear/transfer it again in all the places XFS used to transfer it, and change this iomap check to use PF_FSTRANS and not PF_MEMALLOC_NOFS, because it's clearly not and never has been a memory reclaim recursion warning check.... Cheers, Dave. --=20 Dave Chinner david@fromorbit.com