From: Dave Chinner <david@fromorbit.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: "Holger Hoffstätte" <holger@applied-asynchrony.com>,
"Yafang Shao" <laoar.shao@gmail.com>,
linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-mm@kvack.org
Subject: Re: [PATCH v3] xfs: avoid deadlock when trigger memory reclaim in ->writepages
Date: Tue, 16 Jun 2020 09:06:05 +1000
Message-ID: <20200615230605.GV2040@dread.disaster.area>
In-Reply-To: <20200615145331.GK25296@dhcp22.suse.cz>

On Mon, Jun 15, 2020 at 04:53:31PM +0200, Michal Hocko wrote:
> On Mon 15-06-20 16:25:52, Holger Hoffstätte wrote:
> > On 2020-06-15 13:56, Yafang Shao wrote:
> [...]
> > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > index b356118..1ccfbf2 100644
> > > --- a/fs/xfs/xfs_aops.c
> > > +++ b/fs/xfs/xfs_aops.c
> > > @@ -573,9 +573,21 @@ static inline bool xfs_ioend_needs_workqueue(struct iomap_ioend *ioend)
> > > struct writeback_control *wbc)
> > > {
> > > struct xfs_writepage_ctx wpc = { };
> > > + unsigned int nofs_flag;
> > > + int ret;
> > > xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
> > > - return iomap_writepages(mapping, wbc, &wpc.ctx, &xfs_writeback_ops);
> > > +
> > > + /*
> > > + * We can allocate memory here while doing writeback on behalf of
> > > + * memory reclaim. To avoid memory allocation deadlocks set the
> > > + * task-wide nofs context for the following operations.
> > > + */
> > > + nofs_flag = memalloc_nofs_save();
> > > + ret = iomap_writepages(mapping, wbc, &wpc.ctx, &xfs_writeback_ops);
> > > + memalloc_nofs_restore(nofs_flag);
> > > +
> > > + return ret;
> > > }
> > > STATIC int
> > >
> >
> > Not sure if I did something wrong, but while the previous version of this patch
> > worked fine, this one gave me (with v2 removed obviously):
> >
> > [ +0.000004] WARNING: CPU: 0 PID: 2811 at fs/iomap/buffered-io.c:1544 iomap_do_writepage+0x6b4/0x780
>
> This corresponds to
> /*
> * Given that we do not allow direct reclaim to call us, we should
> * never be called in a recursive filesystem reclaim context.
> */
> if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
> goto redirty;
>
> which effectivelly says that memalloc_nofs_save/restore cannot be used
> for that code path.

No it doesn't. Everyone is ignoring the -context- of this code, most
especially the previous 3 lines of code and its comment:

	/*
	 * Refuse to write the page out if we are called from reclaim context.
	 *
	 * This avoids stack overflows when called from deeply used stacks in
	 * random callers for direct reclaim or memcg reclaim.  We explicitly
	 * allow reclaim from kswapd as the stack usage there is relatively low.
	 *
	 * This should never happen except in the case of a VM regression so
	 * warn about it.
	 */
	if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
			PF_MEMALLOC))
		goto redirty;

You will see that we specifically avoid this path from reclaim
context except for kswapd. And kswapd always runs with GFP_KERNEL
context so we allow writeback from it, so it will pass both this
check and the NOFS check above.

IOWs, we can't trigger the WARN_ON_ONCE(current->flags &
PF_MEMALLOC_NOFS) check from a memory reclaim context: this
PF_MEMALLOC_NOFS check here is not doing what people think it is.
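
As a rough userspace illustration of the two checks discussed above (not kernel code), the sketch below models how different task contexts fare. The flag values are illustrative stand-ins rather than the kernel's actual bit assignments, and `model_do_writepage()` and the `wp_result` enum are hypothetical names introduced only for this example.

```c
#include <assert.h>

/* Stand-in flag bits; the values are assumptions, not the kernel's. */
#define PF_MEMALLOC		0x1u
#define PF_KSWAPD		0x2u
#define PF_MEMALLOC_NOFS	0x4u

enum wp_result {
	WP_WRITE,		/* page is written back */
	WP_REDIRTY_RECLAIM,	/* first check fired: direct/memcg reclaim */
	WP_REDIRTY_NOFS,	/* second check fired: NOFS flag was set */
};

/* Hypothetical model of the two flag checks, in the order they appear. */
static enum wp_result model_do_writepage(unsigned int flags)
{
	/* Reclaim context is refused unless the task is kswapd. */
	if ((flags & (PF_MEMALLOC | PF_KSWAPD)) == PF_MEMALLOC)
		return WP_REDIRTY_RECLAIM;

	/* Any task that gets this far with the NOFS flag set trips this. */
	if (flags & PF_MEMALLOC_NOFS)
		return WP_REDIRTY_NOFS;

	return WP_WRITE;
}
```

In this model, direct reclaim (`PF_MEMALLOC` alone) never reaches the second check, and kswapd (`PF_MEMALLOC | PF_KSWAPD`, no NOFS) passes both. The only way to hit the second check is a non-reclaim caller that has the NOFS flag set, which is what the v3 patch does.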

History lesson time, eh?

The recursion protection here -used- to use PF_FSTRANS, not
PF_MEMALLOC_NOFS. See commit 9070733b4efac ("xfs: abstract
PF_FSTRANS to PF_MEMALLOC_NOFS"). This hunk is most instructive
when you look at the comment:

 	 * Given that we do not allow direct reclaim to call us, we should
 	 * never be called while in a filesystem transaction.
 	 */
-	if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
+	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
 		goto redirty;

It wasn't for memory allocation recursion protection in XFS - it was
for transaction reservation recursion protection by something trying
to flush data pages while holding a transaction reservation. Doing
this could deadlock the journal because the existing reservation
could prevent the nested reservation from being able to reserve space
in the journal and that is a self-deadlock vector.
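
The self-deadlock described above can be pictured with a deliberately tiny toy model. This is only a sketch: `struct toy_log` and its helpers are hypothetical and bear no resemblance to how the real XFS log tracks reservation space. The point is only that a nested reservation can be blocked by space that the same task is holding.

```c
#include <assert.h>

/* Toy journal: a fixed pool of space units handed out as reservations. */
struct toy_log {
	unsigned int total;	/* total journal space */
	unsigned int reserved;	/* space currently held by reservations */
};

/* Returns 1 if granted, 0 if the caller would have to sleep for space. */
static int toy_log_try_reserve(struct toy_log *log, unsigned int units)
{
	if (log->total - log->reserved < units)
		return 0;
	log->reserved += units;
	return 1;
}

/* Give back space, as happens once the holder finishes its transaction. */
static void toy_log_release(struct toy_log *log, unsigned int units)
{
	assert(units <= log->reserved);
	log->reserved -= units;
}
```

With a 10-unit log, a task holding 8 units that then asks for 4 more gets 0 back from `toy_log_try_reserve()`. In a blocking implementation it would sleep waiting for space that only it can release, and it cannot release that space until the nested request completes.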
IOWs, this check is not protecting against memory reclaim recursion
bugs at all (that's the previous check I quoted). This check is
protecting against the filesystem calling writepages directly from a
context where it can self-deadlock.
So what we are seeing here is that the PF_FSTRANS ->
PF_MEMALLOC_NOFS abstraction lost all the actual useful information
about what type of error this check was protecting against.
> Your stack trace doesn't point to a reclaim path
> which shows that this path is shared and also underlines that this is
> not really an intended use of the api.
Actually, Michal, it was your PF_FSTRANS -> PF_MEMALLOC_NOFS
abstraction of this code that turned this from "exactly what
PF_FSTRANS was intended to catch" to what you call "unintended use
of the API"....
IOWs, putting the iomap_writepage path under NOFS context is the
right thing to do from a "prevent memory reclaim" perspective, but
now we are running up against the problems of repurposing filesystem
specific flags for subtly different generic semantics...
I suspect we need to re-introduce PF_FSTRANS, set/clear/transfer it
again in all the places XFS used to transfer it, and change this
iomap check to use PF_FSTRANS and not PF_MEMALLOC_NOFS, because it's
clearly not and never has been a memory reclaim recursion warning
check....
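
A minimal userspace sketch of that separation, under stated assumptions: the flag values are stand-ins, `task_flags` models `current->flags`, and the `model_*` and `fstrans_*` helpers are hypothetical names, not existing kernel functions. The save/restore pair mirrors the shape of memalloc_nofs_save()/memalloc_nofs_restore(); the writepage check tests only the transaction flag.

```c
#include <assert.h>

/* Stand-in flag bits; the values are assumptions, not the kernel's. */
#define PF_MEMALLOC_NOFS	0x4u
#define PF_FSTRANS		0x8u

static unsigned int task_flags;	/* models current->flags */

/* Set the NOFS scope and return the previous state of the NOFS bit. */
static unsigned int model_nofs_save(void)
{
	unsigned int old = task_flags & PF_MEMALLOC_NOFS;

	task_flags |= PF_MEMALLOC_NOFS;
	return old;
}

/* Restore the NOFS bit to what model_nofs_save() returned. */
static void model_nofs_restore(unsigned int old)
{
	task_flags = (task_flags & ~PF_MEMALLOC_NOFS) | old;
}

/* Hypothetical markers for "this task holds a transaction reservation". */
static void fstrans_begin(void) { task_flags |= PF_FSTRANS; }
static void fstrans_end(void)   { task_flags &= ~PF_FSTRANS; }

/* Returns 1 if writeback proceeds, 0 if the page would be redirtied. */
static int fstrans_do_writepage(void)
{
	if (task_flags & PF_FSTRANS)	/* transaction recursion, not NOFS */
		return 0;
	return 1;
}

/* Writepages wrapped in a NOFS scope, as in the patch under discussion. */
static int fstrans_writepages(void)
{
	unsigned int nofs = model_nofs_save();
	int ret = fstrans_do_writepage();

	model_nofs_restore(nofs);
	return ret;
}
```

In this model, `fstrans_writepages()` succeeds from an ordinary context even though it sets the NOFS scope, and it is refused only between `fstrans_begin()` and `fstrans_end()`, which is the case the original PF_FSTRANS check was written to catch.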
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com