linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Brian Foster <bfoster@redhat.com>
Cc: linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org,
	linux-mm@kvack.org
Subject: Re: [PATCH RFC 0/4] iomap: zero dirty folios over unwritten mappings on zero range
Date: Fri, 19 Jul 2024 11:10:28 +1000	[thread overview]
Message-ID: <Zpm9BFLjU0DkHBWc@dread.disaster.area> (raw)
In-Reply-To: <20240718130212.23905-1-bfoster@redhat.com>

On Thu, Jul 18, 2024 at 09:02:08AM -0400, Brian Foster wrote:
> Hi all,
> 
> This is a stab at fixing the iomap zero range problem where it doesn't
> correctly handle the case of an unwritten mapping with dirty pagecache.
> The gist is that we scan the mapping for dirty cache, zero any
> already-dirty folios via buffered writes as normal, but then otherwise
> skip clean ranges once we have a chance to validate those ranges against
> races with writeback or reclaim.
> 
> This is somewhat simplistic in terms of how it scans, but that is
> intentional based on the existing use cases for zero range. From poking
> around a bit, my current sense is that there isn't any user of zero
> range that would ever expect to see more than a single dirty folio.

The current code generally only zeroes a single filesystem block or
less because that's all we need to zero for partial writes.  This is
not going to be true for very much longer with XFS forcealign
functionality, and I suspect it's not true right now for large rt
extent sizes when doing sub-extent writes. In these cases, we are
going to have to zero multiple filesystem blocks during truncate,
hole punch, unaligned writes, etc.

So even if we don't do this now, I think this is something we will
almost certainly be doing in the next kernel release or two.

> Most
> callers either straddle the EOF folio or flush in higher level code for
> presumably (fs) context specific reasons. If somebody has an example to
> the contrary, please let me know because I'd love to be able to use it
> for testing.

Check the xfs_inode_has_bigrtalloc() and xfs_inode_alloc_unitsize()
cases. These are currently being worked on and expanded and factored
so eventually these cases will all fall under
xfs_inode_alloc_unitsize().

> The caveat to this approach is that it only works for filesystems that
> implement folio_ops->iomap_valid(), which is currently just XFS. GFS2
> doesn't use ->iomap_valid() and does call zero range, but AFAICT it
> doesn't actually export unwritten mappings so I suspect this is not a
> problem. My understanding is that ext4 iomap support is in progress, but
> I've not yet dug into what that looks like (though I suspect similar to
> XFS). The concern is mainly that this leaves a landmine for fs that
> might grow support for unwritten mappings && zero range but not
> ->iomap_valid(). We'd likely never know zero range was broken for such
> fs until stale data exposure problems start to materialize.
> 
> I considered adding a fallback to just add a flush at the top of
> iomap_zero_range() so at least all future users would be correct, but I
> wanted to gate that on the absence of ->iomap_valid() and folio_ops
> isn't provided until iomap_begin() time. I suppose another way around
> that could be to add a flags param to iomap_zero_range() where the
> caller could explicitly opt out of a flush, but that's still kind of
> ugly. I dunno, maybe better than nothing..?

We want to avoid the flush in this case if we can - what XFS does is
a workaround for iomap not handling dirty data over unwritten
extents. That first flush causes performance issues with certain
truncate heavy workloads, so we really want to avoid it in the
generic code if we can.

> So IMO, this raises the question of whether this is just unnecessarily
> overcomplicated. The KISS principle implies that it would also be
> perfectly fine to do a conditional "flush and stale" in zero range
> whenever we see the combination of an unwritten mapping and dirty
> pagecache (the latter checked before or during ->iomap_begin()). That's
> simple to implement and AFAICT would work/perform adequately and
> generically for all filesystems. I have one or two prototypes of this
> sort of thing if folks want to see it as an alternative.

If we are going to zero the range, and the range is already
unwritten, then why do we need to flush the data in the cache to
make it clean and written before running the zeroing? Why not just
invalidate the entire cache over the unwritten region and so return it
all to containing zeroes (i.e. is unwritten!) without doing any IO.

Yes, if some of the range is under writeback, the invalidation will
have to wait for that to complete - invalidate_inode_pages2_range()
does this for us - but after the invalidation those regions will now
be written and iomap revalidation after page cache invalidation will
detect this.

So maybe the solution is simply to invalidate the cache over
unwritten extents and then revalidate the iomap? If the iomap is
still valid, then we can skip the unwritten extent completely. If
the invalidation returned -EBUSY or the iomap is stale, then remap
it and try again?

If we don't have an iomap validation function, then we could check
filemap_range_needs_writeback() before calling
invalidate_inode_pages2_range() as that will tell us if there were
folios that might have been under writeback during the invalidation.
In that case, we can treat "needs writeback" the same as a failed
iomap revalidation.

So what am I missing? What does the flush actually accomplish that
simply calling invalidate_inode_pages2_range() to throw the data we
need to zero away doesn't?

-Dave.
-- 
Dave Chinner
david@fromorbit.com


  parent reply	other threads:[~2024-07-19  1:10 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-07-18 13:02 Brian Foster
2024-07-18 13:02 ` [PATCH 3/4] iomap: fix handling of dirty folios over unwritten extents Brian Foster
2024-07-19  0:25   ` Dave Chinner
2024-07-19 15:17     ` Brian Foster
2024-07-18 13:02 ` [PATCH 4/4] xfs: remove unnecessary flush of eof page from truncate Brian Foster
     [not found] ` <20240718130212.23905-2-bfoster@redhat.com>
2024-07-18 15:09   ` [PATCH 1/4] filemap: return pos of first dirty folio from range_has_writeback Matthew Wilcox
2024-07-18 15:36 ` [PATCH RFC 0/4] iomap: zero dirty folios over unwritten mappings on zero range Josef Bacik
2024-07-18 16:02   ` Darrick J. Wong
2024-07-18 16:40     ` Brian Foster
2024-07-19  1:10 ` Dave Chinner [this message]
2024-07-19 15:22   ` Brian Foster

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zpm9BFLjU0DkHBWc@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=bfoster@redhat.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox