Re: [RFC PATCH v3 07/26] iomap: don't increase i_size if it's not a write operation

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Zhang Yi <yi.zhang@huaweicloud.com>,
	Christoph Hellwig <hch@infradead.org>,
	linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-xfs@vger.kernel.org, tytso@mit.edu,
	adilger.kernel@dilger.ca, jack@suse.cz, ritesh.list@gmail.com,
	willy@infradead.org, zokeefe@google.com, yi.zhang@huawei.com,
	chengzhihao1@huawei.com, yukuai3@huawei.com,
	wangkefeng.wang@huawei.com
Subject: Re: [RFC PATCH v3 07/26] iomap: don't increase i_size if it's not a write operation
Date: Thu, 29 Feb 2024 15:29:16 -0800	[thread overview]
Message-ID: <20240229232916.GE1927156@frogsfrogsfrogs> (raw)
In-Reply-To: <ZeERAob9Imwh01bG@dread.disaster.area>

On Fri, Mar 01, 2024 at 10:19:30AM +1100, Dave Chinner wrote:
> On Thu, Feb 29, 2024 at 04:59:34PM +0800, Zhang Yi wrote:
> > Hello, Dave!
> > 
> > On 2024/2/29 6:25, Dave Chinner wrote:
> > > On Wed, Feb 28, 2024 at 04:53:32PM +0800, Zhang Yi wrote:
> > >> On 2024/2/13 13:46, Christoph Hellwig wrote:
> > >>> Wouldn't it make more sense to just move the size manipulation to the
> > >>> write-only code?  An untested version of that is below.  With this
> > >>> the naming of the status variable becomes even more confusing than
> > >>> it already is, maybe we need to do a cleanup of the *_write_end
> > >>> calling conventions as it always returns the passed in copied value
> > >>> or 0.
> > >>>
> > >>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > >>> index 3dab060aed6d7b..8401a9ca702fc0 100644
> > >>> --- a/fs/iomap/buffered-io.c
> > >>> +++ b/fs/iomap/buffered-io.c
> > >>> @@ -876,34 +876,13 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
> > >>>  		size_t copied, struct folio *folio)
> > >>>  {
> > >>>  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
> > >>> -	loff_t old_size = iter->inode->i_size;
> > >>> -	size_t ret;
> > >>> -
> > >>> -	if (srcmap->type == IOMAP_INLINE) {
> > >>> -		ret = iomap_write_end_inline(iter, folio, pos, copied);
> > >>> -	} else if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
> > >>> -		ret = block_write_end(NULL, iter->inode->i_mapping, pos, len,
> > >>> -				copied, &folio->page, NULL);
> > >>> -	} else {
> > >>> -		ret = __iomap_write_end(iter->inode, pos, len, copied, folio);
> > >>> -	}
> > >>> -
> > >>> -	/*
> > >>> -	 * Update the in-memory inode size after copying the data into the page
> > >>> -	 * cache.  It's up to the file system to write the updated size to disk,
> > >>> -	 * preferably after I/O completion so that no stale data is exposed.
> > >>> -	 */
> > >>> -	if (pos + ret > old_size) {
> > >>> -		i_size_write(iter->inode, pos + ret);
> > >>> -		iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
> > >>> -	}
> > >>
> > >> I've recently discovered that if we don't increase i_size in
> > >> iomap_zero_iter(), it would break fstests generic/476 on xfs. xfs
> > >> depends on iomap_zero_iter() to increase i_size in some cases.
> > >>
> > >>  generic/476 75s ... _check_xfs_filesystem: filesystem on /dev/pmem2 is inconsistent (r)
> > >>  (see /home/zhangyi/xfstests-dev/results//xfs/generic/476.full for details)
> > >>
> > >>  _check_xfs_filesystem: filesystem on /dev/pmem2 is inconsistent (r)
> > >>  *** xfs_repair -n output ***
> > >>  Phase 1 - find and verify superblock...
> > >>  Phase 2 - using internal log
> > >>          - zero log...
> > >>          - scan filesystem freespace and inode maps...
> > >>  sb_fdblocks 10916, counted 10923
> > >>          - found root inode chunk
> > >>  ...
> > >>
> > >> After debugging and analysis, I found the root cause of the problem is
> > >> related to the pre-allocations of xfs. xfs pre-allocates some blocks to
> > >> reduce fragmentation during buffer append writing, then if we write new
> > >> data or do file copy(reflink) after the end of the pre-allocating range,
> > >> xfs would zero-out and write back the pre-allocate space(e.g.
> > >> xfs_file_write_checks() -> xfs_zero_range()), so we have to update
> > >> i_size before writing back in iomap_zero_iter(), otherwise, it will
> > >> result in stale delayed extent.
> > > 
> > > Ok, so this is long because the example is lacking in clear details
> > > so to try to understand it I've laid it out in detail to make sure
> > > I've understood it correctly.
> > > 
> > 
> > Thanks for the graph, the added detail makes things clear and easy to
> > understand. To be honest, it's not exactly the same as the results I
> > captured and described (the position A\B\C\D\E\F I described is
> > increased one by one), but the root cause of the problem is the same,
> > so it doesn't affect our understanding of the problem.
> 
> OK, good :)
> 
> .....
> 
> > > However, if this did actually write zeroes to disk, this would end
> > > up with:
> > > 
> > > 	A          C           B     E       F      D
> > > 	+wwwwwwwwww+DDDDDDDDDDD+zzzzz+rrrrrrr+dddddd+
> > > 	                      EOF   EOF
> > >                       (in memory)   (on disk)
> > > 
> > > Which is wrong - the file extension and zeros should not get exposed
> > > to the user until the entire reflink completes. This would expose
> > > zeros at the EOF and a file size that the user never asked for after
> > > a crash. Experience tells me that they would report this as
> > > "filesystem corrupting data on crash".
> > > 
> > > If we move where i_size gets updated by iomap_zero_iter(), we get:
> > > 
> > > 	A          C           B     E       F      D
> > > 	+wwwwwwwwww+DDDDDDDDDDD+zzzzz+rrrrrrr+dddddd+
> > > 	                            EOF
> > >                                 (in memory)
> > > 		                 (on disk)
> > > 
> > > Which is also wrong, because now the user can see the size change
> > > and read zeros in the middle of the clone operation, which is also
> > > wrong.
> > > 
> > > IOWs, we do not want to move the in-memory or on-disk EOF as a
> > > result of zeroing delalloc extents beyond EOF as it opens up
> > > transient, non-atomic on-disk states in the event of a crash.
> > > 
> > > So, catch-22: we need to move the in-memory EOF to write back zeroes
> > > beyond EOF, but that would move the on-disk EOF to E before the
> > > clone operation starts. i.e. it makes clone non-atomic.
> > 
> > Make sense. IIUC, I also notice that xfs_file_write_checks() zero
> > out EOF blocks if the later write offset is beyond the size of the
> > file.  Think about if we replace the reflink operation to a buffer
> > write E to F, although it doesn't call xfs_flush_unmap_range()
> > directly, but if it could be raced by another background write
> > back, and trigger the same problem (I've not try to reproduce it,
> > so please correct me if I understand wrong).
> 
> Correct, but the write is about to extend the file size when it
> writes into the cache beyond the zeroed region. There is no cache
> invalidate possible in this path, so the write of data moves the
> in-memory EOF past the zeroes in cache and everything works just
> fine.
> 
> If it races with concurrent background writeback, the writeback will
> skip the zeroed range beyond EOF until they are exposed by the first
> data write beyond the zeroed post-eof region which moves the
> in-memory EOF.
> 
> truncate(to a larger size) also does this same zeroing - the page
> cache is zeroed before we move the EOF in memory, and so the
> writeback will only occur once the in-memory EOF is moved. i.e. it
> effectively does:
> 
> 	xfs_zero_range(oldsize to newsize)
> 	truncate_setsize(newsize)
> 	filemap_write_and_wait_range(old size to new size)
> 
> > > What should acutally result from the iomap_zero_range() call from
> > > xfs_reflink_remap_prep() is a state like this:
> > > 
> > > 	A          C           B     E       F      D
> > > 	+wwwwwwwwww+DDDDDDDDDDD+uuuuu+rrrrrrr+dddddd+
> > > 	          EOF         EOF
> > >                (on disk)  (in memory)
> > > 
> > > where 'u' are unwritten extent blocks.
> > > 
> > 
> > Yeah, this is a good solution.
> > 
> > In xfs_file_write_checks(), I don't fully understand why we need
> > the xfs_zero_range().
> 
> The EOF block may only be partially written. Hence on extension, we
> have to guarantee the part of that block beyond the current EOF is
> zero if the write leaves a hole between the current EOF and the
> start of the new extending write.
> 
> > Theoretically, iomap have already handled
> > partial block zeroing for both buffered IO and DIO, so I guess
> > the only reason we still need it is to handle pre-allocated blocks
> > (no?).
> 
> Historically speaking, Linux is able to leak data beyond EOF on
> writeback of partial EOF blocks (e.g. mmap() can write to the EOF
> page beyond EOF without failing. We try to mitigate these cases
> where we can, but we have to consider that at any time the data in
> the cache beyond EOF can be non-zero thanks to mmap() and so any
> file extension *must* zero any region beyond EOF cached in the page
> cache.
> 
> > If so，would it be better to call xfs_free_eofblocks() to
> > release all the preallocated extents in range? If not, maybe we
> > could only zero out mapped partial blocks and also release
> > preallocated extents?
> 
> No, that will cause all sorts of other performance problems,
> especially for reflinked files that triggering COW
> operations...
> 
> > 
> > In xfs_reflink_remap_prep(), I read the commit 410fdc72b05a ("xfs:
> > zero posteof blocks when cloning above eof"), xfs used to release
> > preallocations, the change log said it didn't work because of the
> > PREALLOC flag, but the 'force' parameter is 'true' when calling
> > xfs_can_free_eofblocks(), so I don't get the problem met. Could we
> > fall back to use xfs_free_eofblocks() and make a state like this?
> > 
> >  	A          C           B     E       F      D
> >  	+wwwwwwwwww+DDDDDDDDDDD+hhhhh+rrrrrrr+dddddd+
> >  	          EOF         EOF
> >                 (on disk)  (in memory)
> 
> It could, but that then requires every place that may call
> xfs_zero_range() to be aware of this need to trim EOF blocks to do
> the right thing in all cases. We don't want to remove speculative
> delalloc in the write() path nor in the truncate(up) case, and so it
> doesn't fix the general problem of zeroing specualtive delalloc
> beyond EOF requiring writeback to push page caceh pages to disk
> before the inode size has been updated.
> 
> The general solution is to have zeroing of speculative prealloc
> extents beyond EOF simply convert the range to unwritten and then
> invalidate any cached pages over that range. At this point, we are
> guaranteed to have zeroes across that range, all without needing to
> do any IO at all...

That (separate iomap ops that do the delalloc -> unwritten allocation)
seems a lot more straightforward to me than whacking preallocations.

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>

next prev parent reply	other threads:[~2024-02-29 23:29 UTC|newest]

Thread overview: 56+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-27  1:57 [RFC PATCH v3 00/26] ext4: use iomap for regular file's buffered IO path and enable large foilo Zhang Yi
2024-01-27  1:58 ` [PATCH v3 01/26] ext4: refactor ext4_da_map_blocks() Zhang Yi
2024-02-03 17:56   ` Theodore Ts'o
2024-01-27  1:58 ` [PATCH v3 02/26] ext4: convert to exclusive lock while inserting delalloc extents Zhang Yi
2024-02-03 17:56   ` Theodore Ts'o
2024-01-27  1:58 ` [PATCH v3 03/26] ext4: correct the hole length returned by ext4_map_blocks() Zhang Yi
2024-02-03 17:56   ` Theodore Ts'o
2024-05-09 15:16   ` Luis Henriques
     [not found]     ` <20240509163953.GI3620298@mit.edu>
2024-05-09 17:23       ` Luis Henriques
2024-05-10  3:39         ` Zhang Yi
2024-05-10  9:41           ` Luis Henriques
2024-05-10 11:40             ` Zhang Yi
2024-01-27  1:58 ` [PATCH v3 04/26] ext4: add a hole extent entry in cache after punch Zhang Yi
2024-02-03 17:56   ` Theodore Ts'o
2024-01-27  1:58 ` [PATCH v3 05/26] ext4: make ext4_map_blocks() distinguish delalloc only extent Zhang Yi
2024-02-03 17:57   ` Theodore Ts'o
2024-01-27  1:58 ` [PATCH v3 06/26] ext4: make ext4_set_iomap() recognize IOMAP_DELALLOC map type Zhang Yi
2024-02-03 17:57   ` Theodore Ts'o
2024-01-27  1:58 ` [RFC PATCH v3 07/26] iomap: don't increase i_size if it's not a write operation Zhang Yi
2024-02-13  5:46   ` Christoph Hellwig
2024-02-17  8:55     ` Zhang Yi
2024-02-18 23:30       ` Dave Chinner
2024-02-19  1:14         ` Zhang Yi
2024-02-28  8:53     ` Zhang Yi
2024-02-28 22:13       ` Christoph Hellwig
2024-02-29  9:20         ` Zhang Yi
2024-02-28 22:25       ` Dave Chinner
2024-02-29  8:59         ` Zhang Yi
2024-02-29 23:19           ` Dave Chinner
2024-02-29 23:29             ` Darrick J. Wong [this message]
2024-03-01  3:26             ` Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 08/26] iomap: add pos and dirty_len into trace_iomap_writepage_map Zhang Yi
2024-02-12  6:02   ` Christoph Hellwig
2024-02-19  1:27     ` Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 09/26] ext4: allow inserting delalloc extents with multi-blocks Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 10/26] ext4: correct delalloc extent length Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 11/26] ext4: also mark extent as delalloc if it's been unwritten Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 12/26] ext4: factor out bh handles to ext4_da_get_block_prep() Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 13/26] ext4: use reserved metadata blocks when splitting extent in endio Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 14/26] ext4: factor out ext4_map_{create|query}_blocks() Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 15/26] ext4: introduce seq counter for extent entry Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 16/26] ext4: add a new iomap aops for regular file's buffered IO path Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 17/26] ext4: implement buffered read iomap path Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 18/26] ext4: implement buffered write " Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 19/26] ext4: implement writeback " Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 20/26] ext4: implement mmap " Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 21/26] ext4: implement zero_range " Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 22/26] ext4: writeback partial blocks before zero range Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 23/26] ext4: fall back to buffer_head path for defrag Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 24/26] ext4: partially enable iomap for regular file's buffered IO path Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 25/26] filemap: support disable large folios on active inode Zhang Yi
2024-01-27  1:58 ` [RFC PATCH v3 26/26] ext4: enable large folio for regular file with iomap buffered IO path Zhang Yi
2024-02-12  6:18 ` [RFC PATCH v3 00/26] ext4: use iomap for regular file's buffered IO path and enable large foilo Darrick J. Wong
2024-02-12  9:16   ` Ritesh Harjani
2024-02-12 10:24     ` Matthew Wilcox
2024-02-17  9:31   ` Zhang Yi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240229232916.GE1927156@frogsfrogsfrogs \
    --to=djwong@kernel.org \
    --cc=adilger.kernel@dilger.ca \
    --cc=chengzhihao1@huawei.com \
    --cc=david@fromorbit.com \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=ritesh.list@gmail.com \
    --cc=tytso@mit.edu \
    --cc=wangkefeng.wang@huawei.com \
    --cc=willy@infradead.org \
    --cc=yi.zhang@huawei.com \
    --cc=yi.zhang@huaweicloud.com \
    --cc=yukuai3@huawei.com \
    --cc=zokeefe@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox