From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Zhang Yi <yi.zhang@huaweicloud.com>,
Christoph Hellwig <hch@infradead.org>,
linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
linux-xfs@vger.kernel.org, tytso@mit.edu,
adilger.kernel@dilger.ca, jack@suse.cz, ritesh.list@gmail.com,
willy@infradead.org, zokeefe@google.com, yi.zhang@huawei.com,
chengzhihao1@huawei.com, yukuai3@huawei.com,
wangkefeng.wang@huawei.com
Subject: Re: [RFC PATCH v3 07/26] iomap: don't increase i_size if it's not a write operation
Date: Thu, 29 Feb 2024 15:29:16 -0800 [thread overview]
Message-ID: <20240229232916.GE1927156@frogsfrogsfrogs> (raw)
In-Reply-To: <ZeERAob9Imwh01bG@dread.disaster.area>
On Fri, Mar 01, 2024 at 10:19:30AM +1100, Dave Chinner wrote:
> On Thu, Feb 29, 2024 at 04:59:34PM +0800, Zhang Yi wrote:
> > Hello, Dave!
> >
> > On 2024/2/29 6:25, Dave Chinner wrote:
> > > On Wed, Feb 28, 2024 at 04:53:32PM +0800, Zhang Yi wrote:
> > >> On 2024/2/13 13:46, Christoph Hellwig wrote:
> > >>> Wouldn't it make more sense to just move the size manipulation to the
> > >>> write-only code? An untested version of that is below. With this
> > >>> the naming of the status variable becomes even more confusing than
> > >>> it already is, maybe we need to do a cleanup of the *_write_end
> > >>> calling conventions as it always returns the passed in copied value
> > >>> or 0.
> > >>>
> > >>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > >>> index 3dab060aed6d7b..8401a9ca702fc0 100644
> > >>> --- a/fs/iomap/buffered-io.c
> > >>> +++ b/fs/iomap/buffered-io.c
> > >>> @@ -876,34 +876,13 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
> > >>> size_t copied, struct folio *folio)
> > >>> {
> > >>> const struct iomap *srcmap = iomap_iter_srcmap(iter);
> > >>> - loff_t old_size = iter->inode->i_size;
> > >>> - size_t ret;
> > >>> -
> > >>> - if (srcmap->type == IOMAP_INLINE) {
> > >>> - ret = iomap_write_end_inline(iter, folio, pos, copied);
> > >>> - } else if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
> > >>> - ret = block_write_end(NULL, iter->inode->i_mapping, pos, len,
> > >>> - copied, &folio->page, NULL);
> > >>> - } else {
> > >>> - ret = __iomap_write_end(iter->inode, pos, len, copied, folio);
> > >>> - }
> > >>> -
> > >>> - /*
> > >>> - * Update the in-memory inode size after copying the data into the page
> > >>> - * cache. It's up to the file system to write the updated size to disk,
> > >>> - * preferably after I/O completion so that no stale data is exposed.
> > >>> - */
> > >>> - if (pos + ret > old_size) {
> > >>> - i_size_write(iter->inode, pos + ret);
> > >>> - iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
> > >>> - }
> > >>
> > >> I've recently discovered that if we don't increase i_size in
> > >> iomap_zero_iter(), it would break fstests generic/476 on xfs. xfs
> > >> depends on iomap_zero_iter() to increase i_size in some cases.
> > >>
> > >> generic/476 75s ... _check_xfs_filesystem: filesystem on /dev/pmem2 is inconsistent (r)
> > >> (see /home/zhangyi/xfstests-dev/results//xfs/generic/476.full for details)
> > >>
> > >> _check_xfs_filesystem: filesystem on /dev/pmem2 is inconsistent (r)
> > >> *** xfs_repair -n output ***
> > >> Phase 1 - find and verify superblock...
> > >> Phase 2 - using internal log
> > >> - zero log...
> > >> - scan filesystem freespace and inode maps...
> > >> sb_fdblocks 10916, counted 10923
> > >> - found root inode chunk
> > >> ...
> > >>
> > >> After debugging and analysis, I found the root cause of the problem is
> > >> related to the pre-allocations of xfs. xfs pre-allocates some blocks to
> > >> reduce fragmentation during buffer append writing, then if we write new
> > >> data or do file copy(reflink) after the end of the pre-allocating range,
> > >> xfs would zero-out and write back the pre-allocate space(e.g.
> > >> xfs_file_write_checks() -> xfs_zero_range()), so we have to update
> > >> i_size before writing back in iomap_zero_iter(), otherwise, it will
> > >> result in stale delayed extent.
> > >
> > > Ok, so this is long because the example is lacking in clear details
> > > so to try to understand it I've laid it out in detail to make sure
> > > I've understood it correctly.
> > >
> >
> > Thanks for the graph, the added detail makes things clear and easy to
> > understand. To be honest, it's not exactly the same as the results I
> > captured and described (the position A\B\C\D\E\F I described is
> > increased one by one), but the root cause of the problem is the same,
> > so it doesn't affect our understanding of the problem.
>
> OK, good :)
>
> .....
>
> > > However, if this did actually write zeroes to disk, this would end
> > > up with:
> > >
> > > A C B E F D
> > > +wwwwwwwwww+DDDDDDDDDDD+zzzzz+rrrrrrr+dddddd+
> > > EOF EOF
> > > (in memory) (on disk)
> > >
> > > Which is wrong - the file extension and zeros should not get exposed
> > > to the user until the entire reflink completes. This would expose
> > > zeros at the EOF and a file size that the user never asked for after
> > > a crash. Experience tells me that they would report this as
> > > "filesystem corrupting data on crash".
> > >
> > > If we move where i_size gets updated by iomap_zero_iter(), we get:
> > >
> > > A C B E F D
> > > +wwwwwwwwww+DDDDDDDDDDD+zzzzz+rrrrrrr+dddddd+
> > > EOF
> > > (in memory)
> > > (on disk)
> > >
> > > Which is also wrong, because now the user can see the size change
> > > and read zeros in the middle of the clone operation, which is also
> > > wrong.
> > >
> > > IOWs, we do not want to move the in-memory or on-disk EOF as a
> > > result of zeroing delalloc extents beyond EOF as it opens up
> > > transient, non-atomic on-disk states in the event of a crash.
> > >
> > > So, catch-22: we need to move the in-memory EOF to write back zeroes
> > > beyond EOF, but that would move the on-disk EOF to E before the
> > > clone operation starts. i.e. it makes clone non-atomic.
> >
> > Make sense. IIUC, I also notice that xfs_file_write_checks() zero
> > out EOF blocks if the later write offset is beyond the size of the
> > file. Think about if we replace the reflink operation to a buffer
> > write E to F, although it doesn't call xfs_flush_unmap_range()
> > directly, but if it could be raced by another background write
> > back, and trigger the same problem (I've not try to reproduce it,
> > so please correct me if I understand wrong).
>
> Correct, but the write is about to extend the file size when it
> writes into the cache beyond the zeroed region. There is no cache
> invalidate possible in this path, so the write of data moves the
> in-memory EOF past the zeroes in cache and everything works just
> fine.
>
> If it races with concurrent background writeback, the writeback will
> skip the zeroed range beyond EOF until they are exposed by the first
> data write beyond the zeroed post-eof region which moves the
> in-memory EOF.
>
> truncate(to a larger size) also does this same zeroing - the page
> cache is zeroed before we move the EOF in memory, and so the
> writeback will only occur once the in-memory EOF is moved. i.e. it
> effectively does:
>
> xfs_zero_range(oldsize to newsize)
> truncate_setsize(newsize)
> filemap_write_and_wait_range(old size to new size)
>
> > > What should acutally result from the iomap_zero_range() call from
> > > xfs_reflink_remap_prep() is a state like this:
> > >
> > > A C B E F D
> > > +wwwwwwwwww+DDDDDDDDDDD+uuuuu+rrrrrrr+dddddd+
> > > EOF EOF
> > > (on disk) (in memory)
> > >
> > > where 'u' are unwritten extent blocks.
> > >
> >
> > Yeah, this is a good solution.
> >
> > In xfs_file_write_checks(), I don't fully understand why we need
> > the xfs_zero_range().
>
> The EOF block may only be partially written. Hence on extension, we
> have to guarantee the part of that block beyond the current EOF is
> zero if the write leaves a hole between the current EOF and the
> start of the new extending write.
>
> > Theoretically, iomap have already handled
> > partial block zeroing for both buffered IO and DIO, so I guess
> > the only reason we still need it is to handle pre-allocated blocks
> > (no?).
>
> Historically speaking, Linux is able to leak data beyond EOF on
> writeback of partial EOF blocks (e.g. mmap() can write to the EOF
> page beyond EOF without failing. We try to mitigate these cases
> where we can, but we have to consider that at any time the data in
> the cache beyond EOF can be non-zero thanks to mmap() and so any
> file extension *must* zero any region beyond EOF cached in the page
> cache.
>
> > If so,would it be better to call xfs_free_eofblocks() to
> > release all the preallocated extents in range? If not, maybe we
> > could only zero out mapped partial blocks and also release
> > preallocated extents?
>
> No, that will cause all sorts of other performance problems,
> especially for reflinked files that triggering COW
> operations...
>
> >
> > In xfs_reflink_remap_prep(), I read the commit 410fdc72b05a ("xfs:
> > zero posteof blocks when cloning above eof"), xfs used to release
> > preallocations, the change log said it didn't work because of the
> > PREALLOC flag, but the 'force' parameter is 'true' when calling
> > xfs_can_free_eofblocks(), so I don't get the problem met. Could we
> > fall back to use xfs_free_eofblocks() and make a state like this?
> >
> > A C B E F D
> > +wwwwwwwwww+DDDDDDDDDDD+hhhhh+rrrrrrr+dddddd+
> > EOF EOF
> > (on disk) (in memory)
>
> It could, but that then requires every place that may call
> xfs_zero_range() to be aware of this need to trim EOF blocks to do
> the right thing in all cases. We don't want to remove speculative
> delalloc in the write() path nor in the truncate(up) case, and so it
> doesn't fix the general problem of zeroing specualtive delalloc
> beyond EOF requiring writeback to push page caceh pages to disk
> before the inode size has been updated.
>
> The general solution is to have zeroing of speculative prealloc
> extents beyond EOF simply convert the range to unwritten and then
> invalidate any cached pages over that range. At this point, we are
> guaranteed to have zeroes across that range, all without needing to
> do any IO at all...
That (separate iomap ops that do the delalloc -> unwritten allocation)
seems a lot more straightforward to me than whacking preallocations.
--D
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
next prev parent reply other threads:[~2024-02-29 23:29 UTC|newest]
Thread overview: 56+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-01-27 1:57 [RFC PATCH v3 00/26] ext4: use iomap for regular file's buffered IO path and enable large foilo Zhang Yi
2024-01-27 1:58 ` [PATCH v3 01/26] ext4: refactor ext4_da_map_blocks() Zhang Yi
2024-02-03 17:56 ` Theodore Ts'o
2024-01-27 1:58 ` [PATCH v3 02/26] ext4: convert to exclusive lock while inserting delalloc extents Zhang Yi
2024-02-03 17:56 ` Theodore Ts'o
2024-01-27 1:58 ` [PATCH v3 03/26] ext4: correct the hole length returned by ext4_map_blocks() Zhang Yi
2024-02-03 17:56 ` Theodore Ts'o
2024-05-09 15:16 ` Luis Henriques
[not found] ` <20240509163953.GI3620298@mit.edu>
2024-05-09 17:23 ` Luis Henriques
2024-05-10 3:39 ` Zhang Yi
2024-05-10 9:41 ` Luis Henriques
2024-05-10 11:40 ` Zhang Yi
2024-01-27 1:58 ` [PATCH v3 04/26] ext4: add a hole extent entry in cache after punch Zhang Yi
2024-02-03 17:56 ` Theodore Ts'o
2024-01-27 1:58 ` [PATCH v3 05/26] ext4: make ext4_map_blocks() distinguish delalloc only extent Zhang Yi
2024-02-03 17:57 ` Theodore Ts'o
2024-01-27 1:58 ` [PATCH v3 06/26] ext4: make ext4_set_iomap() recognize IOMAP_DELALLOC map type Zhang Yi
2024-02-03 17:57 ` Theodore Ts'o
2024-01-27 1:58 ` [RFC PATCH v3 07/26] iomap: don't increase i_size if it's not a write operation Zhang Yi
2024-02-13 5:46 ` Christoph Hellwig
2024-02-17 8:55 ` Zhang Yi
2024-02-18 23:30 ` Dave Chinner
2024-02-19 1:14 ` Zhang Yi
2024-02-28 8:53 ` Zhang Yi
2024-02-28 22:13 ` Christoph Hellwig
2024-02-29 9:20 ` Zhang Yi
2024-02-28 22:25 ` Dave Chinner
2024-02-29 8:59 ` Zhang Yi
2024-02-29 23:19 ` Dave Chinner
2024-02-29 23:29 ` Darrick J. Wong [this message]
2024-03-01 3:26 ` Zhang Yi
2024-01-27 1:58 ` [RFC PATCH v3 08/26] iomap: add pos and dirty_len into trace_iomap_writepage_map Zhang Yi
2024-02-12 6:02 ` Christoph Hellwig
2024-02-19 1:27 ` Zhang Yi
2024-01-27 1:58 ` [RFC PATCH v3 09/26] ext4: allow inserting delalloc extents with multi-blocks Zhang Yi
2024-01-27 1:58 ` [RFC PATCH v3 10/26] ext4: correct delalloc extent length Zhang Yi
2024-01-27 1:58 ` [RFC PATCH v3 11/26] ext4: also mark extent as delalloc if it's been unwritten Zhang Yi
2024-01-27 1:58 ` [RFC PATCH v3 12/26] ext4: factor out bh handles to ext4_da_get_block_prep() Zhang Yi
2024-01-27 1:58 ` [RFC PATCH v3 13/26] ext4: use reserved metadata blocks when splitting extent in endio Zhang Yi
2024-01-27 1:58 ` [RFC PATCH v3 14/26] ext4: factor out ext4_map_{create|query}_blocks() Zhang Yi
2024-01-27 1:58 ` [RFC PATCH v3 15/26] ext4: introduce seq counter for extent entry Zhang Yi
2024-01-27 1:58 ` [RFC PATCH v3 16/26] ext4: add a new iomap aops for regular file's buffered IO path Zhang Yi
2024-01-27 1:58 ` [RFC PATCH v3 17/26] ext4: implement buffered read iomap path Zhang Yi
2024-01-27 1:58 ` [RFC PATCH v3 18/26] ext4: implement buffered write " Zhang Yi
2024-01-27 1:58 ` [RFC PATCH v3 19/26] ext4: implement writeback " Zhang Yi
2024-01-27 1:58 ` [RFC PATCH v3 20/26] ext4: implement mmap " Zhang Yi
2024-01-27 1:58 ` [RFC PATCH v3 21/26] ext4: implement zero_range " Zhang Yi
2024-01-27 1:58 ` [RFC PATCH v3 22/26] ext4: writeback partial blocks before zero range Zhang Yi
2024-01-27 1:58 ` [RFC PATCH v3 23/26] ext4: fall back to buffer_head path for defrag Zhang Yi
2024-01-27 1:58 ` [RFC PATCH v3 24/26] ext4: partially enable iomap for regular file's buffered IO path Zhang Yi
2024-01-27 1:58 ` [RFC PATCH v3 25/26] filemap: support disable large folios on active inode Zhang Yi
2024-01-27 1:58 ` [RFC PATCH v3 26/26] ext4: enable large folio for regular file with iomap buffered IO path Zhang Yi
2024-02-12 6:18 ` [RFC PATCH v3 00/26] ext4: use iomap for regular file's buffered IO path and enable large foilo Darrick J. Wong
2024-02-12 9:16 ` Ritesh Harjani
2024-02-12 10:24 ` Matthew Wilcox
2024-02-17 9:31 ` Zhang Yi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240229232916.GE1927156@frogsfrogsfrogs \
--to=djwong@kernel.org \
--cc=adilger.kernel@dilger.ca \
--cc=chengzhihao1@huawei.com \
--cc=david@fromorbit.com \
--cc=hch@infradead.org \
--cc=jack@suse.cz \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-xfs@vger.kernel.org \
--cc=ritesh.list@gmail.com \
--cc=tytso@mit.edu \
--cc=wangkefeng.wang@huawei.com \
--cc=willy@infradead.org \
--cc=yi.zhang@huawei.com \
--cc=yi.zhang@huaweicloud.com \
--cc=yukuai3@huawei.com \
--cc=zokeefe@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox