From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5F2A3C48BF6 for ; Thu, 29 Feb 2024 23:29:21 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id E811D94000A; Thu, 29 Feb 2024 18:29:20 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id E3039940007; Thu, 29 Feb 2024 18:29:20 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id CD0CB94000A; Thu, 29 Feb 2024 18:29:20 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id B61F0940007 for ; Thu, 29 Feb 2024 18:29:20 -0500 (EST) Received: from smtpin24.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id 85BDB803A9 for ; Thu, 29 Feb 2024 23:29:20 +0000 (UTC) X-FDA: 81846434880.24.8D2D036 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by imf10.hostedemail.com (Postfix) with ESMTP id D5CD2C000F for ; Thu, 29 Feb 2024 23:29:18 +0000 (UTC) Authentication-Results: imf10.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=f5iTNON8; spf=pass (imf10.hostedemail.com: domain of djwong@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=djwong@kernel.org; dmarc=pass (policy=none) header.from=kernel.org ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1709249359; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=kpY1VWz3DUcPxmQUvn9eEbBiitRHceBa4VJWHr65DlA=; b=3XnFTfpMPxkzv+BKaena9wGaGixE5hGGl5ZCF2pgM1NbFBHUFXvpyRrwjyUFMMELvyvjpq mA1txJUegokUqpq5pxx8828lI85xVXDXjXnkDWIMuWXg+nWwSsQZrfngrHE/s8Nbb5CPPb 9bBr4pBaX9kUZZUq0tn/s6BOsbZ2EbI= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1709249359; a=rsa-sha256; cv=none; b=3wu5c27zopoukVPogNjgccexLEIoo1SFbIbko92TWJELcvbdRz3VV7hcT9UUB8U9uoMo/o 99EN+YOKuE8i5Z5edH7Zcr92S/k7tEKkWsyKWe9IKAyIClrcorReKqOLvitwUtl3Thw7AW cy806Ywyh0dAC7j60S+gExAFVoYzzlk= ARC-Authentication-Results: i=1; imf10.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=f5iTNON8; spf=pass (imf10.hostedemail.com: domain of djwong@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=djwong@kernel.org; dmarc=pass (policy=none) header.from=kernel.org Received: from smtp.kernel.org (transwarp.subspace.kernel.org [100.75.92.58]) by dfw.source.kernel.org (Postfix) with ESMTP id D46C16069C; Thu, 29 Feb 2024 23:29:17 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 71FCFC43330; Thu, 29 Feb 2024 23:29:17 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1709249357; bh=rlRoTcAkeLdFoK5I7nEY6DMlG7QMXz3Gox/lPZFNqfs=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=f5iTNON8Mm6LlrZpl5rixdiN447xNZOSdfzeHKdJJLty3w57h9sugpm8FB3dxXDqI MM+7fYLsH09bJDO+fW2VuyyOWfZU7qaRP0KTp2nd0S2BHfA7Av+kwHKN4BZq14vBGx NCZBL5qSui2LqzRf8KS8oO82IDbGOcBw/SdA0xR1Ir+M/xoYskivqObRneatyAQf/G kggdXwW9ID7lkAiSlMtiYvldXshZUrYICpxdiOBelHL8qaYS1vfdc1YuhUc2o36sqO LJPfzX9dWp9m9NqVoHgM/5LvLA0yW6RjJgESrF/MizkfvjTy6TXeUmEzqdr2mGFbAF 6FNf/6KcUupFg== Date: Thu, 29 Feb 2024 15:29:16 -0800 From: "Darrick J. Wong" To: Dave Chinner Cc: Zhang Yi , Christoph Hellwig , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org, tytso@mit.edu, adilger.kernel@dilger.ca, jack@suse.cz, ritesh.list@gmail.com, willy@infradead.org, zokeefe@google.com, yi.zhang@huawei.com, chengzhihao1@huawei.com, yukuai3@huawei.com, wangkefeng.wang@huawei.com Subject: Re: [RFC PATCH v3 07/26] iomap: don't increase i_size if it's not a write operation Message-ID: <20240229232916.GE1927156@frogsfrogsfrogs> References: <20240127015825.1608160-1-yi.zhang@huaweicloud.com> <20240127015825.1608160-8-yi.zhang@huaweicloud.com> <9b0040ef-3d9d-6246-4bdd-82b9a8f55fa2@huaweicloud.com> <45c1607a-805d-e7a2-a5ca-3fd7e507a664@huaweicloud.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: X-Rspamd-Queue-Id: D5CD2C000F X-Rspam-User: X-Rspamd-Server: rspam11 X-Stat-Signature: 6xmaug38a83jmckjfbgo4819add933qx X-HE-Tag: 1709249358-287002 X-HE-Meta: U2FsdGVkX1+l2p1o2jHypxI7AzfR9i7JZ7e0Y66X3m8PxY3F3PlVpM6hFvxhiGioFqC8nc+E4Onhiw65QhbxPGBfY5tK6EUm1gR8IOQqXzDshkfSady561QAi36VVe40Ad99Z5hIqMsJPmOkqctZWY7zEfY8Rrlaw+UhMdbk5sCOytYI7jO3IrDVuAQBUeze8Kpu55TNupaR40dEV6rFRHwxbn4h/0mSk0uumBtZQl3/rjWhWFjAgEnDnK01QG8TxXaHPPZyMsTmzjt7fOxtWAPSfO6ll7Cwj1VF4R1fvfzANrmPnj939rrw0rgiBBdwzMi4yIPH+wrJBewudTx0sES3nD+/sQdDMU2TEmI8M3hG+BOPZllIPYiBI1lSEr38tpX5xFG6sg4cc9DCySijFa2z4ZOPMcEN/F139JrXw//4DOXlK7WXdSIr/4BeOqjoz+iENvRyH09tvfB7l/l+D9KCC4/0KVjBFDNvPjDlW9vEjyIm0sfmqNUv6YUrNBno9ZN1DGhwPBUlWLCK4Izvin6oJA0nYVVaAA2h0Mbl/DVNliKwzWV1X2pl8YsPQjQ7qZPrFnKrFDMojP5P/6F3sUw/Cg0i6ZwhvZiCRcOkEjpklOojgmOgKo2pSTgakTHflteeRMNwJ9g3rUjXcxM6DFKxEB2mA8LNXhyfp8GNeovVjW5+Eo5YRp50dh5JmMHFwB18L7uJ4cu0+u0MBhtK4V5SFW2TYfkpKGYjD8bHxlqlOZ0VrdEaaKjzU8AhQEOlD/a8/Le43iKcqgEATWGpIJv5SooTZ1Mcc3IeiGNy1nHjO1n6Zj1V/CTU9EU6hIK7teHBhqrMyxT7xiDs0EL03tDTN2sZpFnPOKeAH5G/hH8z+G1+pMWXSxw/J2CBjd7hJSecmfjFkeA5bb8U+XhOp7zoWQbc8KKDy1o1PDNdtY5XlGzWfCKL13IuaQXRCozhsTTPyB2WSUmo4vqZvk2 n1/JUwSc icoUNk8ckMUBeVpLnRp1r3WeTtzcn+M5HenWh X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Fri, Mar 01, 2024 at 10:19:30AM +1100, Dave Chinner wrote: > On Thu, Feb 29, 2024 at 04:59:34PM +0800, Zhang Yi wrote: > > Hello, Dave! > > > > On 2024/2/29 6:25, Dave Chinner wrote: > > > On Wed, Feb 28, 2024 at 04:53:32PM +0800, Zhang Yi wrote: > > >> On 2024/2/13 13:46, Christoph Hellwig wrote: > > >>> Wouldn't it make more sense to just move the size manipulation to the > > >>> write-only code? An untested version of that is below. With this > > >>> the naming of the status variable becomes even more confusing than > > >>> it already is, maybe we need to do a cleanup of the *_write_end > > >>> calling conventions as it always returns the passed in copied value > > >>> or 0. > > >>> > > >>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > > >>> index 3dab060aed6d7b..8401a9ca702fc0 100644 > > >>> --- a/fs/iomap/buffered-io.c > > >>> +++ b/fs/iomap/buffered-io.c > > >>> @@ -876,34 +876,13 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len, > > >>> size_t copied, struct folio *folio) > > >>> { > > >>> const struct iomap *srcmap = iomap_iter_srcmap(iter); > > >>> - loff_t old_size = iter->inode->i_size; > > >>> - size_t ret; > > >>> - > > >>> - if (srcmap->type == IOMAP_INLINE) { > > >>> - ret = iomap_write_end_inline(iter, folio, pos, copied); > > >>> - } else if (srcmap->flags & IOMAP_F_BUFFER_HEAD) { > > >>> - ret = block_write_end(NULL, iter->inode->i_mapping, pos, len, > > >>> - copied, &folio->page, NULL); > > >>> - } else { > > >>> - ret = __iomap_write_end(iter->inode, pos, len, copied, folio); > > >>> - } > > >>> - > > >>> - /* > > >>> - * Update the in-memory inode size after copying the data into the page > > >>> - * cache. It's up to the file system to write the updated size to disk, > > >>> - * preferably after I/O completion so that no stale data is exposed. > > >>> - */ > > >>> - if (pos + ret > old_size) { > > >>> - i_size_write(iter->inode, pos + ret); > > >>> - iter->iomap.flags |= IOMAP_F_SIZE_CHANGED; > > >>> - } > > >> > > >> I've recently discovered that if we don't increase i_size in > > >> iomap_zero_iter(), it would break fstests generic/476 on xfs. xfs > > >> depends on iomap_zero_iter() to increase i_size in some cases. > > >> > > >> generic/476 75s ... _check_xfs_filesystem: filesystem on /dev/pmem2 is inconsistent (r) > > >> (see /home/zhangyi/xfstests-dev/results//xfs/generic/476.full for details) > > >> > > >> _check_xfs_filesystem: filesystem on /dev/pmem2 is inconsistent (r) > > >> *** xfs_repair -n output *** > > >> Phase 1 - find and verify superblock... > > >> Phase 2 - using internal log > > >> - zero log... > > >> - scan filesystem freespace and inode maps... > > >> sb_fdblocks 10916, counted 10923 > > >> - found root inode chunk > > >> ... > > >> > > >> After debugging and analysis, I found the root cause of the problem is > > >> related to the pre-allocations of xfs. xfs pre-allocates some blocks to > > >> reduce fragmentation during buffer append writing, then if we write new > > >> data or do file copy(reflink) after the end of the pre-allocating range, > > >> xfs would zero-out and write back the pre-allocate space(e.g. > > >> xfs_file_write_checks() -> xfs_zero_range()), so we have to update > > >> i_size before writing back in iomap_zero_iter(), otherwise, it will > > >> result in stale delayed extent. > > > > > > Ok, so this is long because the example is lacking in clear details > > > so to try to understand it I've laid it out in detail to make sure > > > I've understood it correctly. > > > > > > > Thanks for the graph, the added detail makes things clear and easy to > > understand. To be honest, it's not exactly the same as the results I > > captured and described (the position A\B\C\D\E\F I described is > > increased one by one), but the root cause of the problem is the same, > > so it doesn't affect our understanding of the problem. > > OK, good :) > > ..... > > > > However, if this did actually write zeroes to disk, this would end > > > up with: > > > > > > A C B E F D > > > +wwwwwwwwww+DDDDDDDDDDD+zzzzz+rrrrrrr+dddddd+ > > > EOF EOF > > > (in memory) (on disk) > > > > > > Which is wrong - the file extension and zeros should not get exposed > > > to the user until the entire reflink completes. This would expose > > > zeros at the EOF and a file size that the user never asked for after > > > a crash. Experience tells me that they would report this as > > > "filesystem corrupting data on crash". > > > > > > If we move where i_size gets updated by iomap_zero_iter(), we get: > > > > > > A C B E F D > > > +wwwwwwwwww+DDDDDDDDDDD+zzzzz+rrrrrrr+dddddd+ > > > EOF > > > (in memory) > > > (on disk) > > > > > > Which is also wrong, because now the user can see the size change > > > and read zeros in the middle of the clone operation, which is also > > > wrong. > > > > > > IOWs, we do not want to move the in-memory or on-disk EOF as a > > > result of zeroing delalloc extents beyond EOF as it opens up > > > transient, non-atomic on-disk states in the event of a crash. > > > > > > So, catch-22: we need to move the in-memory EOF to write back zeroes > > > beyond EOF, but that would move the on-disk EOF to E before the > > > clone operation starts. i.e. it makes clone non-atomic. > > > > Make sense. IIUC, I also notice that xfs_file_write_checks() zero > > out EOF blocks if the later write offset is beyond the size of the > > file. Think about if we replace the reflink operation to a buffer > > write E to F, although it doesn't call xfs_flush_unmap_range() > > directly, but if it could be raced by another background write > > back, and trigger the same problem (I've not try to reproduce it, > > so please correct me if I understand wrong). > > Correct, but the write is about to extend the file size when it > writes into the cache beyond the zeroed region. There is no cache > invalidate possible in this path, so the write of data moves the > in-memory EOF past the zeroes in cache and everything works just > fine. > > If it races with concurrent background writeback, the writeback will > skip the zeroed range beyond EOF until they are exposed by the first > data write beyond the zeroed post-eof region which moves the > in-memory EOF. > > truncate(to a larger size) also does this same zeroing - the page > cache is zeroed before we move the EOF in memory, and so the > writeback will only occur once the in-memory EOF is moved. i.e. it > effectively does: > > xfs_zero_range(oldsize to newsize) > truncate_setsize(newsize) > filemap_write_and_wait_range(old size to new size) > > > > What should acutally result from the iomap_zero_range() call from > > > xfs_reflink_remap_prep() is a state like this: > > > > > > A C B E F D > > > +wwwwwwwwww+DDDDDDDDDDD+uuuuu+rrrrrrr+dddddd+ > > > EOF EOF > > > (on disk) (in memory) > > > > > > where 'u' are unwritten extent blocks. > > > > > > > Yeah, this is a good solution. > > > > In xfs_file_write_checks(), I don't fully understand why we need > > the xfs_zero_range(). > > The EOF block may only be partially written. Hence on extension, we > have to guarantee the part of that block beyond the current EOF is > zero if the write leaves a hole between the current EOF and the > start of the new extending write. > > > Theoretically, iomap have already handled > > partial block zeroing for both buffered IO and DIO, so I guess > > the only reason we still need it is to handle pre-allocated blocks > > (no?). > > Historically speaking, Linux is able to leak data beyond EOF on > writeback of partial EOF blocks (e.g. mmap() can write to the EOF > page beyond EOF without failing. We try to mitigate these cases > where we can, but we have to consider that at any time the data in > the cache beyond EOF can be non-zero thanks to mmap() and so any > file extension *must* zero any region beyond EOF cached in the page > cache. > > > If so,would it be better to call xfs_free_eofblocks() to > > release all the preallocated extents in range? If not, maybe we > > could only zero out mapped partial blocks and also release > > preallocated extents? > > No, that will cause all sorts of other performance problems, > especially for reflinked files that triggering COW > operations... > > > > > In xfs_reflink_remap_prep(), I read the commit 410fdc72b05a ("xfs: > > zero posteof blocks when cloning above eof"), xfs used to release > > preallocations, the change log said it didn't work because of the > > PREALLOC flag, but the 'force' parameter is 'true' when calling > > xfs_can_free_eofblocks(), so I don't get the problem met. Could we > > fall back to use xfs_free_eofblocks() and make a state like this? > > > > A C B E F D > > +wwwwwwwwww+DDDDDDDDDDD+hhhhh+rrrrrrr+dddddd+ > > EOF EOF > > (on disk) (in memory) > > It could, but that then requires every place that may call > xfs_zero_range() to be aware of this need to trim EOF blocks to do > the right thing in all cases. We don't want to remove speculative > delalloc in the write() path nor in the truncate(up) case, and so it > doesn't fix the general problem of zeroing specualtive delalloc > beyond EOF requiring writeback to push page caceh pages to disk > before the inode size has been updated. > > The general solution is to have zeroing of speculative prealloc > extents beyond EOF simply convert the range to unwritten and then > invalidate any cached pages over that range. At this point, we are > guaranteed to have zeroes across that range, all without needing to > do any IO at all... That (separate iomap ops that do the delalloc -> unwritten allocation) seems a lot more straightforward to me than whacking preallocations. --D > -Dave. > -- > Dave Chinner > david@fromorbit.com >