From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6D53AC47DD9 for ; Wed, 28 Feb 2024 22:25:37 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 9C7F06B009B; Wed, 28 Feb 2024 17:25:36 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 978F06B009C; Wed, 28 Feb 2024 17:25:36 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 7F2E26B009D; Wed, 28 Feb 2024 17:25:36 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 695466B009B for ; Wed, 28 Feb 2024 17:25:36 -0500 (EST) Received: from smtpin22.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay05.hostedemail.com (Postfix) with ESMTP id D438F40F77 for ; Wed, 28 Feb 2024 22:25:35 +0000 (UTC) X-FDA: 81842645430.22.480C0A7 Received: from mail-pl1-f179.google.com (mail-pl1-f179.google.com [209.85.214.179]) by imf23.hostedemail.com (Postfix) with ESMTP id CF99C140004 for ; Wed, 28 Feb 2024 22:25:33 +0000 (UTC) Authentication-Results: imf23.hostedemail.com; dkim=pass header.d=fromorbit-com.20230601.gappssmtp.com header.s=20230601 header.b=bTq6go5c; dmarc=pass (policy=quarantine) header.from=fromorbit.com; spf=pass (imf23.hostedemail.com: domain of david@fromorbit.com designates 209.85.214.179 as permitted sender) smtp.mailfrom=david@fromorbit.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1709159134; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=oOujcwe/BF6OoLMFzWGriAndnvAMZ6SmsxufNb6baEU=; b=3eEWgxM/6JfLNR3j2IZWpBBP26RB1UpNMfZ9M+nduCzUGZWVkTeBERqKFYEfOnC+Q+8Pdd Srt/SwKjRacNb0E18nPQL6wnRE+3ddoZHC9TAD79XvH+98RnkIq6kI/CDcSl8x/DK5Ik0d 1l5QoUrRsVzpu1BO12qKRZg71VamhWY= ARC-Authentication-Results: i=1; imf23.hostedemail.com; dkim=pass header.d=fromorbit-com.20230601.gappssmtp.com header.s=20230601 header.b=bTq6go5c; dmarc=pass (policy=quarantine) header.from=fromorbit.com; spf=pass (imf23.hostedemail.com: domain of david@fromorbit.com designates 209.85.214.179 as permitted sender) smtp.mailfrom=david@fromorbit.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1709159134; a=rsa-sha256; cv=none; b=heEI3iDzLcfW0K+32cbfZORq8hgXsDTb9qn4PVQlPDVW0WN3qxCkr1kEBYywPzGUr6jtMx IrrJ/ImceRJkQLqePxlJ0Is7lwTH7CdR8Xf+HRpztzzcLH8ixatpg2qhhpyqF+SRO6ylXd CADWEKzAhKJUt02QFjAVojHanQjMBRE= Received: by mail-pl1-f179.google.com with SMTP id d9443c01a7336-1dc3b4b9b62so2352485ad.1 for ; Wed, 28 Feb 2024 14:25:33 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fromorbit-com.20230601.gappssmtp.com; s=20230601; t=1709159132; x=1709763932; darn=kvack.org; h=in-reply-to:content-transfer-encoding:content-disposition :mime-version:references:message-id:subject:cc:to:from:date:from:to :cc:subject:date:message-id:reply-to; bh=oOujcwe/BF6OoLMFzWGriAndnvAMZ6SmsxufNb6baEU=; b=bTq6go5cDCPyfaR3CMasHpeDy6+YwcQipbf1zYg5ZfXkZGie7XGaMlQ7+dVVmYL66s 1WDL0087KGb+AKpvWyEN8qu2Otlz6VucuoQkz1H2cWriEEfkCkMWt9sl0jjTl9nMlBRN EalM8x7XLcxi2f2+A9SAIV81AVjGJhI0gZkF0aI6KU7Y1yMIMULXJCyrLPkC5DdrYBma 6Y9P0GyutdfECMumZEU074ivFAnFXfB38JzJP7eJrUnMIRwZ5wcjEqvGGt+AVa99VOA3 1qKOqz74UKm79rZnO82IgbkvPcLpwBpsTAxP2gWrmc27HehKTLcqunf5xgEWPEu/eUv3 0IXA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1709159132; x=1709763932; h=in-reply-to:content-transfer-encoding:content-disposition :mime-version:references:message-id:subject:cc:to:from:date :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=oOujcwe/BF6OoLMFzWGriAndnvAMZ6SmsxufNb6baEU=; b=IGpRvPp4WHEa6MI8eCHgHJSA1o+G6Iw+99ToucFZiN98krKMW3TgDq662n02ZJ+0ha ftpDvwzvzPOne0ezRj7ARatjZx4rwmKcQgLKS/ZnYqkuCyx6QuGyHo4QBuVsQpKiKhF5 jg1bhL8xiUQ9aDGBAPfBzQQgmm4RMWAz2NVvxtEWWLT5BGFSIneF+Z1cWcF9GIFAreSO FIre5loM8tNfyXGseR706DlfB0nTTT8iTP678O0AkLNBh7QxYiZLEXr2RcDveks9NaZe 5/4qlR+CAN0hKkuBlrMD+j83k9YSfpGYEq0Uj2bkKrelfUWV2PNv5w2b/kLef95sXp2z 3pKA== X-Forwarded-Encrypted: i=1; AJvYcCXKT44NWolri7BmRjJuQ9b9jzyhjUR6KSyqd9eB1dpdrJnC81sFeB7tODYWpmtOV3+QCQXv0eU1aXi01wNhwnE+Oas= X-Gm-Message-State: AOJu0Yxc3V4amGsLJ2UrfAzplt2otMwXUcKO/QTA4yelD0r938yQ8pRm cy8alVvL0VuYA605UaS31LjC75Rik45G8DgLmz5JI9DPXbvILVzb7Gu08bVi1us= X-Google-Smtp-Source: AGHT+IHqQy0+vejFYNP5aALqSGIqnuDR7yU2CVlkZ7BMVbeUnEFC6Oq62NF/rLHwF+0GS7yyR0VfVQ== X-Received: by 2002:a17:902:c403:b0:1dc:6b26:d1cd with SMTP id k3-20020a170902c40300b001dc6b26d1cdmr153477plk.2.1709159132460; Wed, 28 Feb 2024 14:25:32 -0800 (PST) Received: from dread.disaster.area (pa49-181-247-196.pa.nsw.optusnet.com.au. [49.181.247.196]) by smtp.gmail.com with ESMTPSA id jy13-20020a17090342cd00b001dc11f90512sm3765505plb.126.2024.02.28.14.25.31 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 28 Feb 2024 14:25:31 -0800 (PST) Received: from dave by dread.disaster.area with local (Exim 4.96) (envelope-from ) id 1rfSMj-00CqUC-0S; Thu, 29 Feb 2024 09:25:29 +1100 Date: Thu, 29 Feb 2024 09:25:29 +1100 From: Dave Chinner To: Zhang Yi Cc: Christoph Hellwig , djwong@kernel.org, linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org, tytso@mit.edu, adilger.kernel@dilger.ca, jack@suse.cz, ritesh.list@gmail.com, willy@infradead.org, zokeefe@google.com, yi.zhang@huawei.com, chengzhihao1@huawei.com, yukuai3@huawei.com, wangkefeng.wang@huawei.com Subject: Re: [RFC PATCH v3 07/26] iomap: don't increase i_size if it's not a write operation Message-ID: References: <20240127015825.1608160-1-yi.zhang@huaweicloud.com> <20240127015825.1608160-8-yi.zhang@huaweicloud.com> <9b0040ef-3d9d-6246-4bdd-82b9a8f55fa2@huaweicloud.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <9b0040ef-3d9d-6246-4bdd-82b9a8f55fa2@huaweicloud.com> X-Rspam-User: X-Stat-Signature: n9mkgqakedrrz61xu3jqprbd45guqg1r X-Rspamd-Server: rspam07 X-Rspamd-Queue-Id: CF99C140004 X-HE-Tag: 1709159133-196673 X-HE-Meta: U2FsdGVkX188wcRy+9VPR94eDoU2ogoCD8+qscK0cD//sxqrTuXrIDIiKCW/Fb/3HhuM79w1JbqeeQOXhOCZM8aAO8Yh575w6ZJ5hgnaHk5QMDCHSayL6aaCXT/uFKYVzgrxTyb1LPR2dnNjeppCrVzYbaFzL/Q0c1Jiw13L4JP0YgBQT7bGHfli0aVhvpQ39toqeJsEviTf4geLSRauyoKc+BuZCBYK+tbWbcGaPO0hAks20S5MKvAFHdHBxB6lZ7QWh/LYVUHnUd1dWLRvPbOmOB0oXJ5kGXqHfyXV27tglCGoD0g71lLoa+Ay3QbAkYSdUem8CbDAohF/nvIsfpGhy5xwZMyDVJKuguy3qpFtbLWb36VJ2vmKayfYho7cLqcSIM2oOg7UnkqCu0O0M0DXWslSdWBONBBytYv7A5CcpUXjLVWVudvxxyBgZ0g0uYabKKidLe1VbI7A9l+lVGIONZBlMnzIrSuc67hkB7ioLluNUQKmDOLmjhhmylkia1QiL+WJjjqiXNX54Hk8iTRf+KhJchXcw4K0PukWi/+N4/YxOvtSoQ5TCUDA8fMdPETYswoHffZ8MTM8tyAgOT9pbqyhLpIA4YSxhMl8yFFNzonzAtgtzarnU/S7W3gPjM6PeE2dZ9IwQ//2RU/yM3/VHqFJNL1njAfb+GNXb923nFESH9yID+EH1twXEp+y+6Y8nag4ET7qvqu8xYb0pWJaM5XVkVIl13iHwLpm8f9nceY812suVHQmfVGvaITnJP7bPQfAnsh1KgX9mU7KB/9DfS7W13z1QAHanuLrd4kfTt+RL30zyYUv9QARdmq/q4Cf7BnAqsZFB0vAv0W9SYzE4MrDsdQpLQt69mMuXkXAwg5wJjZVr7kTbQx6Xw1peJ4mJM4bpi26kaW5XSuv09rlMMTQuLeIHf2C/8+Ncv83Y8pGZFetBwLMX/HQADh5NqusTjudkP6Kk23SMXX TOSnZV6/ 3oQdbNqf+CJXq/r554klz4gclY8RYzR5WsSs8Z4H/EZS2+O4Jh7BC2s5Dguj0nPCZzAz+vqmt8xVJtRxGbw4LxShW5ukqiVnxoBU0z/Joi4Yk4gF3jH8Lvfe5FB1tRqxwg3l0asLzqhyyJSou88DFjXGOxLdwNRdbnLPiK/rSQO1MUVlZf+hpKK+/q5L4DMqkg2b/B8FDWFJrnoiCSUKZSPagWSXQief3CrrxnPiqu5lxCgvY3jmZ0eKxBsyDS5pLV5FVYErl4bYlXJ4rKAi+cDRkQVBSQagbFXZLKnGCnhXBXgglCMV7ZRE0tbAm2dJOTdAvcg9Q+tntD9O4uSbgUbqzlb9OOiWCpuOGdzbJSU6YvK0mBKhQ/VDFXEclOvdCOJFW+E2CcPdVBvL+rNFwzvrMFvUS1ZFeywYDYCMji1Uxi4/tsAPeAF4zdwFIDloTlDkVGTTBzst85EXnOwrW9PkhTrqM63mjE72KXSAbRrQpUMNeKCJb7X+JA7Zh7dnjyv7v+S8itFqAUTdcGMKobdKbmYlUrmbC3HlPI2zb3GEdTu0c90iyvYnMMJkbdq5rU4EUzTP/hLmMyrjudKGy2FVymg== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Wed, Feb 28, 2024 at 04:53:32PM +0800, Zhang Yi wrote: > On 2024/2/13 13:46, Christoph Hellwig wrote: > > Wouldn't it make more sense to just move the size manipulation to the > > write-only code? An untested version of that is below. With this > > the naming of the status variable becomes even more confusing than > > it already is, maybe we need to do a cleanup of the *_write_end > > calling conventions as it always returns the passed in copied value > > or 0. > > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > > index 3dab060aed6d7b..8401a9ca702fc0 100644 > > --- a/fs/iomap/buffered-io.c > > +++ b/fs/iomap/buffered-io.c > > @@ -876,34 +876,13 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len, > > size_t copied, struct folio *folio) > > { > > const struct iomap *srcmap = iomap_iter_srcmap(iter); > > - loff_t old_size = iter->inode->i_size; > > - size_t ret; > > - > > - if (srcmap->type == IOMAP_INLINE) { > > - ret = iomap_write_end_inline(iter, folio, pos, copied); > > - } else if (srcmap->flags & IOMAP_F_BUFFER_HEAD) { > > - ret = block_write_end(NULL, iter->inode->i_mapping, pos, len, > > - copied, &folio->page, NULL); > > - } else { > > - ret = __iomap_write_end(iter->inode, pos, len, copied, folio); > > - } > > - > > - /* > > - * Update the in-memory inode size after copying the data into the page > > - * cache. It's up to the file system to write the updated size to disk, > > - * preferably after I/O completion so that no stale data is exposed. > > - */ > > - if (pos + ret > old_size) { > > - i_size_write(iter->inode, pos + ret); > > - iter->iomap.flags |= IOMAP_F_SIZE_CHANGED; > > - } > > I've recently discovered that if we don't increase i_size in > iomap_zero_iter(), it would break fstests generic/476 on xfs. xfs > depends on iomap_zero_iter() to increase i_size in some cases. > > generic/476 75s ... _check_xfs_filesystem: filesystem on /dev/pmem2 is inconsistent (r) > (see /home/zhangyi/xfstests-dev/results//xfs/generic/476.full for details) > > _check_xfs_filesystem: filesystem on /dev/pmem2 is inconsistent (r) > *** xfs_repair -n output *** > Phase 1 - find and verify superblock... > Phase 2 - using internal log > - zero log... > - scan filesystem freespace and inode maps... > sb_fdblocks 10916, counted 10923 > - found root inode chunk > ... > > After debugging and analysis, I found the root cause of the problem is > related to the pre-allocations of xfs. xfs pre-allocates some blocks to > reduce fragmentation during buffer append writing, then if we write new > data or do file copy(reflink) after the end of the pre-allocating range, > xfs would zero-out and write back the pre-allocate space(e.g. > xfs_file_write_checks() -> xfs_zero_range()), so we have to update > i_size before writing back in iomap_zero_iter(), otherwise, it will > result in stale delayed extent. Ok, so this is long because the example is lacking in clear details so to try to understand it I've laid it out in detail to make sure I've understood it correctly. > > For more details, let's think about this case, > 1. Buffered write from range [A, B) of an empty file foo, and > xfs_buffered_write_iomap_begin() prealloc blocks for it, then create > a delayed extent from [A, D). So we have a delayed allocation extent and the file size is now B like so: A B D +DDDDDDDDDDDDDDDDDDDDDD+dddddddddddddddddddd+ EOF (in memory) where 'd' is a delalloc block with no data and 'D' is a delalloc block with dirty folios over it. > 2. Write back process map blocks but only convert above delayed extent > from [A, C) since the lack of a contiguous physical blocks, now we > have a left over delayed extent from [C, D), and the file size is B. So this produces: A C B D +wwwwwwwwww+DDDDDDDDDDD+dddddddddddddddddddd+ EOF EOF (on disk) (in memory) where 'w' contains allocated written data blocks. > 3. Copy range from another file to range [E, F), then > xfs_reflink_zero_posteof() would zero-out post eof range [B, E), it > writes zero, dirty and write back [C, E). I'm going to assume that [E,F) is located like this because you are talking about post-eof zeroing from B to E: A C B E F D +wwwwwwwwww+DDDDDDDDDDD+ddddd+rrrrrrr+dddddd+ EOF EOF (on disk) (in memory) where 'r' is the clone destination over dellaloc blocks. Did I get that right? And so reflink wants to zero [B,E] before it updates the file size, just like a truncate(E) would. iomap_zero_iter() will see a delalloc extent (IOMAP_DELALLOC) for [B,E], so it will write zeros into cache for it. We then have: A C B E F D +wwwwwwwwww+DDDDDDDDDDD+ZZZZZ+rrrrrrr+dddddd+ EOF EOF (on disk) (in memory) where 'Z' is delalloc blocks with zeroes in cache. Because the destination is post EOF, xfs_reflink_remap_prep() then does: /* * If pos_out > EOF, we may have dirtied blocks between EOF and * pos_out. In that case, we need to extend the flush and unmap to cover * from EOF to the end of the copy length. */ if (pos_out > XFS_ISIZE(dest)) { loff_t flen = *len + (pos_out - XFS_ISIZE(dest)); ret = xfs_flush_unmap_range(dest, XFS_ISIZE(dest), flen); } .... Which attempts to flush from the current in memory EOF up to the end of the clone destination range. This should result in: A C B E F D +wwwwwwwwww+DDDDDDDDDDD+zzzzz+rrrrrrr+dddddd+ EOF EOF (on disk) (in memory) Where 'z' is zeroes on disk. Have I understood this correctly? However, if this did actually write zeroes to disk, this would end up with: A C B E F D +wwwwwwwwww+DDDDDDDDDDD+zzzzz+rrrrrrr+dddddd+ EOF EOF (in memory) (on disk) Which is wrong - the file extension and zeros should not get exposed to the user until the entire reflink completes. This would expose zeros at the EOF and a file size that the user never asked for after a crash. Experience tells me that they would report this as "filesystem corrupting data on crash". If we move where i_size gets updated by iomap_zero_iter(), we get: A C B E F D +wwwwwwwwww+DDDDDDDDDDD+zzzzz+rrrrrrr+dddddd+ EOF (in memory) (on disk) Which is also wrong, because now the user can see the size change and read zeros in the middle of the clone operation, which is also wrong. IOWs, we do not want to move the in-memory or on-disk EOF as a result of zeroing delalloc extents beyond EOF as it opens up transient, non-atomic on-disk states in the event of a crash. So, catch-22: we need to move the in-memory EOF to write back zeroes beyond EOF, but that would move the on-disk EOF to E before the clone operation starts. i.e. it makes clone non-atomic. What should acutally result from the iomap_zero_range() call from xfs_reflink_remap_prep() is a state like this: A C B E F D +wwwwwwwwww+DDDDDDDDDDD+uuuuu+rrrrrrr+dddddd+ EOF EOF (on disk) (in memory) where 'u' are unwritten extent blocks. i.e. instead of writing zeroes through the page cache for IOMAP_DELALLOC ranges beyond EOF, we should be converting those ranges to unwritten and invalidating any cached data over that range beyond EOF. IOWs, it looks to me like the problem is that xfs_buffered_write_iomap_begin() is doing the wrong thing for IOMAP_ZERO operations for post-EOF regions spanned by speculative delalloc. It should be converting the region to unwritten so it has zeroes on disk, not relying on the page cache to be able to do writeback beyond the current EOF.... > 4. Since we don't update i_size in iomap_zero_iter(),the writeback > doesn't write anything back, also doesn't convert the delayed extent. > After copy range, the file size will update to F. Yup, this is all, individually, correct behaviour. But when put together, the wrong thing happens. I suspect xfs_zero_range() needs to provide a custom set of iomap_begin/end callbacks rather than overloading the normal buffered write mechanisms. -Dave. -- Dave Chinner david@fromorbit.com