From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 351B0C54798
	for <linux-mm@archiver.kernel.org>; Thu, 29 Feb 2024 23:19:40 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 8F74D6B0089; Thu, 29 Feb 2024 18:19:39 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 8A7A26B008A; Thu, 29 Feb 2024 18:19:39 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 76EA96B0092; Thu, 29 Feb 2024 18:19:39 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15])
	by kanga.kvack.org (Postfix) with ESMTP id 66F696B0089
	for <linux-mm@kvack.org>; Thu, 29 Feb 2024 18:19:39 -0500 (EST)
Received: from smtpin06.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay09.hostedemail.com (Postfix) with ESMTP id 03B7280639
	for <linux-mm@kvack.org>; Thu, 29 Feb 2024 23:19:38 +0000 (UTC)
X-FDA: 81846410478.06.7A719E9
Received: from mail-pg1-f177.google.com (mail-pg1-f177.google.com [209.85.215.177])
	by imf17.hostedemail.com (Postfix) with ESMTP id 0B08740020
	for <linux-mm@kvack.org>; Thu, 29 Feb 2024 23:19:35 +0000 (UTC)
Authentication-Results: imf17.hostedemail.com;
	dkim=pass header.d=fromorbit-com.20230601.gappssmtp.com header.s=20230601 header.b=liAFEDq1;
	spf=pass (imf17.hostedemail.com: domain of david@fromorbit.com designates 209.85.215.177 as permitted sender) smtp.mailfrom=david@fromorbit.com;
	dmarc=pass (policy=quarantine) header.from=fromorbit.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1709248776;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=LzB2DcfYOR7VovBPih0jTdeK49M04fl9RE9G+2FG9Wo=;
	b=LbyF1wPLieBl1/1B6iCIXCpL84oQeg7TXyy2A6KqTBimuGRsb0GjBYpWPChJOniZE5UhC6
	fRM9fE1eIX4jS7ewWGwNep1c3qsFt7zmNVwIT3tvxj1lwLWtDmGFF41LesMj2EitCOPWDL
	XAs9luQAn3lHgwAOat6aap0belLBIYk=
ARC-Authentication-Results: i=1;
	imf17.hostedemail.com;
	dkim=pass header.d=fromorbit-com.20230601.gappssmtp.com header.s=20230601 header.b=liAFEDq1;
	spf=pass (imf17.hostedemail.com: domain of david@fromorbit.com designates 209.85.215.177 as permitted sender) smtp.mailfrom=david@fromorbit.com;
	dmarc=pass (policy=quarantine) header.from=fromorbit.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1709248776; a=rsa-sha256;
	cv=none;
	b=Cm5g+ygY80ktPdWW5NxVOtQips7fHnkq+nCj4GB22uDr66u7LwPkLj9c3uE1qOMUMHYwK9
	n72pEFmBBzes5J/EHgmTZ2cs0Ka7ZRkjfaoqUzKX6sS63XB64uNRokGy8auKRHNBcIdxIc
	ivgSDY+aN0oz9JACH8mNViFd1cDKxNk=
Received: by mail-pg1-f177.google.com with SMTP id 41be03b00d2f7-5bdbe2de25fso1169977a12.3
        for <linux-mm@kvack.org>; Thu, 29 Feb 2024 15:19:35 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=fromorbit-com.20230601.gappssmtp.com; s=20230601; t=1709248775; x=1709853575; darn=kvack.org;
        h=in-reply-to:content-transfer-encoding:content-disposition
         :mime-version:references:message-id:subject:cc:to:from:date:from:to
         :cc:subject:date:message-id:reply-to;
        bh=LzB2DcfYOR7VovBPih0jTdeK49M04fl9RE9G+2FG9Wo=;
        b=liAFEDq14io5QNkeYcrqmLDZxvboR+t6PGckqY3XYOYcA87owlUrvCW3NJ8Bs9US15
         Ycve4dDYVwHVIW1xGca+Ch2dduY/M0awVr0YvtIv7HW6GeKP276+ymnbw/hd4bgkvsup
         Y5VRiYS8xKC7ouc1RVQekYt4miOMeGhI2jbqiw7n0Khxmkaf06w2CkmB+2tCULNm8h3V
         pMXvC9j2HwGcFck0Lb/CHai21+fQaRFbHT+Gswy6NSoZgMF5T0ZESDGVIZVmLZgj9OmM
         QDoQpM3zadYVSOsrS54HSOfdQa1YKVoB8aMjYQdAR12UqJPrZ+iTcsmQo9uf+YboDfRr
         HAlw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1709248775; x=1709853575;
        h=in-reply-to:content-transfer-encoding:content-disposition
         :mime-version:references:message-id:subject:cc:to:from:date
         :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to;
        bh=LzB2DcfYOR7VovBPih0jTdeK49M04fl9RE9G+2FG9Wo=;
        b=c04QFZWJzqs0ERh8XG2RoJto3W7XcdXYwy8i2N747OgcYV+88sCKbwc6fR0Zhk1+Ou
         xZOzJEm5ONSOG9HcI9Kek+zTnNreHiwbAGmnpMGZWDVXdm1vPlpoQm1apVAgs2T84QJo
         QEMKSCVhFmLdIQLvyZtyUQWpC4Nt5EngQoc3G7DsHDFvzKFQN1YWes8cO3TVOKce2byw
         +2itSx4aeDZEZsWEZ9pUh7252MV6/2nDjTN2wapXvzr+PM3shVCPWKlqMZZ6FY+glwHS
         /4OE0pnQ+skfjKXGR9nPdQZdzifeuosovhbHsAMktnYNjsVCtxBuiDp4Hp4coU9J5QyZ
         BDYQ==
X-Forwarded-Encrypted: i=1; AJvYcCWSVYoQS1fIW7h0w+uNDDEd+Jk/hfE5n8Xl4/6Gn/RFFLTQ/DP7rUQp/KbIAgw+2XNVDW854I+7SB90+NLZI+oPd2U=
X-Gm-Message-State: AOJu0YyeH+aM8VV3l7YsUxloh6jCIPjCGEywYxw4+S8ahesm7ASMaHKv
	sGxn0t6R5lcgVf4O5ub9TOSG/RekxVsXt/vFr4k+aPjnqQwq3X24mS8U4Oqd/2s=
X-Google-Smtp-Source: AGHT+IG84xb6+RPgA/S6whpydeQjGZWhjkSRY0cJM2NPzuyxJYlQl4nrAkOGJffUmhItKUU/uGO2fQ==
X-Received: by 2002:a17:90a:3004:b0:299:6389:2961 with SMTP id g4-20020a17090a300400b0029963892961mr88532pjb.13.1709248774686;
        Thu, 29 Feb 2024 15:19:34 -0800 (PST)
Received: from dread.disaster.area (pa49-181-247-196.pa.nsw.optusnet.com.au. [49.181.247.196])
        by smtp.gmail.com with ESMTPSA id x6-20020a170902a38600b001db86c48221sm2096693pla.22.2024.02.29.15.19.33
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Thu, 29 Feb 2024 15:19:34 -0800 (PST)
Received: from dave by dread.disaster.area with local (Exim 4.96)
	(envelope-from <david@fromorbit.com>)
	id 1rfpgY-00DJED-2Z;
	Fri, 01 Mar 2024 10:19:30 +1100
Date: Fri, 1 Mar 2024 10:19:30 +1100
From: Dave Chinner <david@fromorbit.com>
To: Zhang Yi <yi.zhang@huaweicloud.com>
Cc: Christoph Hellwig <hch@infradead.org>, djwong@kernel.org,
	linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-xfs@vger.kernel.org, tytso@mit.edu, adilger.kernel@dilger.ca,
	jack@suse.cz, ritesh.list@gmail.com, willy@infradead.org,
	zokeefe@google.com, yi.zhang@huawei.com, chengzhihao1@huawei.com,
	yukuai3@huawei.com, wangkefeng.wang@huawei.com
Subject: Re: [RFC PATCH v3 07/26] iomap: don't increase i_size if it's not a
 write operation
Message-ID: <ZeERAob9Imwh01bG@dread.disaster.area>
References: <20240127015825.1608160-1-yi.zhang@huaweicloud.com>
 <20240127015825.1608160-8-yi.zhang@huaweicloud.com>
 <ZcsCP4h-ExNOcdD6@infradead.org>
 <9b0040ef-3d9d-6246-4bdd-82b9a8f55fa2@huaweicloud.com>
 <Zd+y2VP8HpbkDu41@dread.disaster.area>
 <45c1607a-805d-e7a2-a5ca-3fd7e507a664@huaweicloud.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <45c1607a-805d-e7a2-a5ca-3fd7e507a664@huaweicloud.com>
X-Rspamd-Queue-Id: 0B08740020
X-Rspam-User: 
X-Stat-Signature: co4dwkojknn6qyueqopro1igpnz1usp1
X-Rspamd-Server: rspam01
X-HE-Tag: 1709248775-505949
X-HE-Meta: U2FsdGVkX18xVXzAcEU8FUGjCF0lM6/50HZfUZvjPLVvmNTi7+9a865I9Omx+rsvIp3ZLlbGAK5fPVe9WOyQ5HLp/eR7bes8OdcFPhAs+m0p8YhoNRVTD5bN5Iu5h07OEJlTcdlGhnEVHjpjRU5C2JZMeRhkc+iF4J5usj13ye/5TAwVdKVMp23G/JwnK/wxhKg6FVk71LbKCxKzcvOOuvlsGsOEQ2OxT8aWclTLxTxMmTaItwvVLKE2BvMT7MQ2f7GHEjdEVS8Bqmfu/IvNx+5Cdzy8SvLgH787F1ffuMkI1lC9uVFLa3KVAGht/tXA+XotKfGRFayRYMljQVyNR6GF5oRF8gJ/MuxrWz5oArDF+tDYbstdzh5aGH5UUggh1hU81o9rYmbT7VTyeEc6CJvEKaSfNgfUTu1qqVxU0hDa60ljyH3CYyx+IW5frATO/CWBWOzEmYO+kZiuh3XrB7h66+SACiH0bkuKSoCAo/j3OIDtyAFGQDjm8E7XLWn6xUYrUI8hI9KCQMhH1Z0kJz6FJEts842MpEFFGvE54sB3PkzUEe+4IIh5toEKKWDfCJBpbW7uWcV9yDsmuxMHez8NQBIvIAwQVeo5BPS/o5CTB/IhRVxERGSXspz5LU7iqEfehTGOphRr0IymlmAcr430C4Rr2bdFhWBryEiWcTNzCeTB0Ykfx//yO5YpJuFZueUGBYFmiQUWTVoOM0egURbTwSyA6YqUBZwGs92z7IRFN73EuCiaAszckV7I3RFncGp5cm/ni9//UvWma9l6A0KYXcZ+Xs4rM7bm25uhAYJVhvO1gFGiM9ImI5ZkEGEYAg5u6ZH/+F8Byl1eqJAR8uCYZxZhIRjCM+uMTPt3FA/7oIvmJYNBxEGmpbWUHdsHcOCGf9tkkKC9Lz8JkxfsW6ru7olplVk/R/dO6R9AXiRVL0nvItVgUOw8DPXv9uELyRtxLZpYENPws0R9wxG
 5nwoCf4T
 dGn2OPcLi10lf44Tldol2M/kfECf99U57L1XMPjgKZqxVVD+KZ1dx+MTKNYqBhCoKdgeDae+GAlDGh5+43rp9/SVWSE7Jl4U2vYwttWJoS7XAhEceDRfetvsG+R3xWs2v6eP6WwZ4xcFlt2q/wIferQR+SQIeUFiSUVClpWaeC/gqBonsQt3HOVdSZEmSRZ9kgSWtOJGCQsvbi3MPgjyh5hiSsupTBCzIaxVKlKY/M6ZJkRw=
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Thu, Feb 29, 2024 at 04:59:34PM +0800, Zhang Yi wrote:
> Hello, Dave!
> 
> On 2024/2/29 6:25, Dave Chinner wrote:
> > On Wed, Feb 28, 2024 at 04:53:32PM +0800, Zhang Yi wrote:
> >> On 2024/2/13 13:46, Christoph Hellwig wrote:
> >>> Wouldn't it make more sense to just move the size manipulation to the
> >>> write-only code?  An untested version of that is below.  With this
> >>> the naming of the status variable becomes even more confusing than
> >>> it already is, maybe we need to do a cleanup of the *_write_end
> >>> calling conventions as it always returns the passed in copied value
> >>> or 0.
> >>>
> >>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> >>> index 3dab060aed6d7b..8401a9ca702fc0 100644
> >>> --- a/fs/iomap/buffered-io.c
> >>> +++ b/fs/iomap/buffered-io.c
> >>> @@ -876,34 +876,13 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
> >>>  		size_t copied, struct folio *folio)
> >>>  {
> >>>  	const struct iomap *srcmap = iomap_iter_srcmap(iter);
> >>> -	loff_t old_size = iter->inode->i_size;
> >>> -	size_t ret;
> >>> -
> >>> -	if (srcmap->type == IOMAP_INLINE) {
> >>> -		ret = iomap_write_end_inline(iter, folio, pos, copied);
> >>> -	} else if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
> >>> -		ret = block_write_end(NULL, iter->inode->i_mapping, pos, len,
> >>> -				copied, &folio->page, NULL);
> >>> -	} else {
> >>> -		ret = __iomap_write_end(iter->inode, pos, len, copied, folio);
> >>> -	}
> >>> -
> >>> -	/*
> >>> -	 * Update the in-memory inode size after copying the data into the page
> >>> -	 * cache.  It's up to the file system to write the updated size to disk,
> >>> -	 * preferably after I/O completion so that no stale data is exposed.
> >>> -	 */
> >>> -	if (pos + ret > old_size) {
> >>> -		i_size_write(iter->inode, pos + ret);
> >>> -		iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
> >>> -	}
> >>
> >> I've recently discovered that if we don't increase i_size in
> >> iomap_zero_iter(), it would break fstests generic/476 on xfs. xfs
> >> depends on iomap_zero_iter() to increase i_size in some cases.
> >>
> >>  generic/476 75s ... _check_xfs_filesystem: filesystem on /dev/pmem2 is inconsistent (r)
> >>  (see /home/zhangyi/xfstests-dev/results//xfs/generic/476.full for details)
> >>
> >>  _check_xfs_filesystem: filesystem on /dev/pmem2 is inconsistent (r)
> >>  *** xfs_repair -n output ***
> >>  Phase 1 - find and verify superblock...
> >>  Phase 2 - using internal log
> >>          - zero log...
> >>          - scan filesystem freespace and inode maps...
> >>  sb_fdblocks 10916, counted 10923
> >>          - found root inode chunk
> >>  ...
> >>
> >> After debugging and analysis, I found the root cause of the problem is
> >> related to the pre-allocations of xfs. xfs pre-allocates some blocks to
> >> reduce fragmentation during buffer append writing, then if we write new
> >> data or do file copy(reflink) after the end of the pre-allocating range,
> >> xfs would zero-out and write back the pre-allocate space(e.g.
> >> xfs_file_write_checks() -> xfs_zero_range()), so we have to update
> >> i_size before writing back in iomap_zero_iter(), otherwise, it will
> >> result in stale delayed extent.
> > 
> > Ok, so this is long because the example is lacking in clear details
> > so to try to understand it I've laid it out in detail to make sure
> > I've understood it correctly.
> > 
> 
> Thanks for the graph, the added detail makes things clear and easy to
> understand. To be honest, it's not exactly the same as the results I
> captured and described (the position A\B\C\D\E\F I described is
> increased one by one), but the root cause of the problem is the same,
> so it doesn't affect our understanding of the problem.

OK, good :)

.....

> > However, if this did actually write zeroes to disk, this would end
> > up with:
> > 
> > 	A          C           B     E       F      D
> > 	+wwwwwwwwww+DDDDDDDDDDD+zzzzz+rrrrrrr+dddddd+
> > 	                      EOF   EOF
> >                       (in memory)   (on disk)
> > 
> > Which is wrong - the file extension and zeros should not get exposed
> > to the user until the entire reflink completes. This would expose
> > zeros at the EOF and a file size that the user never asked for after
> > a crash. Experience tells me that they would report this as
> > "filesystem corrupting data on crash".
> > 
> > If we move where i_size gets updated by iomap_zero_iter(), we get:
> > 
> > 	A          C           B     E       F      D
> > 	+wwwwwwwwww+DDDDDDDDDDD+zzzzz+rrrrrrr+dddddd+
> > 	                            EOF
> >                                 (in memory)
> > 		                 (on disk)
> > 
> > Which is also wrong, because now the user can see the size change
> > and read zeros in the middle of the clone operation, which is also
> > wrong.
> > 
> > IOWs, we do not want to move the in-memory or on-disk EOF as a
> > result of zeroing delalloc extents beyond EOF as it opens up
> > transient, non-atomic on-disk states in the event of a crash.
> > 
> > So, catch-22: we need to move the in-memory EOF to write back zeroes
> > beyond EOF, but that would move the on-disk EOF to E before the
> > clone operation starts. i.e. it makes clone non-atomic.
> 
> Make sense. IIUC, I also notice that xfs_file_write_checks() zero
> out EOF blocks if the later write offset is beyond the size of the
> file.  Think about if we replace the reflink operation to a buffer
> write E to F, although it doesn't call xfs_flush_unmap_range()
> directly, but if it could be raced by another background write
> back, and trigger the same problem (I've not try to reproduce it,
> so please correct me if I understand wrong).

Correct, but the write is about to extend the file size when it
writes into the cache beyond the zeroed region. There is no cache
invalidate possible in this path, so the write of data moves the
in-memory EOF past the zeroes in cache and everything works just
fine.

If it races with concurrent background writeback, the writeback will
skip the zeroed range beyond EOF until they are exposed by the first
data write beyond the zeroed post-eof region which moves the
in-memory EOF.

truncate(to a larger size) also does this same zeroing - the page
cache is zeroed before we move the EOF in memory, and so the
writeback will only occur once the in-memory EOF is moved. i.e. it
effectively does:

	xfs_zero_range(oldsize to newsize)
	truncate_setsize(newsize)
	filemap_write_and_wait_range(old size to new size)

> > What should acutally result from the iomap_zero_range() call from
> > xfs_reflink_remap_prep() is a state like this:
> > 
> > 	A          C           B     E       F      D
> > 	+wwwwwwwwww+DDDDDDDDDDD+uuuuu+rrrrrrr+dddddd+
> > 	          EOF         EOF
> >                (on disk)  (in memory)
> > 
> > where 'u' are unwritten extent blocks.
> > 
> 
> Yeah, this is a good solution.
> 
> In xfs_file_write_checks(), I don't fully understand why we need
> the xfs_zero_range().

The EOF block may only be partially written. Hence on extension, we
have to guarantee the part of that block beyond the current EOF is
zero if the write leaves a hole between the current EOF and the
start of the new extending write.

> Theoretically, iomap have already handled
> partial block zeroing for both buffered IO and DIO, so I guess
> the only reason we still need it is to handle pre-allocated blocks
> (no?).

Historically speaking, Linux is able to leak data beyond EOF on
writeback of partial EOF blocks (e.g. mmap() can write to the EOF
page beyond EOF without failing. We try to mitigate these cases
where we can, but we have to consider that at any time the data in
the cache beyond EOF can be non-zero thanks to mmap() and so any
file extension *must* zero any region beyond EOF cached in the page
cache.

> If so，would it be better to call xfs_free_eofblocks() to
> release all the preallocated extents in range? If not, maybe we
> could only zero out mapped partial blocks and also release
> preallocated extents?

No, that will cause all sorts of other performance problems,
especially for reflinked files that triggering COW
operations...

> 
> In xfs_reflink_remap_prep(), I read the commit 410fdc72b05a ("xfs:
> zero posteof blocks when cloning above eof"), xfs used to release
> preallocations, the change log said it didn't work because of the
> PREALLOC flag, but the 'force' parameter is 'true' when calling
> xfs_can_free_eofblocks(), so I don't get the problem met. Could we
> fall back to use xfs_free_eofblocks() and make a state like this?
> 
>  	A          C           B     E       F      D
>  	+wwwwwwwwww+DDDDDDDDDDD+hhhhh+rrrrrrr+dddddd+
>  	          EOF         EOF
>                 (on disk)  (in memory)

It could, but that then requires every place that may call
xfs_zero_range() to be aware of this need to trim EOF blocks to do
the right thing in all cases. We don't want to remove speculative
delalloc in the write() path nor in the truncate(up) case, and so it
doesn't fix the general problem of zeroing specualtive delalloc
beyond EOF requiring writeback to push page caceh pages to disk
before the inode size has been updated.

The general solution is to have zeroing of speculative prealloc
extents beyond EOF simply convert the range to unwritten and then
invalidate any cached pages over that range. At this point, we are
guaranteed to have zeroes across that range, all without needing to
do any IO at all...

-Dave.
-- 
Dave Chinner
david@fromorbit.com