Date: Mon, 14 Jul 2025 10:38:57 -0400
From: Brian Foster
To: Baolin Wang
Cc: Hugh Dickins, linux-mm@kvack.org, Matthew Wilcox, Usama Arif
Subject: Re: [PATCH] tmpfs: zero post-eof folio range on file extension
In-Reply-To: <18c5d84b-1449-411e-8cd7-ee8c6af37677@linux.alibaba.com>
References: <20250625184930.269727-1-bfoster@redhat.com>
 <297e44e9-1b58-d7c4-192c-9408204ab1e3@google.com>
 <67f0461b-3359-41e7-a7cd-b059cbef4154@linux.alibaba.com>
 <097c0b07-1f43-51c3-3591-aaa2015226c2@google.com>
 <0224ed0f-d207-4c79-8c9d-f4915a91c11d@linux.alibaba.com>
 <18c5d84b-1449-411e-8cd7-ee8c6af37677@linux.alibaba.com>

On Mon, Jul 14, 2025 at 11:05:35AM +0800, Baolin Wang wrote:
> 
> 
> On 2025/7/12 04:15, Brian Foster wrote:
> > On Fri, Jul 11, 2025 at 12:08:16PM -0400, Brian Foster wrote:
> > > On Fri, Jul 11, 2025 at 11:50:05AM +0800, Baolin Wang wrote:
> > > > 
> > > > 
> > > > On 2025/7/11 06:20, Hugh Dickins wrote:
> > > > > On Thu, 10 Jul 2025, Baolin Wang wrote:
> > > > > > On 2025/7/9 15:57, Hugh Dickins wrote:
> > > > > ...
> > > > > > > 
> > > > > > > The problem is with huge pages (or large folios) in shmem_writeout():
> > > > > > > what goes in as a large folio may there have to be split into small
> > > > > > > pages; or it may be swapped out as one large folio, but fragmentation
> > > > > > > at swapin time demand that it be split into small pages when swapped in.
> > > > > > 
> > > > > > Good point.
> > > > > > 
> > > > > > > So, if there has been swapout since the large folio was modified beyond
> > > > > > > EOF, the folio that shmem_zero_eof() brings in does not guarantee what
> > > > > > > length needs to be zeroed.
> > > > > > > 
> > > > > > > We could set that aside as a deficiency to be fixed later on: that
> > > > > > > would not be unreasonable, but I'm guessing that won't satisfy you.
> > > > > > > 
> > > > > > > We could zero the maximum (the remainder of PMD size I believe) in
> > > > > > > shmem_zero_eof(): looping over small folios within the range, skipping
> > > > > > > !uptodate ones (but we do force them uptodate when swapping out, in
> > > > > > > order to keep the space reservation). TBH I've ignored that as a bad
> > > > > > > option, but it doesn't seem so bad to me now: ugly, but maybe not bad.
> > > > > > 
> > > > > > However, IIUC, if the large folios are split in shmem_writeout(), and those
> > > > > > small folios which beyond EOF will be dropped and freed in
> > > > > > __split_unmapped_folio(), should we still consider them?
> > > > > 
> > > > > You're absolutely right about the normal case, and thank you for making
> > > > > that point. Had I forgotten that when writing? Or was I already
> > > > > jumping ahead to the problem case? I don't recall, but was certainly
> > > > > wrong for not mentioning it.
> > > > > 
> > > > > The abnormal case is when there's a "fallocend" beyond i_size (or beyond
> > > > > the small page extent spanning i_size) i.e. fallocate() has promised to
> > > > > keep pages allocated beyond EOF. In that case, __split_unmapped_folio()
> > > > > is keeping those pages.
> > > > 
> > > > Ah, yes, you are right.
> > > > 
> > > > > There could well be some optimization, involving fallocend, to avoid
> > > > > zeroing more than necessary; but I wouldn't want to say what in a hurry,
> > > > > it's quite confusing!
> > > > 
> > > > Like you said, not only can a large folio split occur during swapout, but it
> > > > can also happen during a punch hole operation. Moreover, considering the
> > > > abnormal case of fallocate() you mentioned, we should find a more common
> > > > approach to mitigate the impact of fallocate().
> > > > 
> > > > For instance, when splitting, we could clear the 'uptodate' flag for these
> > > > EOF small folios that are beyond 'i_size' but less than the 'fallocend', so
> > > > that these EOF small folios will be re-initialized if they are used again.
> > > > What do you think?
> > > > 
> > > ...
> > > 
> > > Hi Baolin,
> > > 
> > > So I'm still digesting Hugh's clarification wrt the falloc case, but I'm
> > > a little curious here given that I intended to implement the writeout
> > > zeroing suggestion regardless of that discussion..
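
To make Hugh's "zero the maximum" option quoted above a bit more concrete:
looping over the small folios between EOF and the end of the PMD-sized
extent might look roughly like the sketch below. The helper name, the
HPAGE_PMD_SIZE bound and the locking are illustrative assumptions only,
not the posted patch:

	/* Illustrative sketch only: name, bound and locking are assumptions. */
	static void shmem_zero_post_eof(struct inode *inode)
	{
		struct address_space *mapping = inode->i_mapping;
		loff_t i_size = i_size_read(inode);
		pgoff_t index = i_size >> PAGE_SHIFT;
		pgoff_t end = round_up(i_size, HPAGE_PMD_SIZE) >> PAGE_SHIFT;

		while (index < end) {
			struct folio *folio = filemap_get_folio(mapping, index);

			if (IS_ERR(folio)) {
				index++;
				continue;
			}
			folio_lock(folio);
			/* skip !uptodate folios; they are zeroed when first used */
			if (folio_test_uptodate(folio)) {
				size_t from = 0;

				/* the folio straddling EOF only needs its tail zeroed */
				if (folio_pos(folio) < i_size)
					from = offset_in_folio(folio, i_size);
				folio_zero_segment(folio, from, folio_size(folio));
			}
			index = folio_next_index(folio);
			folio_unlock(folio);
			folio_put(folio);
		}
	}
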
> > > 
> > > I see the hole punch case falls into truncate_inode_[partial_]folio(),
> > > which looks to me like it handles zeroing. The full truncate case just
> > > tosses the folio of course, but the partial case zeroes according to the
> > > target range prior to doing any potential split from that codepath.
> > > 
> > > That looks kind of similar to what I have prototyped for the
> > > shmem_writeout() case: tail zero the EOF straddling folio before falling
> > > into the split call. [1] Does that not solve the same general issue in
> > > the swapout path as potentially clearing uptodate via the split? I'm
> > > mainly trying to understand if that is just a potential alternative
> > > approach, or if this solves a corner case that I'm missing. Hm?
> > > 
> > 
> > Ok, after playing around a bit I think I see what I was missing. I
> > misinterpreted that the punch case is only going to zero in the target
> > range of the punch. So if you have something like a 1M file backed by an
> > fallocated 2M folio, map write the whole 2M, then punch the last 4k of
> > the file, you end up with the non-zeroed smaller folios beyond EOF. This
> > means that even with a zero of the eof folio, a truncate up over those
> > folios won't be zeroed.
> 
> Right.
> 
> > I need to think on it some more, but I take it this means that
> > essentially 1. any uptodate range/folio beyond EOF needs to be zeroed on
> > swapout (which I think is analogous to your earlier prototype logic) [1]
> > and 2. shmem_zero_eof() needs to turn into something like
> > shmem_zero_range().
> 
> Like we discussed, only considering swapout is not enough; it's necessary to
> consider all cases of large folio splits, such as swapout, punch hole,
> migration, shmem shrinker, etc. In the future, if there are other cases of
> splits, the impact on EOF folios will also need to be considered (should
> zero them before split). IMHO, this could lead to complexity and
> uncontrollability.
> 

Ok. FWIW, the purpose of the swap time zeroing in this case is not
necessarily to be a solution purely on its own. Rather (and to Hugh's
earlier point about the zeroing needing to cover a range vs. just relying
on the eof folio size), it's probably more ideal if that eof zeroing code
can assume post-eof swapped out folios are always zeroed.

But anyways, I'll try to shoot for something like that for a v2. I also
want to see if I can figure a way for a bit more thorough testing. We can
revisit from there if there are better options and/or further gaps to
consider.

Thanks again for the comments.

Brian

> So my suggestion is to address this issue during the split process, and it
> seems feasible to make EOF small folios not 'uptodate' during the split.
> Anyway, you can investigate further.
> 
> > The latter would zero a range of uptodate folios between current EOF and
> > the start of the extending operation, rather than just the EOF folio.
> > This is actually pretty consistent with traditional fs (see
> > xfs_file_write_zero_eof() for example) behavior. I was originally
> > operating under assumption that this wasn't necessary for tmpfs given
> > traditional pagecache post-eof behavior, but that has clearly proven
> > false.
> > 
> > Brian
> > 
> > [1] I'm also wondering if another option here is to just clear_uptodate
> > any uptodate folio that fully starts beyond EOF. I.e., if the folio
> > straddles EOF then partial zero as below, if the folio is beyond EOF
> > then clear uptodate and let the existing code further down zero it.
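
For what it's worth, the alternative floated in [1] just above might boil
down to something like the following at swapout time. This is only a
sketch under the same assumptions as the prototype quoted below (the
helper name is made up and i_size is sampled by the caller); a folio that
starts fully beyond EOF simply drops its uptodate state so the existing
!uptodate handling further down in shmem_writeout() re-zeroes it:

	/* Sketch only: hypothetical helper, not the posted patch. */
	static void shmem_writeout_sanitize_eof(struct folio *folio, loff_t i_size)
	{
		if (!folio_test_uptodate(folio))
			return;

		if (folio_pos(folio) >= i_size) {
			/* fully post-EOF: let the !uptodate path re-zero it later */
			folio_clear_uptodate(folio);
		} else if (folio_pos(folio) + folio_size(folio) > i_size) {
			/* straddles EOF: zero only the tail beyond i_size */
			folio_zero_segment(folio, offset_in_folio(folio, i_size),
					   folio_size(folio));
		}
	}
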
> > 
> > > If the former, I suspect we'd need to tail zero on writeout regardless
> > > of folio size. Given that, and IIUC that clearing uptodate as such will
> > > basically cause the split folios to fall back into the !uptodate -> zero
> > > -> mark_uptodate sequence of shmem_writeout(), I wonder what the
> > > advantage of that is. It feels a bit circular to me when considered
> > > along with the tail zeroing below, but again I'm peeling away at
> > > complexity as I go here.. ;) Thoughts?
> > > 
> > > Brian
> > > 
> > > [1] prototype writeout logic:
> > > 
> > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > index 634e499b6197..535021ae5a2f 100644
> > > --- a/mm/shmem.c
> > > +++ b/mm/shmem.c
> > > @@ -1579,7 +1579,8 @@ int shmem_writeout(struct folio *folio, struct writeback_control *wbc)
> > >  	struct inode *inode = mapping->host;
> > >  	struct shmem_inode_info *info = SHMEM_I(inode);
> > >  	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
> > > -	pgoff_t index;
> > > +	loff_t i_size = i_size_read(inode);
> > > +	pgoff_t index = i_size >> PAGE_SHIFT;
> > >  	int nr_pages;
> > >  	bool split = false;
> > > @@ -1592,6 +1593,17 @@ int shmem_writeout(struct folio *folio, struct writeback_control *wbc)
> > >  	if (!total_swap_pages)
> > >  		goto redirty;
> > > +	/*
> > > +	 * If the folio straddles EOF, the tail portion must be zeroed on
> > > +	 * every swapout.
> > > +	 */
> > > +	if (folio_test_uptodate(folio) &&
> > > +	    folio->index <= index && folio_next_index(folio) > index) {
> > > +		size_t from = offset_in_folio(folio, i_size);
> > > +		if (from)
> > > +			folio_zero_segment(folio, from, folio_size(folio));
> > > +	}
> > > +
> > >  	/*
> > >  	 * If CONFIG_THP_SWAP is not enabled, the large folio should be
> > >  	 * split when swapping.
> > > 
> > > 
> > > 
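
And for contrast, Baolin's split-time suggestion (clearing uptodate on the
small folios that end up entirely beyond i_size when a large folio is
split, so they get re-initialized if reused) might reduce to something
like the sketch below. The hook point and helper name are assumptions;
where exactly this would sit relative to __split_unmapped_folio() is part
of what would need investigating:

	/*
	 * Sketch only (hypothetical helper): after a shmem large folio has
	 * been split, drop the uptodate state of any resulting small folio
	 * that lies entirely beyond i_size, so it is re-zeroed if it is ever
	 * used again.
	 */
	static void shmem_split_clear_eof_uptodate(struct folio *folio,
						   struct inode *inode)
	{
		loff_t i_size = i_size_read(inode);

		if (folio_pos(folio) >= i_size && folio_test_uptodate(folio))
			folio_clear_uptodate(folio);
	}
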