From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 42AF4CF34B5 for ; Wed, 19 Nov 2025 14:08:39 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 9DE936B00A8; Wed, 19 Nov 2025 09:08:38 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 98F036B00AA; Wed, 19 Nov 2025 09:08:38 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 8A4C56B00C3; Wed, 19 Nov 2025 09:08:38 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 7A5886B00A8 for ; Wed, 19 Nov 2025 09:08:38 -0500 (EST) Received: from smtpin03.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id B4D82160368 for ; Wed, 19 Nov 2025 14:08:35 +0000 (UTC) X-FDA: 84127536990.03.FCE49F9 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by imf18.hostedemail.com (Postfix) with ESMTP id 713BB1C0015 for ; Wed, 19 Nov 2025 14:08:33 +0000 (UTC) Authentication-Results: imf18.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=Fz2a2sSn; dmarc=pass (policy=quarantine) header.from=redhat.com; spf=pass (imf18.hostedemail.com: domain of bfoster@redhat.com designates 170.10.129.124 as permitted sender) smtp.mailfrom=bfoster@redhat.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1763561313; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=+Yg19b3C0A9RgvXwEm9q22h/mHwyzJvPVT/d/nPBkXI=; b=rVVVjJfymGacl5H6B8ui9plxVMCaWbCT9p0winq0A0IQu9f3rLqQX+sRHAEBD3JKRxepH+ SAxCunl6RSRD2Na4jepOv3WvxbZ4bE/IyA9m8ICvtHoqRI75ruKZe91fZRcRBRhAnTGq+6 1IenXTvPvrTHuABolTfk5ndz68xzZpg= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1763561313; a=rsa-sha256; cv=none; b=umwCuU9yEw584dOyXdlW2sHnNLXKK3BlhZf4g5ciYzcc8HyIRTg/VltNH/OjCi86rD6Ihw WxIrS/hM7aUh9HdkLONC9ElBcYYSa7jjpKyVlwT85tjEjeCLX9fv9yg+beAf40Ap49Swv0 npVlDlhJo8WiYsMUv0uLcuCkDd6f6g0= ARC-Authentication-Results: i=1; imf18.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=Fz2a2sSn; dmarc=pass (policy=quarantine) header.from=redhat.com; spf=pass (imf18.hostedemail.com: domain of bfoster@redhat.com designates 170.10.129.124 as permitted sender) smtp.mailfrom=bfoster@redhat.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1763561312; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=+Yg19b3C0A9RgvXwEm9q22h/mHwyzJvPVT/d/nPBkXI=; b=Fz2a2sSnkSL0ab9QJe8UxI3XJWX0MF5iUmle/tNjEwFX3fbe3PsoetKdvp8khLub/B40hE aWB5tCJZ7Ma8E8PULdbwK7HnlS71N8+fom9E1SgvJOfOVtuKJHPAwNSzFeT4CfA/TnITQc sCscR29cHwE4aV96IehanF3q1Y8DAVQ= Received: from mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (ec2-35-165-154-97.us-west-2.compute.amazonaws.com [35.165.154.97]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-495-JwJzTWP2O7aaSa9fWhiWnA-1; Wed, 19 Nov 2025 09:08:30 -0500 X-MC-Unique: JwJzTWP2O7aaSa9fWhiWnA-1 X-Mimecast-MFC-AGG-ID: JwJzTWP2O7aaSa9fWhiWnA_1763561310 Received: from mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.17]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-08.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id B8A75180034D; Wed, 19 Nov 2025 14:08:29 +0000 (UTC) Received: from bfoster (unknown [10.22.64.29]) by mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id C44DA195608E; Wed, 19 Nov 2025 14:08:28 +0000 (UTC) Date: Wed, 19 Nov 2025 09:08:26 -0500 From: Brian Foster To: Baolin Wang Cc: linux-mm@kvack.org, Hugh Dickins Subject: Re: [PATCH v2 1/3] tmpfs: zero post-eof uptodate folios on swapout Message-ID: References: <20251112162522.412295-1-bfoster@redhat.com> <20251112162522.412295-2-bfoster@redhat.com> <9e696091-9c40-46cf-91b9-c95e1d38a1a8@linux.alibaba.com> MIME-Version: 1.0 In-Reply-To: <9e696091-9c40-46cf-91b9-c95e1d38a1a8@linux.alibaba.com> X-Scanned-By: MIMEDefang 3.0 on 10.30.177.17 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: FgBcDXXAm5mg80VkTpfLgPdvNadvP-M-m47hlG6tsrg_1763561310 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit X-Stat-Signature: ux44qwenntbhndcamnz6dw3ccxgc3qji X-Rspam-User: X-Rspamd-Queue-Id: 713BB1C0015 X-Rspamd-Server: rspam10 X-HE-Tag: 1763561313-790691 X-HE-Meta: U2FsdGVkX1+5cm2YJsHvD4i5oD0cJpgSFUGWdv4WJHJNEAYMjNBRc6HPoDg29yZRu0speRwKY/z98QfunzarIbZHWjV2pWGylr7m5mu8eS+qWgnaw9hThiRZjmC5G4KlDD+J2o2uTlK/eZVztDdPl0bFrJQGMm2OuTho24jS9EkA1gcloHgnf2JvOoTTa5iJfpnuJ2OMW6J5whGj1Z2Z/rESKjfbWwm11g/PSxe+MzaPBS0S+4xeXXb/9b7LRjr+SezKSp9azFGevyoyQkwr5XYsRCoVs2FZICmmmOLw4ByEv9K8qboaBzWVhDxykZRnsVeKxhDuspsLZa7EquBJtYx8F9NRid6VP3o1XPX9Qj6X7jVht289LsffTHFo7qgxL8K11uE0lsTiTkknLf69QC3b0AeCQOEL/WTi+oueyVBAmfDhmYau/AFkhcKBA7JSBLvL96VVeTtoW3P8cE0QpBEAGi4IQyllI6Hg9wZSQ/dj3LLLR6yWVCs+Ml81XhJO0eV5Bbb9JtQ90c0ho3RDS6snk+iR8oludG+ZGvfMm25Vz0kEGxPWwzUTSrkAYBYRhCKSy5RFIAIxzDAEXqpLNSZ9oTEL2kLs5D+ZHNnNJyPYeVHfW0nxGoWBiOMnqOz55ncuwff/17kHZ0jLc3q2KpLCUtf0dqOQwgVq12Jg7wK1Ip1UeDddf9IyaTBM8Gdm53hW5HsefzwV4Gnv4vPUft0MHRZPFCk5k8w8726QR16+OBa9k7zazU9L6CtgiBln2bTSU6KFcTDdGeEPMC03mbJfX/97LE4OJrMfUiLbJssZEwtC7RNXANDirE+/qFGBx+cLDjnzRUBYDWDmHZxLoDOBGixKmK0qR5uNPmV4fplxbDgt8m311+TXVEoTltFM1oT0OdX5ZB0/N2EkFdOz+asqy7EjJ1zWZem9Yqe0hzO/eAc9qLr4QuMD5rJFRfIETxqwjAl7Qtdk2YYh0F+ eGTZ3jTw X+HGhOo8bae0s1jxl+ubVW+bmLV5pvV9DJZ6lkPhGIlf2ZB5bBTmZlo88S9WpAqyEiIMe0OjnoWeDNjJ53NY1sHdZwcryOdpxXKDeQoAiCpRw2remZDNuF2GzCyZh4kS9wyj6d5O3WTHW2uJ3AUY/msSj+VLlvxEsH4dCpUxk3XIu1NP9B2X7r+I12olAvTC1WsQ141SmG8SVrzdFUYsDyEsnmZXq7ure1q9iZmOkveKVEHMpAhbP3VUnEy1cbK98CJ0yF02veoyrvh8cyCNqlxZ4CzgYxhK2M+KV2v+IjpgQ251Mw/lE/YY0p/PTnWp/xKZTs+HgjYUrDoLzDEh36NcAdAwiJdGJrcabPE4RulEd2sXKvlu2oZcirt1N2zDnGtYyiomtFhNLywX7i8WUIoWaETxA3HEUC/4AVbNBtELyJ+pXXrc+jDHQX+Vhlebysf32ukRUzPQFj5E= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Wed, Nov 19, 2025 at 11:53:41AM +0800, Baolin Wang wrote: > > > On 2025/11/18 22:39, Brian Foster wrote: > > On Tue, Nov 18, 2025 at 10:33:44AM +0800, Baolin Wang wrote: > > > > > > > > > On 2025/11/13 00:25, Brian Foster wrote: > > > > As a first step to facilitate efficient post-eof zeroing in tmpfs, > > > > zero post-eof uptodate folios at swap out time. This ensures that > > > > post-eof ranges are zeroed "on disk" (i.e. analogous to traditional > > > > pagecache writeback) and facilitates zeroing on file size changes by > > > > allowing it to not have to swap in. > > > > > > > > Note that shmem_writeout() already zeroes !uptodate folios so this > > > > introduces some duplicate logic. We'll clean this up in the next > > > > patch. > > > > > > > > Signed-off-by: Brian Foster > > > > --- > > > > mm/shmem.c | 19 +++++++++++++++++-- > > > > 1 file changed, 17 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/mm/shmem.c b/mm/shmem.c > > > > index 0a25ee095b86..5fb3c911894f 100644 > > > > --- a/mm/shmem.c > > > > +++ b/mm/shmem.c > > > > @@ -1577,6 +1577,8 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug, > > > > struct inode *inode = mapping->host; > > > > struct shmem_inode_info *info = SHMEM_I(inode); > > > > struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); > > > > + loff_t i_size = i_size_read(inode); > > > > + pgoff_t end_index = DIV_ROUND_UP(i_size, PAGE_SIZE); > > > > pgoff_t index; > > > > int nr_pages; > > > > bool split = false; > > > > @@ -1596,8 +1598,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug, > > > > * (unless fallocate has been used to preallocate beyond EOF). > > > > */ > > > > if (folio_test_large(folio)) { > > > > - index = shmem_fallocend(inode, > > > > - DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE)); > > > > + index = shmem_fallocend(inode, end_index); > > > > if ((index > folio->index && index < folio_next_index(folio)) || > > > > !IS_ENABLED(CONFIG_THP_SWAP)) > > > > split = true; > > > > @@ -1647,6 +1648,20 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug, > > > > folio_mark_uptodate(folio); > > > > } > > > > + /* > > > > + * Ranges beyond EOF must be zeroed at writeout time. This mirrors > > > > + * traditional writeback behavior and facilitates zeroing on file size > > > > + * changes without having to swap back in. > > > > + */ > > > > + if (folio_next_index(folio) >= end_index) { > > > > + size_t from = offset_in_folio(folio, i_size); > > > > + > > > > + if (index >= end_index) { > > > > + folio_zero_segment(folio, 0, folio_size(folio)); > > > > + } else if (from) > > > > + folio_zero_segment(folio, from, folio_size(folio)); > > > > + } > > > > > > As I mentioned before[1], if a large folio is beyond EOF, it will be split > > > in shmem_writeout(), and those small folios beyond EOF will be dropped and > > > freed in __folio_split(). Of course, there's another special case as Hugh > > > mentioned: when there's a 'fallocend' beyond i_size (e.g., fallocate()), it > > > will keep the pages allocated beyond EOF after the split. However, your > > > 'end_index' here does not consider 'fallocend,' so it seems to me that this > > > portion of the code doesn't actually take effect. > > > > > > > Hi Boalin, > > s/Boalin/Baolin :) > Sorry, Baolin! ;) > > > > So I get that split post-eof folios can fall off depending on fallocate > > status. I'm not sure what you mean by considering fallocend, however. > > ISTM that fallocend contributes to the boundary where we decide to split > > and/or preserve, but i_size is what is relevant for zeroing. It's not > > clear to me if you're suggesting the logic is potentially spurious, or > > this might not actually be zeroing correctly due to falloc interactions. > > Can you clarify the concern please? Thanks. > > Sorry for not being clear enough (for my quick response yesterday). After > thinking more, I want to divide this into 3 cases to clearly explain the > logic here: > No worries. Thanks for breaking it down. Much easier to discuss this way. > 1. Without fallocate(), if a large folio is beyond EOF (i.e. i_size), it > will be split in shmem_writeout(), and those small folios beyond EOF will be > dropped and freed in __folio_split(). So, your changes should also have no > impact, because after the split, ‘folio_next_index(folio)’ is definitely <= > ‘end_index’. So the logic is correct. > Ack, but we still want to zero any post-eof portion of a small folio straddling i_size... > 2. With fallocate(), If a large folio is beyond EOF (i.e. i_size) but > smaller than 'fallocend', the folio will not be split. So, we should zero > the post-EOF part. Because 'index' (now representing 'fallocend') is greater > than 'end_index', you are zeroing the entire large folio, which does not > seem correct to me. > Unless CONFIG_THP_SWAP is disabled (no idea how likely that is, seems like an arch thing), in which case it seems we always split (and then the split path will still use fallocend to determine whether to toss or preserve). But otherwise yes, partial zeroing should be the same for the large folio across EOF case, it just happens to be a larger folio size... > if (index >= end_index) { > folio_zero_segment(folio, 0, folio_size(folio)); > } else if ... > > I think you should only zero the range from 'from' to 'folio_size(folio)' of > this large folio in this case. Right? > However index is folio->index here, not fallocend. index is reassigned a bit further down the function just after the block try_split: lands in: index = folio->index; nr_pages = folio_nr_pages(folio); This wasn't introduced by this patch, FWIW, but I suppose we could make the fallocend block use a local for clarity. So given that, the logic is effectively if the folio starts at or beyond the first page size index beyond EOF, zero the whole thing. That seems pretty straightforward to me, so I'm not clear on why we'd need to consider whether the folio is large or not at this point. > 3. With fallocate(), If a large folio is beyond EOF (i.e. i_size) and also > beyond 'fallocend', the large folio will be split to small folios. If we > continue attempting to write out these small folios beyond EOF, we need to > zero the entire mall folios at this point. So, the logic looks correct > (because 'index' > 'end_index'). > Ack.. > Based on the above analysis, I believe the logic should be: > > if (folio_next_index(folio) >= end_index) { > size_t from = offset_in_folio(folio, i_size); > > if (!folio_test_large(folio) && index >= end_index) > folio_zero_segment(folio, 0, folio_size(folio)); > else if (from) > folio_zero_segment(folio, from, folio_size(folio)); > } > > The logic here is a bit complex, please correct me if I misunderstood you. > Hmm.. so I'm not really sure about the large folio check. Have you reviewed the next patch, by chance? It occurs to me that I probably split these two up wrongly. I probably should have split off the existing !uptodate zeroing into a separate hunk in patch 1 (i.e. as a non functional change, refactoring patch) and then introduce the functional change in patch 2. I'll try that for v3. But in the meantime, this is the logic after patch 2: /* * Ranges beyond EOF must be zeroed at writeout time. This mirrors * traditional writeback behavior and facilitates zeroing on file size * changes without having to swap back in. */ if (!folio_test_uptodate(folio) || folio_next_index(folio) >= end_index) { size_t from = offset_in_folio(folio, i_size); if (!folio_test_uptodate(folio) || index >= end_index) { folio_zero_segment(folio, 0, folio_size(folio)); flush_dcache_folio(folio); folio_mark_uptodate(folio); } else if (from) folio_zero_segment(folio, from, folio_size(folio)); } So factoring out the preexisting uptodate logic, this looks mostly equivalent to what you posted above with the exception of the large folio check. I could be wrong, but it kind of sounds like that is maybe due to confusion over the index value.. hm? In any event, I'm trying to make this logic as simple and clear as possible. The idea here is basically that any folio being written out that is either !uptodate or at least partially beyond EOF needs zeroing. The split logic earlier in the function simply dictates what folios make it to this point (to be written out vs. tossed) or not, and otherwise we don't really have to care about state wrt zeroing post-eof folio ranges (particularly if any of that splitting logic happens to change in the future). Let me know if I'm missing something. Thanks again. Brian