From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Fri, 11 Jul 2025 16:15:14 -0400
From: Brian Foster <bfoster@redhat.com>
To: Baolin Wang
Cc: Hugh Dickins, linux-mm@kvack.org, Matthew Wilcox, Usama Arif
Subject: Re: [PATCH] tmpfs: zero post-eof folio range on file extension
Message-ID: 
References: <20250625184930.269727-1-bfoster@redhat.com>
 <297e44e9-1b58-d7c4-192c-9408204ab1e3@google.com>
 <67f0461b-3359-41e7-a7cd-b059cbef4154@linux.alibaba.com>
 <097c0b07-1f43-51c3-3591-aaa2015226c2@google.com>
 <0224ed0f-d207-4c79-8c9d-f4915a91c11d@linux.alibaba.com>
MIME-Version: 1.0
In-Reply-To: 
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
On Fri, Jul 11, 2025 at 12:08:16PM -0400, Brian Foster wrote:
> On Fri, Jul 11, 2025 at 11:50:05AM +0800, Baolin Wang wrote:
> >
> >
> > On 2025/7/11 06:20, Hugh Dickins wrote:
> > > On Thu, 10 Jul 2025, Baolin Wang wrote:
> > > > On 2025/7/9 15:57, Hugh Dickins wrote:
> > > ...
> > > > >
> > > > > The problem is with huge pages (or large folios) in shmem_writeout():
> > > > > what goes in as a large folio may there have to be split into small
> > > > > pages; or it may be swapped out as one large folio, but fragmentation
> > > > > at swapin time demands that it be split into small pages when swapped in.
> > > >
> > > > Good point.
> > > >
> > > > > So, if there has been swapout since the large folio was modified beyond
> > > > > EOF, the folio that shmem_zero_eof() brings in does not guarantee what
> > > > > length needs to be zeroed.
> > > > >
> > > > > We could set that aside as a deficiency to be fixed later on: that
> > > > > would not be unreasonable, but I'm guessing that won't satisfy you.
> > > > >
> > > > > We could zero the maximum (the remainder of PMD size I believe) in
> > > > > shmem_zero_eof(): looping over small folios within the range, skipping
> > > > > !uptodate ones (but we do force them uptodate when swapping out, in
> > > > > order to keep the space reservation). TBH I've ignored that as a bad
> > > > > option, but it doesn't seem so bad to me now: ugly, but maybe not bad.
> > > >
> > > > However, IIUC, if the large folios are split in shmem_writeout(), those
> > > > small folios beyond EOF will be dropped and freed in
> > > > __split_unmapped_folio(), so should we still consider them?
> > >
> > > You're absolutely right about the normal case, and thank you for making
> > > that point. Had I forgotten that when writing? Or was I already
> > > jumping ahead to the problem case? I don't recall, but I was certainly
> > > wrong for not mentioning it.
> > >
> > > The abnormal case is when there's a "fallocend" beyond i_size (or beyond
> > > the small page extent spanning i_size), i.e. fallocate() has promised to
> > > keep pages allocated beyond EOF. In that case, __split_unmapped_folio()
> > > keeps those pages.
> >
> > Ah, yes, you are right.
> >
> > > There could well be some optimization, involving fallocend, to avoid
> > > zeroing more than necessary; but I wouldn't want to say what in a hurry,
> > > it's quite confusing!
> >
> > Like you said, not only can a large folio split occur during swapout, but it
> > can also happen during a punch hole operation.
> > Moreover, considering the
> > abnormal case of fallocate() you mentioned, we should find a more general
> > approach to mitigate the impact of fallocate().
> >
> > For instance, when splitting, we could clear the 'uptodate' flag for those
> > small folios that are beyond 'i_size' but below 'fallocend', so that
> > they will be re-initialized if they are used again.
> > What do you think?
> >
...
> Hi Baolin,
>
> So I'm still digesting Hugh's clarification wrt the falloc case, but I'm
> a little curious here, given that I intended to implement the writeout
> zeroing suggestion regardless of that discussion..
>
> I see the hole punch case falls into truncate_inode_[partial_]folio(),
> which looks to me like it handles zeroing. The full truncate case just
> tosses the folio of course, but the partial case zeroes according to the
> target range prior to doing any potential split from that codepath.
>
> That looks kind of similar to what I have prototyped for the
> shmem_writeout() case: tail zero the EOF-straddling folio before falling
> into the split call. [1] Does that not solve the same general issue in
> the swapout path as potentially clearing uptodate via the split? I'm
> mainly trying to understand if that is just a potential alternative
> approach, or if it solves a corner case that I'm missing. Hm?
>

Ok, after playing around a bit I think I see what I was missing. I had
misinterpreted that the punch case is only going to zero within the target
range of the punch. So if you have something like a 1M file backed by an
fallocated 2M folio, mmap-write the whole 2M, then punch out the last 4k
of the file, you end up with non-zeroed smaller folios beyond EOF. This
means that even after zeroing the EOF folio, a truncate up over those
folios won't zero them.

I need to think on it some more, but I take it this means that
essentially 1.
any uptodate range/folio beyond EOF needs to be zeroed on
swapout (which I think is analogous to your earlier prototype logic) [1],
and 2. shmem_zero_eof() needs to turn into something like
shmem_zero_range(). The latter would zero a range of uptodate folios
between the current EOF and the start of the extending operation, rather
than just the EOF folio.

This is actually pretty consistent with traditional fs behavior (see
xfs_file_write_zero_eof() for example). I was originally operating under
the assumption that this wasn't necessary for tmpfs given traditional
pagecache post-EOF behavior, but that has clearly proven false.

Brian

[1] I'm also wondering if another option here is to just clear uptodate
on any uptodate folio that starts fully beyond EOF. I.e., if the folio
straddles EOF then partial zero as below; if the folio is entirely beyond
EOF then clear uptodate and let the existing code further down zero it.

> If the former, I suspect we'd need to tail zero on writeout regardless
> of folio size. Given that, and IIUC that clearing uptodate as such will
> basically cause the split folios to fall back into the !uptodate -> zero
> -> mark_uptodate sequence of shmem_writeout(), I wonder what the
> advantage of that is. It feels a bit circular to me when considered
> along with the tail zeroing below, but again I'm peeling away at
> complexity as I go here.. ;) Thoughts?
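[Editorial aside: the range a shmem_zero_range()-style helper would have to
visit, from the old EOF up to the start of the extending operation, reduces
to simple page-index arithmetic. The following is only a hedged userspace
sketch of that computation; zero_range_pages() and PAGE_SHIFT fixed at 12
are illustrative stand-ins, not kernel API.]

```c
#include <stdint.h>

#define PAGE_SHIFT 12UL	/* assumption: 4k pages, for illustration only */

/*
 * Hypothetical helper: given the current i_size and the offset where an
 * extending write (or truncate up) begins, return the page-index range
 * [*first, *last] that a shmem_zero_range()-style helper would visit:
 * from the page containing the old EOF through the page just before the
 * start of the operation. Returns 0 when the operation is not extending
 * and nothing beyond EOF is exposed.
 */
static int zero_range_pages(uint64_t old_isize, uint64_t op_start,
			    uint64_t *first, uint64_t *last)
{
	if (op_start <= old_isize)
		return 0;	/* not extending: no post-EOF range exposed */
	*first = old_isize >> PAGE_SHIFT;
	*last = (op_start - 1) >> PAGE_SHIFT;
	return 1;
}
```

The kernel-side analogue would walk each uptodate folio in [first, last]
and zero its post-EOF portion.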
>
> Brian
>
> [1] prototype writeout logic:
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 634e499b6197..535021ae5a2f 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1579,7 +1579,8 @@ int shmem_writeout(struct folio *folio, struct writeback_control *wbc)
>  	struct inode *inode = mapping->host;
>  	struct shmem_inode_info *info = SHMEM_I(inode);
>  	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
> -	pgoff_t index;
> +	loff_t i_size = i_size_read(inode);
> +	pgoff_t index = i_size >> PAGE_SHIFT;
>  	int nr_pages;
>  	bool split = false;
>
> @@ -1592,6 +1593,17 @@ int shmem_writeout(struct folio *folio, struct writeback_control *wbc)
>  	if (!total_swap_pages)
>  		goto redirty;
>
> +	/*
> +	 * If the folio straddles EOF, the tail portion must be zeroed on
> +	 * every swapout.
> +	 */
> +	if (folio_test_uptodate(folio) &&
> +	    folio->index <= index && folio_next_index(folio) > index) {
> +		size_t from = offset_in_folio(folio, i_size);
> +		if (from)
> +			folio_zero_segment(folio, from, folio_size(folio));
> +	}
> +
>  	/*
>  	 * If CONFIG_THP_SWAP is not enabled, the large folio should be
>  	 * split when swapping.
>
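[Editorial aside: the straddle check in the quoted prototype can be
exercised outside the kernel. The sketch below mocks a folio as a flat
buffer and repeats the same arithmetic: the EOF page index is
i_size >> PAGE_SHIFT, a folio straddles EOF iff
folio->index <= index < folio_next_index(folio), and the zeroed tail
starts at i_size's byte offset within the folio (a no-op when i_size is
folio-aligned). struct mock_folio and zero_eof_tail() are hypothetical
userspace stand-ins, not kernel interfaces.]

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SHIFT 12UL	/* assumption: 4k pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Hypothetical userspace stand-in for a folio: a run of pages in a buffer. */
struct mock_folio {
	uint64_t index;		/* first page index covered by the folio */
	unsigned int order;	/* folio spans 1 << order pages */
	unsigned char *data;
};

static size_t mock_folio_size(const struct mock_folio *f)
{
	return (size_t)PAGE_SIZE << f->order;
}

static uint64_t mock_folio_next_index(const struct mock_folio *f)
{
	return f->index + (1ULL << f->order);
}

/*
 * Mirror of the prototype's check: if the folio covers the page containing
 * i_size, zero from i_size's offset within the folio to the folio's end.
 * Returns 1 if the folio straddled EOF, 0 otherwise.
 */
static int zero_eof_tail(struct mock_folio *f, uint64_t i_size)
{
	uint64_t eof_index = i_size >> PAGE_SHIFT;
	size_t from;

	if (f->index > eof_index || mock_folio_next_index(f) <= eof_index)
		return 0;	/* folio does not straddle EOF */

	from = (size_t)(i_size - (f->index << PAGE_SHIFT));
	if (from)
		memset(f->data + from, 0, mock_folio_size(f) - from);
	return 1;
}
```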