Date: Mon, 14 Jul 2025 10:38:57 -0400
From: Brian Foster
To: Baolin Wang
Cc: Hugh Dickins, linux-mm@kvack.org, Matthew Wilcox, Usama Arif
Subject: Re: [PATCH] tmpfs: zero post-eof folio range on file extension
In-Reply-To: <18c5d84b-1449-411e-8cd7-ee8c6af37677@linux.alibaba.com>
References: <20250625184930.269727-1-bfoster@redhat.com>
 <297e44e9-1b58-d7c4-192c-9408204ab1e3@google.com>
 <67f0461b-3359-41e7-a7cd-b059cbef4154@linux.alibaba.com>
 <097c0b07-1f43-51c3-3591-aaa2015226c2@google.com>
 <0224ed0f-d207-4c79-8c9d-f4915a91c11d@linux.alibaba.com>
 <18c5d84b-1449-411e-8cd7-ee8c6af37677@linux.alibaba.com>

On Mon, Jul 14, 2025 at 11:05:35AM +0800, Baolin Wang wrote:
> 
> 
> On 2025/7/12 04:15, Brian Foster wrote:
> > On Fri, Jul 11, 2025 at 12:08:16PM -0400, Brian Foster wrote:
> > > On Fri, Jul 11, 2025 at 11:50:05AM +0800, Baolin Wang wrote:
> > > > 
> > > > 
> > > > On 2025/7/11 06:20, Hugh Dickins wrote:
> > > > > On Thu, 10 Jul 2025, Baolin Wang wrote:
> > > > > > On 2025/7/9 15:57, Hugh Dickins wrote:
> > > > > ...
> > > > > > > 
> > > > > > > The problem is with huge pages (or large folios) in shmem_writeout():
> > > > > > > what goes in as a large folio may there have to be split into small
> > > > > > > pages; or it may be swapped out as one large folio, but fragmentation
> > > > > > > at swapin time demand that it be split into small pages when swapped in.
> > > > > > 
> > > > > > Good point.
> > > > > > 
> > > > > > > So, if there has been swapout since the large folio was modified beyond
> > > > > > > EOF, the folio that shmem_zero_eof() brings in does not guarantee what
> > > > > > > length needs to be zeroed.
> > > > > > > 
> > > > > > > We could set that aside as a deficiency to be fixed later on: that
> > > > > > > would not be unreasonable, but I'm guessing that won't satisfy you.
> > > > > > > 
> > > > > > > We could zero the maximum (the remainder of PMD size I believe) in
> > > > > > > shmem_zero_eof(): looping over small folios within the range, skipping
> > > > > > > !uptodate ones (but we do force them uptodate when swapping out, in
> > > > > > > order to keep the space reservation). TBH I've ignored that as a bad
> > > > > > > option, but it doesn't seem so bad to me now: ugly, but maybe not bad.
> > > > > > 
> > > > > > However, IIUC, if the large folios are split in shmem_writeout(), and those
> > > > > > small folios which beyond EOF will be dropped and freed in
> > > > > > __split_unmapped_folio(), should we still consider them?
> > > > > 
> > > > > You're absolutely right about the normal case, and thank you for making
> > > > > that point. Had I forgotten that when writing? Or was I already
> > > > > jumping ahead to the problem case? I don't recall, but was certainly
> > > > > wrong for not mentioning it.
> > > > > 
> > > > > The abnormal case is when there's a "fallocend" beyond i_size (or beyond
> > > > > the small page extent spanning i_size) i.e. fallocate() has promised to
> > > > > keep pages allocated beyond EOF. In that case, __split_unmapped_folio()
> > > > > is keeping those pages.
> > > > 
> > > > Ah, yes, you are right.
> > > > 
> > > > > There could well be some optimization, involving fallocend, to avoid
> > > > > zeroing more than necessary; but I wouldn't want to say what in a hurry,
> > > > > it's quite confusing!
> > > > 
> > > > Like you said, not only can a large folio split occur during swapout, but it
> > > > can also happen during a punch hole operation. Moreover, considering the
> > > > abnormal case of fallocate() you mentioned, we should find a more common
> > > > approach to mitigate the impact of fallocate().
> > > > 
> > > > For instance, when splitting, we could clear the 'uptodate' flag for these
> > > > EOF small folios that are beyond 'i_size' but less than the 'fallocend', so
> > > > that these EOF small folios will be re-initialized if they are used again.
> > > > What do you think?
> > > > 
> > > ...
> > > 
> > > Hi Baolin,
> > > 
> > > So I'm still digesting Hugh's clarification wrt the falloc case, but I'm
> > > a little curious here given that I intended to implement the writeout
> > > zeroing suggestion regardless of that discussion..
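
To make Hugh's "zero the maximum" option quoted above a bit more concrete:
looping over the small folios between EOF and the end of the PMD-sized
extent might look roughly like the sketch below. The helper name, the
HPAGE_PMD_SIZE bound and the locking are illustrative assumptions only,
not the posted patch:

	/* Illustrative sketch only: name, bound and locking are assumptions. */
	static void shmem_zero_post_eof(struct inode *inode)
	{
		struct address_space *mapping = inode->i_mapping;
		loff_t i_size = i_size_read(inode);
		pgoff_t index = i_size >> PAGE_SHIFT;
		pgoff_t end = round_up(i_size, HPAGE_PMD_SIZE) >> PAGE_SHIFT;

		while (index < end) {
			struct folio *folio = filemap_get_folio(mapping, index);

			if (IS_ERR(folio)) {
				index++;
				continue;
			}
			folio_lock(folio);
			/* skip !uptodate folios; they are zeroed when first used */
			if (folio_test_uptodate(folio)) {
				size_t from = 0;

				/* the folio straddling EOF only needs its tail zeroed */
				if (folio_pos(folio) < i_size)
					from = offset_in_folio(folio, i_size);
				folio_zero_segment(folio, from, folio_size(folio));
			}
			index = folio_next_index(folio);
			folio_unlock(folio);
			folio_put(folio);
		}
	}
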
> > > 
> > > I see the hole punch case falls into truncate_inode_[partial_]folio(),
> > > which looks to me like it handles zeroing. The full truncate case just
> > > tosses the folio of course, but the partial case zeroes according to the
> > > target range prior to doing any potential split from that codepath.
> > > 
> > > That looks kind of similar to what I have prototyped for the
> > > shmem_writeout() case: tail zero the EOF straddling folio before falling
> > > into the split call. [1] Does that not solve the same general issue in
> > > the swapout path as potentially clearing uptodate via the split? I'm
> > > mainly trying to understand if that is just a potential alternative
> > > approach, or if this solves a corner case that I'm missing. Hm?
> > > 
> > 
> > Ok, after playing around a bit I think I see what I was missing. I
> > misinterpreted that the punch case is only going to zero in the target
> > range of the punch. So if you have something like a 1M file backed by an
> > fallocated 2M folio, map write the whole 2M, then punch the last 4k of
> > the file, you end up with the non-zeroed smaller folios beyond EOF. This
> > means that even with a zero of the eof folio, a truncate up over those
> > folios won't be zeroed.
> 
> Right.
> 
> > I need to think on it some more, but I take it this means that
> > essentially 1. any uptodate range/folio beyond EOF needs to be zeroed on
> > swapout (which I think is analogous to your earlier prototype logic) [1]
> > and 2. shmem_zero_eof() needs to turn into something like
> > shmem_zero_range().
> 
> Like we discussed, only considering swapout is not enough; it's necessary to
> consider all cases of large folio splits, such as swapout, punch hole,
> migration, shmem shrinker, etc. In the future, if there are other cases of
> splits, the impact on EOF folios will also need to be considered (should
> zero them before split). IMHO, this could lead to complexity and
> uncontrollability.
> 

Ok. FWIW, the purpose of the swap time zeroing in this case is not
necessarily to be a solution purely on its own. Rather (and to Hugh's
earlier point about the zeroing needing to cover a range vs. just relying
on the eof folio size), it's probably more ideal if that eof zeroing code
can assume post-eof swapped out folios are always zeroed.

But anyways, I'll try to shoot for something like that for a v2. I also
want to see if I can figure a way for a bit more thorough testing. We can
revisit from there if there are better options and/or further gaps to
consider.

Thanks again for the comments.

Brian

> So my suggestion is to address this issue during the split process, and it
> seems feasible to make EOF small folios not 'uptodate' during the split.
> Anyway, you can investigate further.
> 
> > The latter would zero a range of uptodate folios between current EOF and
> > the start of the extending operation, rather than just the EOF folio.
> > This is actually pretty consistent with traditional fs (see
> > xfs_file_write_zero_eof() for example) behavior. I was originally
> > operating under assumption that this wasn't necessary for tmpfs given
> > traditional pagecache post-eof behavior, but that has clearly proven
> > false.
> > 
> > Brian
> > 
> > [1] I'm also wondering if another option here is to just clear_uptodate
> > any uptodate folio that fully starts beyond EOF. I.e., if the folio
> > straddles EOF then partial zero as below, if the folio is beyond EOF
> > then clear uptodate and let the existing code further down zero it.
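
For what it's worth, the alternative floated in [1] just above might boil
down to something like the following at swapout time. This is only a
sketch under the same assumptions as the prototype quoted below (the
helper name is made up and i_size is sampled by the caller); a folio that
starts fully beyond EOF simply drops its uptodate state so the existing
!uptodate handling further down in shmem_writeout() re-zeroes it:

	/* Sketch only: hypothetical helper, not the posted patch. */
	static void shmem_writeout_sanitize_eof(struct folio *folio, loff_t i_size)
	{
		if (!folio_test_uptodate(folio))
			return;

		if (folio_pos(folio) >= i_size) {
			/* fully post-EOF: let the !uptodate path re-zero it later */
			folio_clear_uptodate(folio);
		} else if (folio_pos(folio) + folio_size(folio) > i_size) {
			/* straddles EOF: zero only the tail beyond i_size */
			folio_zero_segment(folio, offset_in_folio(folio, i_size),
					   folio_size(folio));
		}
	}
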
> > 
> > > If the former, I suspect we'd need to tail zero on writeout regardless
> > > of folio size. Given that, and IIUC that clearing uptodate as such will
> > > basically cause the split folios to fall back into the !uptodate -> zero
> > > -> mark_uptodate sequence of shmem_writeout(), I wonder what the
> > > advantage of that is. It feels a bit circular to me when considered
> > > along with the tail zeroing below, but again I'm peeling away at
> > > complexity as I go here.. ;) Thoughts?
> > > 
> > > Brian
> > > 
> > > [1] prototype writeout logic:
> > > 
> > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > index 634e499b6197..535021ae5a2f 100644
> > > --- a/mm/shmem.c
> > > +++ b/mm/shmem.c
> > > @@ -1579,7 +1579,8 @@ int shmem_writeout(struct folio *folio, struct writeback_control *wbc)
> > >  	struct inode *inode = mapping->host;
> > >  	struct shmem_inode_info *info = SHMEM_I(inode);
> > >  	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
> > > -	pgoff_t index;
> > > +	loff_t i_size = i_size_read(inode);
> > > +	pgoff_t index = i_size >> PAGE_SHIFT;
> > >  	int nr_pages;
> > >  	bool split = false;
> > > @@ -1592,6 +1593,17 @@ int shmem_writeout(struct folio *folio, struct writeback_control *wbc)
> > >  	if (!total_swap_pages)
> > >  		goto redirty;
> > > +	/*
> > > +	 * If the folio straddles EOF, the tail portion must be zeroed on
> > > +	 * every swapout.
> > > +	 */
> > > +	if (folio_test_uptodate(folio) &&
> > > +	    folio->index <= index && folio_next_index(folio) > index) {
> > > +		size_t from = offset_in_folio(folio, i_size);
> > > +		if (from)
> > > +			folio_zero_segment(folio, from, folio_size(folio));
> > > +	}
> > > +
> > >  	/*
> > >  	 * If CONFIG_THP_SWAP is not enabled, the large folio should be
> > >  	 * split when swapping.
> > > 
> > > 
> > > 
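
And for contrast, Baolin's split-time suggestion (clearing uptodate on the
small folios that end up entirely beyond i_size when a large folio is
split, so they get re-initialized if reused) might reduce to something
like the sketch below. The hook point and helper name are assumptions;
where exactly this would sit relative to __split_unmapped_folio() is part
of what would need investigating:

	/*
	 * Sketch only (hypothetical helper): after a shmem large folio has
	 * been split, drop the uptodate state of any resulting small folio
	 * that lies entirely beyond i_size, so it is re-zeroed if it is ever
	 * used again.
	 */
	static void shmem_split_clear_eof_uptodate(struct folio *folio,
						   struct inode *inode)
	{
		loff_t i_size = i_size_read(inode);

		if (folio_pos(folio) >= i_size && folio_test_uptodate(folio))
			folio_clear_uptodate(folio);
	}
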