Date: Thu, 10 Jul 2025 16:02:01 -0700 (PDT)
From: Hugh Dickins <hughd@google.com>
To: Brian Foster
Cc: Hugh Dickins, linux-mm@kvack.org, Baolin Wang, Matthew Wilcox, Usama Arif
Subject: Re: [PATCH] tmpfs: zero post-eof folio range on file extension
References: <20250625184930.269727-1-bfoster@redhat.com> <297e44e9-1b58-d7c4-192c-9408204ab1e3@google.com>
On Thu, 10 Jul 2025, Brian Foster wrote:
> On Wed, Jul 09, 2025 at 12:57:35AM -0700, Hugh Dickins wrote:
...
> > The problem is with huge pages (or large folios) in shmem_writeout():
> > what goes in as a large folio may there have to be split into small
> > pages; or it may be swapped out as one large folio, but fragmentation
> > at swapin time demands that it be split into small pages when swapped
> > in.
> >
> > So, if there has been swapout since the large folio was modified
> > beyond EOF, the folio that shmem_zero_eof() brings in does not
> > guarantee what length needs to be zeroed.
> >
> > We could set that aside as a deficiency to be fixed later on: that
> > would not be unreasonable, but I'm guessing that won't satisfy you.
>
> So I was reading through some of this code yesterday and playing around
> with forcing swapout of an EOF folio, and I had the same observation as
> noted in Baolin's followup: it looks like large folios are always split
> across EOF, at which point post-eof folios are dropped, because
> generally the pagecache doesn't track post-eof folios.

Generally yes, but as in reply to Baolin, not in the fallocend case.

> A quick experiment to map write to a 2MB folio in a 1MB sized file
> shows that the folio is indeed split on swapout.
> It looks like the large folio
> is not immediately reconstructed on swapin, but rather something in the
> background reconstitutes it such that once the post-eof range is
> accessible again, it is effectively zeroed. I'm assuming this is not
> due to explicit zeroing, but rather a side effect of those post-eof
> folios being tossed (vs. swapped) and reallocated, but I could
> certainly be wrong about that.

Sounds right: that will have been khugepaged reconstituting it.

> > We could zero the maximum (the remainder of PMD size I believe) in
> > shmem_zero_eof(): looping over small folios within the range,
> > skipping !uptodate ones (but we do force them uptodate when swapping
> > out, in order to keep the space reservation). TBH I've ignored that
> > as a bad option, but it doesn't seem so bad to me now: ugly, but
> > maybe not bad.

I have to confess that in the meantime I've grown rather to think I was
too obsessed with doing it at swapout, and that this "ugly" solution is
better. But I expect you'll try it out (in mind or in code) one way and
the other, and make your own decision which way is better.

A realization which pushed me in this direction, not decisive but a
push: there can be other reasons for the huge page getting split, not
just swapout and swapin. Notably (only?) hole-punch somewhere in that
EOF-spanning huge page. Could be before the EOF or after: in either
case the huge page is (likely to be) split into small pages, and
shmem_zero_eof()'s folio_size(folio) gives too small an estimate of
what might need zeroing.

That could be fixed with a further shmem_zero_eof() call somewhere in
the hole-punching path; but that won't be necessary if shmem_zero_eof()
knows to go beyond small folio size (in the fallocend case only?
perhaps, but I haven't thought it through).
> > The solution I've had in mind (and pursue in comments below) is to
> > do the EOF zeroing in shmem_writeout() before it splits; and then
> > avoid swapin in shmem_zero_eof() when i_size is raised.
>
> Indeed, this is similar to traditional writeback behavior, in that
> fully post-eof folios are skipped (presumed to be racing with a
> truncate) and a straddling EOF folio is partially zeroed at writeback
> time.
>
> I actually observed this difference in behavior when first looking
> into this issue on tmpfs, but I didn't have enough context to draw the
> parallel to swapout, so thanks for bringing this up.
>
> > That solution was partly inspired by the shmem symlink uninit-value
> > bug
> > https://lore.kernel.org/linux-mm/670793eb.050a0220.8109b.0003.GAE@google.com/
> > which I haven't rushed to fix, but which ought to be fixed along with
> > this one (by "along with" I don't mean that both have to be fixed in
> > one single patch, but it makes sense to consider them together). I
> > was inclined not to zero the whole page in shmem_symlink(), but to
> > zero before swapout.
>
> I'll have to take a closer look at that one..

And I've grown towards thinking (as I expect everybody else would) that
we should simply zero the rest of the page at shmem_symlink() time. "An
abundance of caution" makes me afraid to add the overhead there, but
maybe the right thing to do is the obvious thing, and make it more
complicated if/when anyone notices and complains.

> > It worries me that an EOF page might be swapped out and in a
> > thousand times, but i_size set only once: I'm going for a solution
> > which memsets a thousand times rather than once? But if that
> > actually shows up as an issue in any workload, then we can add a
> > shmem inode flag to say whether the EOF folio has been exposed via
> > mmap (or symlink) so may need zeroing.
> >
> > What's your preference? My comments below assume the latter
> > solution, but that may be wrong.
> I actually think swapout time zeroing is a nice functional improvement
> over this current patch. I'd be less inclined to think that frequent
> swap cycles are more problematic than swapping folios in (and
> presumably back out) just for partial zeroing due to file changes that
> don't even necessarily need the associated data. This is also
> generally more consistent with traditional fs/pagecache behavior,
> which IMO is a good thing.
>
> So I suppose what I'm saying is that I like the prospective approach
> to also zero at shmem_writeout() and instead not swap in folios purely
> for partial eof zeroing purposes, just perhaps not for the exact
> reasons stated here. Of course the advantage could be that the code
> looks the same regardless, so if that folio splitting behavior ever
> changed in the future, the zeroing code would hopefully continue to do
> the right thing. Thoughts?

Whereas I liked the way you do it at well-defined user-call times,
rather than when the kernel behind-the-scenes decides it might want to
swap out the page. But I did check with ext4, and verified there that
the post-EOF non-data vanishes without user-call intervention, I assume
on writeback as expected. So although I like your choice of user-call
times, it's not a functional requirement at all.

Looks like we've neatly exchanged positions. After you, Alphonse!

Hugh