From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <18c5d84b-1449-411e-8cd7-ee8c6af37677@linux.alibaba.com>
Date: Mon, 14 Jul 2025 11:05:35 +0800
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH] tmpfs: zero post-eof folio range on file extension
To: Brian Foster
Cc: Hugh Dickins, linux-mm@kvack.org, Matthew Wilcox, Usama Arif
References: <20250625184930.269727-1-bfoster@redhat.com>
 <297e44e9-1b58-d7c4-192c-9408204ab1e3@google.com>
 <67f0461b-3359-41e7-a7cd-b059cbef4154@linux.alibaba.com>
 <097c0b07-1f43-51c3-3591-aaa2015226c2@google.com>
 <0224ed0f-d207-4c79-8c9d-f4915a91c11d@linux.alibaba.com>
From: Baolin Wang
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
Precedence: bulk

On 2025/7/12 04:15, Brian Foster wrote:
> On Fri, Jul 11, 2025 at 12:08:16PM -0400, Brian Foster wrote:
>> On Fri, Jul 11, 2025 at 11:50:05AM +0800,
>> Baolin Wang wrote:
>>>
>>>
>>> On 2025/7/11 06:20, Hugh Dickins wrote:
>>>> On Thu, 10 Jul 2025, Baolin Wang wrote:
>>>>> On 2025/7/9 15:57, Hugh Dickins wrote:
>>>> ...
>>>>>>
>>>>>> The problem is with huge pages (or large folios) in shmem_writeout():
>>>>>> what goes in as a large folio may there have to be split into small
>>>>>> pages; or it may be swapped out as one large folio, but fragmentation
>>>>>> at swapin time demand that it be split into small pages when swapped in.
>>>>>
>>>>> Good point.
>>>>>
>>>>>> So, if there has been swapout since the large folio was modified beyond
>>>>>> EOF, the folio that shmem_zero_eof() brings in does not guarantee what
>>>>>> length needs to be zeroed.
>>>>>>
>>>>>> We could set that aside as a deficiency to be fixed later on: that
>>>>>> would not be unreasonable, but I'm guessing that won't satisfy you.
>>>>>>
>>>>>> We could zero the maximum (the remainder of PMD size I believe) in
>>>>>> shmem_zero_eof(): looping over small folios within the range, skipping
>>>>>> !uptodate ones (but we do force them uptodate when swapping out, in
>>>>>> order to keep the space reservation). TBH I've ignored that as a bad
>>>>>> option, but it doesn't seem so bad to me now: ugly, but maybe not bad.
>>>>>
>>>>> However, IIUC, if the large folios are split in shmem_writeout(), and those
>>>>> small folios which beyond EOF will be dropped and freed in
>>>>> __split_unmapped_folio(), should we still consider them?
>>>>
>>>> You're absolutely right about the normal case, and thank you for making
>>>> that point. Had I forgotten that when writing? Or was I already
>>>> jumping ahead to the problem case? I don't recall, but was certainly
>>>> wrong for not mentioning it.
>>>>
>>>> The abnormal case is when there's a "fallocend" beyond i_size (or beyond
>>>> the small page extent spanning i_size) i.e. fallocate() has promised to
>>>> keep pages allocated beyond EOF. In that case, __split_unmapped_folio()
>>>> is keeping those pages.
>>>
>>> Ah, yes, you are right.
>>>
>>>> There could well be some optimization, involving fallocend, to avoid
>>>> zeroing more than necessary; but I wouldn't want to say what in a hurry,
>>>> it's quite confusing!
>>>
>>> Like you said, not only can a large folio split occur during swapout, but it
>>> can also happen during a punch hole operation. Moreover, considering the
>>> abnormal case of fallocate() you mentioned, we should find a more common
>>> approach to mitigate the impact of fallocate().
>>>
>>> For instance, when splitting, we could clear the 'uptodate' flag for these
>>> EOF small folios that are beyond 'i_size' but less than the 'fallocend', so
>>> that these EOF small folios will be re-initialized if they are used again.
>>> What do you think?
>>>
>> ...
>>
>> Hi Baolin,
>>
>> So I'm still digesting Hugh's clarification wrt the falloc case, but I'm
>> a little curious here given that I intended to implement the writeout
>> zeroing suggestion regardless of that discussion..
>>
>> I see the hole punch case falls into truncate_inode_[partial_]folio(),
>> which looks to me like it handles zeroing. The full truncate case just
>> tosses the folio of course, but the partial case zeroes according to the
>> target range prior to doing any potential split from that codepath.
>>
>> That looks kind of similar to what I have prototyped for the
>> shmem_writeout() case: tail zero the EOF straddling folio before falling
>> into the split call. [1] Does that not solve the same general issue in
>> the swapout path as potentially clearing uptodate via the split? I'm
>> mainly trying to understand if that is just a potential alternative
>> approach, or if this solves a corner case that I'm missing. Hm?
>>
>
> Ok, after playing around a bit I think I see what I was missing. I
> misinterpreted that the punch case is only going to zero in the target
> range of the punch.
> So if you have something like a 1M file backed by an
> fallocated 2M folio, map write the whole 2M, then punch the last 4k of
> the file, you end up with the non-zeroed smaller folios beyond EOF. This
> means that even with a zero of the eof folio, a truncate up over those
> folios won't be zeroed.

Right.

> I need to think on it some more, but I take it this means that
> essentially 1. any uptodate range/folio beyond EOF needs to be zeroed on
> swapout (which I think is analogous to your earlier prototype logic) [1]
> and 2. shmem_zero_eof() needs to turn into something like
> shmem_zero_range().

As we discussed, considering only swapout is not enough; all cases of
large folio splits need to be covered, such as swapout, punch hole,
migration, the shmem shrinker, etc. If other split cases are added in
the future, their impact on EOF folios will also need to be considered
(the folios should be zeroed before the split). IMHO, that makes the
approach complex and hard to keep under control.

So my suggestion is to address this issue in the split process itself:
it seems feasible to make the small folios beyond EOF not 'uptodate'
during the split. Anyway, you can investigate further.

> The latter would zero a range of uptodate folios between current EOF and
> the start of the extending operation, rather than just the EOF folio.
> This is actually pretty consistent with traditional fs (see
> xfs_file_write_zero_eof() for example) behavior. I was originally
> operating under assumption that this wasn't necessary for tmpfs given
> traditional pagecache post-eof behavior, but that has clearly proven
> false.
>
> Brian
>
> [1] I'm also wondering if another option here is to just clear_uptodate
> any uptodate folio that fully starts beyond EOF. I.e., if the folio
> straddles EOF then partial zero as below, if the folio is beyond EOF
> then clear uptodate and let the existing code further down zero it.
>
>> If the former, I suspect we'd need to tail zero on writeout regardless
>> of folio size.
>> Given that, and IIUC that clearing uptodate as such will
>> basically cause the split folios to fall back into the !uptodate -> zero
>> -> mark_uptodate sequence of shmem_writeout(), I wonder what the
>> advantage of that is. It feels a bit circular to me when considered
>> along with the tail zeroing below, but again I'm peeling away at
>> complexity as I go here.. ;) Thoughts?
>>
>> Brian
>>
>> [1] prototype writeout logic:
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 634e499b6197..535021ae5a2f 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -1579,7 +1579,8 @@ int shmem_writeout(struct folio *folio, struct writeback_control *wbc)
>>  	struct inode *inode = mapping->host;
>>  	struct shmem_inode_info *info = SHMEM_I(inode);
>>  	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
>> -	pgoff_t index;
>> +	loff_t i_size = i_size_read(inode);
>> +	pgoff_t index = i_size >> PAGE_SHIFT;
>>  	int nr_pages;
>>  	bool split = false;
>>  
>> @@ -1592,6 +1593,17 @@ int shmem_writeout(struct folio *folio, struct writeback_control *wbc)
>>  	if (!total_swap_pages)
>>  		goto redirty;
>>  
>> +	/*
>> +	 * If the folio straddles EOF, the tail portion must be zeroed on
>> +	 * every swapout.
>> +	 */
>> +	if (folio_test_uptodate(folio) &&
>> +	    folio->index <= index && folio_next_index(folio) > index) {
>> +		size_t from = offset_in_folio(folio, i_size);
>> +		if (from)
>> +			folio_zero_segment(folio, from, folio_size(folio));
>> +	}
>> +
>>  	/*
>>  	 * If CONFIG_THP_SWAP is not enabled, the large folio should be
>>  	 * split when swapping.
>>
>>
>
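
For illustration only, the straddle check in the quoted prototype can be
modelled in userspace. This is a rough sketch, not kernel code: struct
toy_folio, toy_zero_eof_tail() and the TOY_PAGE_* constants are
hypothetical stand-ins for struct folio, the folio_zero_segment() call
and PAGE_SHIFT/PAGE_SIZE, and it assumes the folio is naturally aligned
so that the offset of i_size within the folio is i_size minus the folio
start.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)

/* Hypothetical stand-in for struct folio: a run of pages in a file. */
struct toy_folio {
	unsigned long index;    /* first page index covered by the folio */
	unsigned long nr_pages; /* 1 for a small folio */
	unsigned char *data;    /* nr_pages * TOY_PAGE_SIZE bytes */
};

/*
 * Zero the part of @folio that lies beyond @i_size, but only when the
 * folio straddles EOF, mirroring the condition in the quoted prototype.
 * Returns the number of bytes zeroed.
 */
size_t toy_zero_eof_tail(struct toy_folio *folio, unsigned long long i_size)
{
	unsigned long eof_index = (unsigned long)(i_size >> TOY_PAGE_SHIFT);
	unsigned long next_index = folio->index + folio->nr_pages;
	size_t size = folio->nr_pages * TOY_PAGE_SIZE;
	size_t from;

	assert(folio->data != NULL);

	/* Folios entirely below or entirely beyond EOF are left alone. */
	if (folio->index > eof_index || next_index <= eof_index)
		return 0;

	/* Byte offset of EOF inside the folio. */
	from = (size_t)(i_size -
			((unsigned long long)folio->index << TOY_PAGE_SHIFT));
	if (from == 0)
		return 0; /* same as the prototype's "if (from)" guard */

	memset(folio->data + from, 0, size - from);
	return size - from;
}
```

With a 4-page folio at index 0 and i_size = TOY_PAGE_SIZE + 100, the
sketch zeroes everything from byte 4196 to the end of the folio. A small
folio at index 2 with the same i_size is skipped, which is the gap
discussed above: an uptodate folio wholly beyond EOF is not covered by
the straddle check alone.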