Re: [PATCH v2 1/3] tmpfs: zero post-eof uptodate folios on swapout

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Brian Foster <bfoster@redhat.com>
To: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: linux-mm@kvack.org, Hugh Dickins <hughd@google.com>
Subject: Re: [PATCH v2 1/3] tmpfs: zero post-eof uptodate folios on swapout
Date: Wed, 19 Nov 2025 09:08:26 -0500	[thread overview]
Message-ID: <aR3PWnNFhOckoHvc@bfoster> (raw)
In-Reply-To: <9e696091-9c40-46cf-91b9-c95e1d38a1a8@linux.alibaba.com>

On Wed, Nov 19, 2025 at 11:53:41AM +0800, Baolin Wang wrote:
> 
> 
> On 2025/11/18 22:39, Brian Foster wrote:
> > On Tue, Nov 18, 2025 at 10:33:44AM +0800, Baolin Wang wrote:
> > > 
> > > 
> > > On 2025/11/13 00:25, Brian Foster wrote:
> > > > As a first step to facilitate efficient post-eof zeroing in tmpfs,
> > > > zero post-eof uptodate folios at swap out time. This ensures that
> > > > post-eof ranges are zeroed "on disk" (i.e. analogous to traditional
> > > > pagecache writeback) and facilitates zeroing on file size changes by
> > > > allowing it to not have to swap in.
> > > > 
> > > > Note that shmem_writeout() already zeroes !uptodate folios so this
> > > > introduces some duplicate logic. We'll clean this up in the next
> > > > patch.
> > > > 
> > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > ---
> > > >    mm/shmem.c | 19 +++++++++++++++++--
> > > >    1 file changed, 17 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > > index 0a25ee095b86..5fb3c911894f 100644
> > > > --- a/mm/shmem.c
> > > > +++ b/mm/shmem.c
> > > > @@ -1577,6 +1577,8 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
> > > >    	struct inode *inode = mapping->host;
> > > >    	struct shmem_inode_info *info = SHMEM_I(inode);
> > > >    	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
> > > > +	loff_t i_size = i_size_read(inode);
> > > > +	pgoff_t end_index = DIV_ROUND_UP(i_size, PAGE_SIZE);
> > > >    	pgoff_t index;
> > > >    	int nr_pages;
> > > >    	bool split = false;
> > > > @@ -1596,8 +1598,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
> > > >    	 * (unless fallocate has been used to preallocate beyond EOF).
> > > >    	 */
> > > >    	if (folio_test_large(folio)) {
> > > > -		index = shmem_fallocend(inode,
> > > > -			DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE));
> > > > +		index = shmem_fallocend(inode, end_index);
> > > >    		if ((index > folio->index && index < folio_next_index(folio)) ||
> > > >    		    !IS_ENABLED(CONFIG_THP_SWAP))
> > > >    			split = true;
> > > > @@ -1647,6 +1648,20 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
> > > >    		folio_mark_uptodate(folio);
> > > >    	}
> > > > +	/*
> > > > +	 * Ranges beyond EOF must be zeroed at writeout time. This mirrors
> > > > +	 * traditional writeback behavior and facilitates zeroing on file size
> > > > +	 * changes without having to swap back in.
> > > > +	 */
> > > > +	if (folio_next_index(folio) >= end_index) {
> > > > +		size_t from = offset_in_folio(folio, i_size);
> > > > +
> > > > +		if (index >= end_index) {
> > > > +			folio_zero_segment(folio, 0, folio_size(folio));
> > > > +		} else if (from)
> > > > +			folio_zero_segment(folio, from, folio_size(folio));
> > > > +	}
> > > 
> > > As I mentioned before[1], if a large folio is beyond EOF, it will be split
> > > in shmem_writeout(), and those small folios beyond EOF will be dropped and
> > > freed in __folio_split(). Of course, there's another special case as Hugh
> > > mentioned: when there's a 'fallocend' beyond i_size (e.g., fallocate()), it
> > > will keep the pages allocated beyond EOF after the split. However, your
> > > 'end_index' here does not consider 'fallocend,' so it seems to me that this
> > > portion of the code doesn't actually take effect.
> > > 
> > 
> > Hi Boalin,
> 
> s/Boalin/Baolin :)
> 

Sorry, Baolin! ;)

> > 
> > So I get that split post-eof folios can fall off depending on fallocate
> > status. I'm not sure what you mean by considering fallocend, however.
> > ISTM that fallocend contributes to the boundary where we decide to split
> > and/or preserve, but i_size is what is relevant for zeroing. It's not
> > clear to me if you're suggesting the logic is potentially spurious, or
> > this might not actually be zeroing correctly due to falloc interactions.
> > Can you clarify the concern please? Thanks.
> 
> Sorry for not being clear enough (for my quick response yesterday). After
> thinking more, I want to divide this into 3 cases to clearly explain the
> logic here:
> 

No worries. Thanks for breaking it down. Much easier to discuss this
way.

> 1. Without fallocate(), if a large folio is beyond EOF (i.e. i_size), it
> will be split in shmem_writeout(), and those small folios beyond EOF will be
> dropped and freed in __folio_split(). So, your changes should also have no
> impact, because after the split, ‘folio_next_index(folio)’ is definitely <=
> ‘end_index’. So the logic is correct.
> 

Ack, but we still want to zero any post-eof portion of a small folio
straddling i_size...

> 2. With fallocate(), If a large folio is beyond EOF (i.e. i_size) but
> smaller than 'fallocend', the folio will not be split. So, we should zero
> the post-EOF part. Because 'index' (now representing 'fallocend') is greater
> than 'end_index', you are zeroing the entire large folio, which does not
> seem correct to me.
> 

Unless CONFIG_THP_SWAP is disabled (no idea how likely that is, seems
like an arch thing), in which case it seems we always split (and then
the split path will still use fallocend to determine whether to toss or
preserve).

But otherwise yes, partial zeroing should be the same for the large
folio across EOF case, it just happens to be a larger folio size...

> if (index >= end_index) {
> 	folio_zero_segment(folio, 0, folio_size(folio));
> } else if ...
> 
> I think you should only zero the range from 'from' to 'folio_size(folio)' of
> this large folio in this case. Right?
> 

However index is folio->index here, not fallocend. index is reassigned a
bit further down the function just after the block try_split: lands in:

        index = folio->index;
        nr_pages = folio_nr_pages(folio);

This wasn't introduced by this patch, FWIW, but I suppose we could make
the fallocend block use a local for clarity.

So given that, the logic is effectively if the folio starts at or beyond
the first page size index beyond EOF, zero the whole thing. That seems
pretty straightforward to me, so I'm not clear on why we'd need to
consider whether the folio is large or not at this point.

> 3. With fallocate(), If a large folio is beyond EOF (i.e. i_size) and also
> beyond 'fallocend', the large folio will be split to small folios. If we
> continue attempting to write out these small folios beyond EOF, we need to
> zero the entire mall folios at this point. So, the logic looks correct
> (because 'index' > 'end_index').
> 

Ack..

> Based on the above analysis, I believe the logic should be:
> 
> if (folio_next_index(folio) >= end_index) {
> 	size_t from = offset_in_folio(folio, i_size);
> 
> 	if (!folio_test_large(folio) && index >= end_index)
> 		folio_zero_segment(folio, 0, folio_size(folio));
> 	else if (from)
> 		folio_zero_segment(folio, from, folio_size(folio));
> }
> 
> The logic here is a bit complex, please correct me if I misunderstood you.
> 

Hmm.. so I'm not really sure about the large folio check. Have you
reviewed the next patch, by chance? It occurs to me that I probably
split these two up wrongly. I probably should have split off the
existing !uptodate zeroing into a separate hunk in patch 1 (i.e. as a
non functional change, refactoring patch) and then introduce the
functional change in patch 2. I'll try that for v3.

But in the meantime, this is the logic after patch 2:

	/*
	 * Ranges beyond EOF must be zeroed at writeout time. This mirrors
	 * traditional writeback behavior and facilitates zeroing on file size
	 * changes without having to swap back in.
	 */
	if (!folio_test_uptodate(folio) ||
	    folio_next_index(folio) >= end_index) {
		size_t from = offset_in_folio(folio, i_size);

		if (!folio_test_uptodate(folio) || index >= end_index) {
			folio_zero_segment(folio, 0, folio_size(folio));
			flush_dcache_folio(folio);
			folio_mark_uptodate(folio);
		} else if (from)
			folio_zero_segment(folio, from, folio_size(folio));
	}

So factoring out the preexisting uptodate logic, this looks mostly
equivalent to what you posted above with the exception of the large
folio check. I could be wrong, but it kind of sounds like that is maybe
due to confusion over the index value.. hm?

In any event, I'm trying to make this logic as simple and clear as
possible. The idea here is basically that any folio being written out
that is either !uptodate or at least partially beyond EOF needs zeroing.

The split logic earlier in the function simply dictates what folios make
it to this point (to be written out vs. tossed) or not, and otherwise we
don't really have to care about state wrt zeroing post-eof folio ranges
(particularly if any of that splitting logic happens to change in the
future). Let me know if I'm missing something. Thanks again.

Brian

next prev parent reply	other threads:[~2025-11-19 14:08 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-12 16:25 [PATCH v2 0/3] tmpfs: zero post-eof ranges on file extension Brian Foster
2025-11-12 16:25 ` [PATCH v2 1/3] tmpfs: zero post-eof uptodate folios on swapout Brian Foster
2025-11-18  2:33   ` Baolin Wang
2025-11-18 14:39     ` Brian Foster
2025-11-19  3:53       ` Baolin Wang
2025-11-19 14:08         ` Brian Foster [this message]
2025-11-20  1:57           ` Baolin Wang
2025-11-20 14:12             ` Brian Foster
2025-11-12 16:25 ` [PATCH v2 2/3] tmpfs: combine !uptodate and post-eof zeroing logic at swapout Brian Foster
2025-11-20  2:56   ` Baolin Wang
2025-11-20 14:14     ` Brian Foster
2025-11-12 16:25 ` [PATCH v2 3/3] tmpfs: zero post-eof ranges on file extension Brian Foster
2025-11-20  5:57   ` Baolin Wang
2025-11-20 14:21     ` Brian Foster

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aR3PWnNFhOckoHvc@bfoster \
    --to=bfoster@redhat.com \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=hughd@google.com \
    --cc=linux-mm@kvack.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox