Re: [PATCH] tmpfs: zero post-eof folio range on file extension

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Brian Foster <bfoster@redhat.com>
To: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org, Baolin Wang <baolin.wang@linux.alibaba.com>,
	Matthew Wilcox <willy@infradead.org>,
	Usama Arif <usamaarif64@gmail.com>
Subject: Re: [PATCH] tmpfs: zero post-eof folio range on file extension
Date: Thu, 10 Jul 2025 08:36:03 -0400	[thread overview]
Message-ID: <aG-zs1DcFw2DMxls@bfoster> (raw)
In-Reply-To: <297e44e9-1b58-d7c4-192c-9408204ab1e3@google.com>

On Wed, Jul 09, 2025 at 12:57:35AM -0700, Hugh Dickins wrote:
> Thanks for suggestions, let me come back to comment on your original.
> 
> On Wed, 25 Jun 2025, Brian Foster wrote:
> >  
> 
> Although Matthew's alternative commit was insufficient and wrong
> (because use of truncation risks deleting folios already promised
> by fallocate), I found his brief commit message very very helpful
> in understanding your patch - it makes clear what POSIX promises,
> is a better justification than just passing an xfstest you wrote,
> and explains why you're right to target places where i_size changes:
> 
> POSIX requires that "If the file size is increased, the extended area
> shall appear as if it were zero-filled".  It is possible to use mmap to
> write past EOF and that data will become visible instead of zeroes.
> 
> Please add that paragraph here, right at the head of your commit message.
> 

Indeed, this is a better description. I'll incorporate it.

> > Most traditional filesystems zero the post-eof portion of the eof
> > folio at writeback time, or when the file size is extended by
> > truncate or extending writes. This ensures that the previously
> > post-eof range of the folio is zeroed before it is exposed to the
> > file.
> > 
> > tmpfs doesn't implement the writeback path the way a traditional
> > filesystem does, so zeroing behavior won't be exactly the same.
> > However, it can still perform explicit zeroing from the various
> > operations that extend a file and expose a post-eof portion of the
> > eof folio. The current lack of zeroing is observed via failure of
> > fstests test generic/363 on tmpfs. This test injects post-eof mapped
> > writes in certain situations to detect gaps in zeroing behavior.
> > 
> > Add a new eof zeroing helper for file extending operations. Look up
> > the current eof folio, and if one exists, zero the range about to be
> > exposed. This allows generic/363 to pass on tmpfs.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> > 
> > Hi all,
> > 
> > This survives the aforemented reproducer, an fstests regression run, and
> > ~100m fsx operations without issues. Let me know if there are any other
> > recommended tests for tmpfs and I'm happy to run them. Otherwise, a
> > couple notes as I'm not terribly familiar with tmpfs...
> > 
> > First, I used _get_partial_folio() because we really only want to zero
> > an eof folio if one has been previously allocated. My understanding is
> > that lookup path will avoid unnecessary folio allocation in such cases,
> > but let me know if that's wrong.
> > 
> > Also, it seems that the falloc path leaves newly preallocated folios
> > !uptodate until they are used. This had me wondering if perhaps
> > shmem_zero_eof() could just skip out if the eof folio happens to be
> > !uptodate. Hm?
> 
> Yes, you were right to think that it's better to skip the !uptodates,
> and Baolin made good suggestion there (though I'll unravel it a bit).
> 
> > 
> > Thoughts, reviews, flames appreciated.
> 
> Sorry, much as you'd appreciate a flame, I cannot oblige: I think you've
> done a very good job here (but it's not ready yet), and you've done it
> in such a way that huge=always passes generic/363 with no trouble.
> 

Thanks, I appreciate the feedback from you both..

> I did keep on wanting to change this and that of your patch below; but
> then later came around to seeing why your choices were better than what
> I was going to suggest.
> 
> I had difficulty getting deep enough into it, but think I'm there now.
> And have identified one missed aspect, which rather changes around what
> you should do - I'd have preferred to get into that at the end, but
> since it affects what shmem_zero_eof() should look like, I'd better
> talk about it here first.
> 
> The problem is with huge pages (or large folios) in shmem_writeout():
> what goes in as a large folio may there have to be split into small
> pages; or it may be swapped out as one large folio, but fragmentation
> at swapin time demand that it be split into small pages when swapped in.
> 
> So, if there has been swapout since the large folio was modified beyond
> EOF, the folio that shmem_zero_eof() brings in does not guarantee what
> length needs to be zeroed.
> 
> We could set that aside as a deficiency to be fixed later on: that
> would not be unreasonable, but I'm guessing that won't satisfy you.
> 

So I was reading through some of this code yesterday and playing around
with forcing swapout of an EOF folio, and I had the same observation as
noted in Baolin's followup: it looks like large folios are always split
across EOF, at which point post-eof folios are dropped because generally
the pagecache doesn't track post-eof folios.

A quick experiment to map write to a 2MB folio in a 1MB sized file shows
that the folio is indeed split on swapout. It looks like the large folio
is not immediately reconstructed on swapin, but rather something in the
background reconstitutes it such that once the post-eof range is
accessible again, it is effectively zeroed. I'm assuming this is not due
to explicit zeroing, but rather a side effect of those post-eof folios
being tossed (vs. swapped) and reallocated, but I could certainly be
wrong about that.

> We could zero the maximum (the remainder of PMD size I believe) in
> shmem_zero_eof(): looping over small folios within the range, skipping
> !uptodate ones (but we do force them uptodate when swapping out, in
> order to keep the space reservation).  TBH I've ignored that as a bad
> option, but it doesn't seem so bad to me now: ugly, but maybe not bad.
> 
> The solution I've had in mind (and pursue in comments below) is to do
> the EOF zeroing in shmem_writeout() before it splits; and then avoid
> swapin in shmem_zero_eof() when i_size is raised.
> 

Indeed, this is similar to traditional writeback behavior in that fully
post-eof folios are skipped (presumed to be racing with a truncate) and
a straddling EOF folio is partially zeroed at writeback time.

I actually observed this difference in behavior when first looking into
this issue on tmpfs, but I didn't have enough context to draw the
parallel to swapout, so thanks for bringing this up.

> That solution partly inspired by the shmem symlink uninit-value bug
> https://lore.kernel.org/linux-mm/670793eb.050a0220.8109b.0003.GAE@google.com/
> which I haven't rushed to fix, but ought to be fixed along with this one
> (by "along with" I don't mean that both have to be fixed in one single
> patch, but it makes sense to consider them together).  I was inclined
> not to zero the whole page in shmem_symlink(), but zero before swapout.
> 

I'll have to take a closer look at that one..

> It worries me that an EOF page might be swapped out and in a thousand
> times, but i_size set only once: I'm going for a solution which memsets
> a thousand times rather than once?  But if that actually shows up as an
> issue in any workload, then we can add a shmem inode flag to say whether
> the EOF folio has been exposed via mmap (or symlink) so may need zeroing.
> 
> What's your preference?  My comments below assume the latter solution,
> but that may be wrong.
> 

I actually think swapout time zeroing is a nice functional improvement
over this current patch. I'd be less inclined to think that frequent
swap cycles are more problematic than swapping folios in (and presumably
back out) just for partial zeroing due to file changes that don't even
necessarily need the associated data. This is also generally more
consistent with traditional fs/pagecache behavior, which IMO is a good
thing.

So I suppose what I'm saying is that I like the prospective approach to
also zero at shmem_writeout() and instead not swap in folios purely for
partial eof zeroing purposes, just perhaps not for the exact reasons
stated here. Of course the advantage could be that the code looks the
same regardless, so if that folio splitting behavior ever changed in the
future, the zeroing code would hopefully continue to do the right thing.
Thoughts?

> > 
> > Brian
> > 
> >  mm/shmem.c | 36 +++++++++++++++++++++++++++++++++++-
> >  1 file changed, 35 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 3a5a65b1f41a..4bb96c24fb9e 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -1077,6 +1077,29 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index)
> >  	return folio;
> >  }
> >  
> > +/*
> > + * Zero any post-EOF range of the EOF folio about to be exposed by size
> > + * extension.
> > + */
> > +static void shmem_zero_eof(struct inode *inode, loff_t pos)
> 
> Please change that "pos" to, perhaps, "newsize": I was initiailly quite
> confused by that argument to shmem_zero_eof(), until I grasped that
> oldsize (really, the important one here) is calculated inside.
> 

Hmm.. newsize isn't necessarily accurate either as this tends to point
to the start offset of the extending operation, not necessarily the new
inode size. That said, I'm not particular so I'll change it unless
somebody chimes in otherwise.

> > +{
> > +	struct folio *folio;
> > +	loff_t i_size = i_size_read(inode);
> > +	size_t from, len;
> > +
> > +	folio = shmem_get_partial_folio(inode, i_size >> PAGE_SHIFT);
> 
> I deceived myself into thinking shmem_get_folio(,,,,SGP_READ) was better,
> but no, that doesn't handle large folio spanning EOF in the way needed
> here.  It frustrates and depresses me that there's no appropriate SGP_
> for this.  It used to be the case that everything must go through
> shmem_getpage(), but then large folio truncation came to need that
> odd shmem_get_partial_folio() bypass.
> 
> Lacking a suitable SGP_, you did well to choose it for further use, but
> I now think you should go your own way, duplicating its filemap_get_entry()
> and !xa_is_value block (with folio_test_uptodate check added in - while
> folio locked I suppose, though I'm not certain that's necessary),
> without doing the shmem_get_folio(,,,,SGP_READ) to swapin.
> 

Ok, makes sense.

> > +	if (!folio)
> > +		return;
> > +
> > +	/* zero to the end of the folio or start of extending operation */
> > +	from = offset_in_folio(folio, i_size);
> 
> If from == 0, doesn't that mean we're looking at the *next* folio,
> and do not need to zero it at all?  A case that would have been ruled
> out by a simple alignment check before, in the old days of all-small
> pages; but nowadays we cannot tell without looking at the folio itself.
> 
> > +	len = min_t(loff_t, folio_size(folio) - from, pos - i_size);
> > +	folio_zero_range(folio, from, len);
> > +
> > +	folio_unlock(folio);
> > +	folio_put(folio);
> > +}
> > +
> >  /*
> >   * Remove range of pages and swap entries from page cache, and free them.
> >   * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
> > @@ -1302,6 +1325,8 @@ static int shmem_setattr(struct mnt_idmap *idmap,
> >  			return -EPERM;
> >  
> >  		if (newsize != oldsize) {
> > +			if (newsize > oldsize)
> > +				shmem_zero_eof(inode, newsize);
> 
> Nit, and feel free to disagree, but I'd slightly prefer you to do that
> a few lines later, after the error return, before the i_size_write()
> and update_mtime.
> 
> >  			error = shmem_reacct_size(SHMEM_I(inode)->flags,
> >  					oldsize, newsize);
> >  			if (error)
> > @@ -3464,6 +3489,8 @@ static ssize_t shmem_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >  	ret = file_update_time(file);
> >  	if (ret)
> >  		goto unlock;
> > +	if (iocb->ki_pos > i_size_read(inode))
> > +		shmem_zero_eof(inode, iocb->ki_pos);
> 
> Okay.  I was inclined to ask you to move it to where shmem_write_end()
> does its i_size_write() and zeroing to make uptodate, that might be a
> more natural locationi.  But came to realize that there are several
> reasons why your choice is easier, and any attempt to rearrange likely
> to be a waste of your time (not to mention folio deadlock).
> 
> >  	ret = generic_perform_write(iocb, from);
> >  unlock:
> >  	inode_unlock(inode);
> > @@ -3791,8 +3818,15 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
> >  		cond_resched();
> >  	}
> >  
> > -	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
> > +	/*
> > +	 * The post-eof portion of the eof folio isn't guaranteed to be zeroed
> > +	 * by fallocate, so zero through the end of the fallocated range
> > +	 * instead of the start.
> > +	 */
> 
> That comment continues to mystify me (the end instead of the start??):
> but there's an easy answer, just delete it, the code is easier to
> understand without it!
> 

Hah, I admit I often times read back my own comments and wonder what I
was thinking. ;) The purpose of this one was to try to explain why we
pass offset + len rather than the typical offset in this particular
case. I'll think about whether that can be stated more clearly, or
otherwise just drop the comment.

The rest of the feedback sounds reasonable to me at first glance, but
I'll take a closer look at those changes once I have the bigger picture
swapout thing worked out. Thanks again for the review and let me know if
any of this sounds misguided..

Brian

> > +	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) {
> > +		shmem_zero_eof(inode, offset + len);
> >  		i_size_write(inode, offset + len);
> > +	}
> >  undone:
> >  	spin_lock(&inode->i_lock);
> >  	inode->i_private = NULL;
> > -- 
> > 2.49.0
> 
> I don't have a proposal for how the code in shmem_writeout() should look,
> and maybe you won't agree that's where the change should be made.
> 
> Thanks,
> Hugh
>

next prev parent reply	other threads:[~2025-07-10 12:32 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-25 18:49 Brian Foster
2025-06-25 19:21 ` Matthew Wilcox
2025-06-26  5:35   ` Hugh Dickins
2025-06-26 12:55     ` Brian Foster
2025-06-27  3:21       ` Baolin Wang
2025-06-27 11:54         ` Brian Foster
2025-07-09  7:57 ` Hugh Dickins
2025-07-10  6:47   ` Baolin Wang
2025-07-10 22:20     ` Hugh Dickins
2025-07-11  3:50       ` Baolin Wang
2025-07-11  7:50         ` Hugh Dickins
2025-07-11  8:42           ` Baolin Wang
2025-07-11 16:08         ` Brian Foster
2025-07-11 20:15           ` Brian Foster
2025-07-14  3:05             ` Baolin Wang
2025-07-14 14:38               ` Brian Foster
2025-07-10 12:36   ` Brian Foster [this message]
2025-07-10 23:02     ` Hugh Dickins

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aG-zs1DcFw2DMxls@bfoster \
    --to=bfoster@redhat.com \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=hughd@google.com \
    --cc=linux-mm@kvack.org \
    --cc=usamaarif64@gmail.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox