linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/3] tmpfs: zero post-eof ranges on file extension
@ 2025-11-12 16:25 Brian Foster
  2025-11-12 16:25 ` [PATCH v2 1/3] tmpfs: zero post-eof uptodate folios on swapout Brian Foster
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Brian Foster @ 2025-11-12 16:25 UTC (permalink / raw)
  To: linux-mm; +Cc: Hugh Dickins, Baolin Wang

Hi all,

First and foremost, I'm posting this v2 with a bit of a caveat. This had
survived most of my testing until I rebased onto -next and came up with
a bit more involved of an fsx test that concurrently stresses swap
activity. I root caused a couple or so issues that fell out of that
which appeared to be preexisting problems, then ultimately ran into the
folio_next_pos() type issue that exists in for-next (this causes wrong
same_folio values and subsequently wonky behavior).

So with that I need to reset my thinking and revisit that particular
test, but given that I hadn't thus far changed anything in these
patches, I wanted to get a v2 of the core work out on the list. Note
that this is based on -next and I'm testing locally with the
folio_next_pos() changes reverted (including the call added in patch 3).

With that out of the way, I think this addresses the open issues from
the discussion on v1. The main difference in v2 is that rather than just
zero the post-eof portion of the EOF folio, we now zero the full range
between the current and new i_size to accommodate situations where large
folios may split across writeout and swap in.

Patch 1 introduces post-eof zeroing at writeout time so the zeroing call
can skip swap entries. Patch 2 is an incremental cleanup (I found it
easier to split the change this way, but these two patches could be
squashed). Patch 3 introduces the zeroing behavior.

Thoughts, reviews, flames appreciated.

Brian

v2:
- Rework to zero uptodate post-eof folios on writeout and full range
  from EOF on size extension.
- Misc. cleanups: variable renames (pos -> end), code relocation,
  comment cleanups/removal.
- Update commit log to call out POSIX requirement.
v1: https://lore.kernel.org/linux-mm/20250625184930.269727-1-bfoster@redhat.com/

Brian Foster (3):
  tmpfs: zero post-eof uptodate folios on swapout
  tmpfs: combine !uptodate and post-eof zeroing logic at swapout
  tmpfs: zero post-eof ranges on file extension

 mm/shmem.c | 135 ++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 113 insertions(+), 22 deletions(-)

-- 
2.51.1



^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 1/3] tmpfs: zero post-eof uptodate folios on swapout
  2025-11-12 16:25 [PATCH v2 0/3] tmpfs: zero post-eof ranges on file extension Brian Foster
@ 2025-11-12 16:25 ` Brian Foster
  2025-11-18  2:33   ` Baolin Wang
  2025-11-12 16:25 ` [PATCH v2 2/3] tmpfs: combine !uptodate and post-eof zeroing logic at swapout Brian Foster
  2025-11-12 16:25 ` [PATCH v2 3/3] tmpfs: zero post-eof ranges on file extension Brian Foster
  2 siblings, 1 reply; 14+ messages in thread
From: Brian Foster @ 2025-11-12 16:25 UTC (permalink / raw)
  To: linux-mm; +Cc: Hugh Dickins, Baolin Wang

As a first step to facilitate efficient post-eof zeroing in tmpfs,
zero post-eof uptodate folios at swap out time. This ensures that
post-eof ranges are zeroed "on disk" (i.e. analogous to traditional
pagecache writeback) and facilitates zeroing on file size changes by
allowing it to not have to swap in.

Note that shmem_writeout() already zeroes !uptodate folios so this
introduces some duplicate logic. We'll clean this up in the next
patch.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 mm/shmem.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 0a25ee095b86..5fb3c911894f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1577,6 +1577,8 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
 	struct inode *inode = mapping->host;
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+	loff_t i_size = i_size_read(inode);
+	pgoff_t end_index = DIV_ROUND_UP(i_size, PAGE_SIZE);
 	pgoff_t index;
 	int nr_pages;
 	bool split = false;
@@ -1596,8 +1598,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
 	 * (unless fallocate has been used to preallocate beyond EOF).
 	 */
 	if (folio_test_large(folio)) {
-		index = shmem_fallocend(inode,
-			DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE));
+		index = shmem_fallocend(inode, end_index);
 		if ((index > folio->index && index < folio_next_index(folio)) ||
 		    !IS_ENABLED(CONFIG_THP_SWAP))
 			split = true;
@@ -1647,6 +1648,20 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
 		folio_mark_uptodate(folio);
 	}
 
+	/*
+	 * Ranges beyond EOF must be zeroed at writeout time. This mirrors
+	 * traditional writeback behavior and facilitates zeroing on file size
+	 * changes without having to swap back in.
+	 */
+	if (folio_next_index(folio) >= end_index) {
+		size_t from = offset_in_folio(folio, i_size);
+
+		if (index >= end_index) {
+			folio_zero_segment(folio, 0, folio_size(folio));
+		} else if (from)
+			folio_zero_segment(folio, from, folio_size(folio));
+	}
+
 	if (!folio_alloc_swap(folio)) {
 		bool first_swapped = shmem_recalc_inode(inode, 0, nr_pages);
 		int error;
-- 
2.51.1



^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 2/3] tmpfs: combine !uptodate and post-eof zeroing logic at swapout
  2025-11-12 16:25 [PATCH v2 0/3] tmpfs: zero post-eof ranges on file extension Brian Foster
  2025-11-12 16:25 ` [PATCH v2 1/3] tmpfs: zero post-eof uptodate folios on swapout Brian Foster
@ 2025-11-12 16:25 ` Brian Foster
  2025-11-20  2:56   ` Baolin Wang
  2025-11-12 16:25 ` [PATCH v2 3/3] tmpfs: zero post-eof ranges on file extension Brian Foster
  2 siblings, 1 reply; 14+ messages in thread
From: Brian Foster @ 2025-11-12 16:25 UTC (permalink / raw)
  To: linux-mm; +Cc: Hugh Dickins, Baolin Wang

shmem_writeout() zeroes folios that are !uptodate (before marking
them uptodate) or that extend beyond EOF to preserve data integrity
according to POSIX. This is handled in a couple different blocks.
Fold the !uptodate zeroing into the post-eof block so we zero from
one place.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 mm/shmem.c | 40 +++++++++++++++++++---------------------
 1 file changed, 19 insertions(+), 21 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 5fb3c911894f..7925ced8a05d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1627,25 +1627,20 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
 	 * good idea to continue anyway, once we're pushing into swap.  So
 	 * reactivate the folio, and let shmem_fallocate() quit when too many.
 	 */
-	if (!folio_test_uptodate(folio)) {
-		if (inode->i_private) {
-			struct shmem_falloc *shmem_falloc;
-			spin_lock(&inode->i_lock);
-			shmem_falloc = inode->i_private;
-			if (shmem_falloc &&
-			    !shmem_falloc->waitq &&
-			    index >= shmem_falloc->start &&
-			    index < shmem_falloc->next)
-				shmem_falloc->nr_unswapped += nr_pages;
-			else
-				shmem_falloc = NULL;
-			spin_unlock(&inode->i_lock);
-			if (shmem_falloc)
-				goto redirty;
-		}
-		folio_zero_range(folio, 0, folio_size(folio));
-		flush_dcache_folio(folio);
-		folio_mark_uptodate(folio);
+	if (!folio_test_uptodate(folio) && inode->i_private) {
+		struct shmem_falloc *shmem_falloc;
+		spin_lock(&inode->i_lock);
+		shmem_falloc = inode->i_private;
+		if (shmem_falloc &&
+		    !shmem_falloc->waitq &&
+		    index >= shmem_falloc->start &&
+		    index < shmem_falloc->next)
+			shmem_falloc->nr_unswapped += nr_pages;
+		else
+			shmem_falloc = NULL;
+		spin_unlock(&inode->i_lock);
+		if (shmem_falloc)
+			goto redirty;
 	}
 
 	/*
@@ -1653,11 +1648,14 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
 	 * traditional writeback behavior and facilitates zeroing on file size
 	 * changes without having to swap back in.
 	 */
-	if (folio_next_index(folio) >= end_index) {
+	if (!folio_test_uptodate(folio) ||
+	    folio_next_index(folio) >= end_index) {
 		size_t from = offset_in_folio(folio, i_size);
 
-		if (index >= end_index) {
+		if (!folio_test_uptodate(folio) || index >= end_index) {
 			folio_zero_segment(folio, 0, folio_size(folio));
+			flush_dcache_folio(folio);
+			folio_mark_uptodate(folio);
 		} else if (from)
 			folio_zero_segment(folio, from, folio_size(folio));
 	}
-- 
2.51.1



^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 3/3] tmpfs: zero post-eof ranges on file extension
  2025-11-12 16:25 [PATCH v2 0/3] tmpfs: zero post-eof ranges on file extension Brian Foster
  2025-11-12 16:25 ` [PATCH v2 1/3] tmpfs: zero post-eof uptodate folios on swapout Brian Foster
  2025-11-12 16:25 ` [PATCH v2 2/3] tmpfs: combine !uptodate and post-eof zeroing logic at swapout Brian Foster
@ 2025-11-12 16:25 ` Brian Foster
  2025-11-20  5:57   ` Baolin Wang
  2 siblings, 1 reply; 14+ messages in thread
From: Brian Foster @ 2025-11-12 16:25 UTC (permalink / raw)
  To: linux-mm; +Cc: Hugh Dickins, Baolin Wang

POSIX requires that "If the file size is increased, the extended area
shall appear as if it were zero-filled". It is possible to use mmap to
write past EOF and that data will become visible instead of zeroes. This
behavior is reproduced by fstests test generic/363.

Most traditional filesystems zero any post-eof portion of a folio at
writeback time or when the file size is extended by truncate or
extending writes. This ensures that the previously post-eof range of the
folio is zeroed before it is exposed to the file.

The tmpfs writeout path has been updated to zero post-eof folio
ranges similar to traditional writeback. This ensures post-eof
ranges are zeroed "on disk" and allows size extension zeroing to
skip over swap entries as they are already appropriately zeroed.

To that end, introduce a new zeroing helper for proper zeroing on
file extending operations. This looks up resident folios between the
original and new eof and for those that are uptodate, zeroes them
before the associated ranges are exposed to the file. This preserves
POSIX semantics and allows generic/363 to pass on tmpfs.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 mm/shmem.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 79 insertions(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 7925ced8a05d..a4aceb474377 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1101,6 +1101,78 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index)
 	return folio;
 }
 
+/*
+ * Zero a post-EOF range about to be exposed by size extension. Zero from the
+ * current i_size through lend, the latter of which typically refers to the
+ * start offset of an extending operation. Skip swap entries because associated
+ * folios were zeroed at swapout time.
+ */
+static void shmem_zero_eof(struct inode *inode, loff_t lend)
+{
+	struct address_space *mapping = inode->i_mapping;
+	loff_t lstart = i_size_read(inode);
+	pgoff_t index = (lstart + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	pgoff_t end = lend >> PAGE_SHIFT;
+	struct folio_batch fbatch;
+	struct folio *folio;
+	int i;
+	bool same_folio = (lstart >> PAGE_SHIFT) == (lend >> PAGE_SHIFT);
+
+	folio = filemap_lock_folio(mapping, lstart >> PAGE_SHIFT);
+	if (!IS_ERR(folio)) {
+		same_folio = lend < folio_next_pos(folio);
+		index = folio_next_index(folio);
+
+		if (folio_test_uptodate(folio)) {
+			size_t from = offset_in_folio(folio, lstart);
+			size_t len = min_t(loff_t, folio_size(folio) - from,
+					   lend - lstart);
+
+			folio_zero_range(folio, from, len);
+		}
+
+		folio_unlock(folio);
+		folio_put(folio);
+	}
+
+	if (!same_folio) {
+		folio = filemap_lock_folio(mapping, lend >> PAGE_SHIFT);
+		if (!IS_ERR(folio)) {
+			end = folio->index;
+
+			if (folio_test_uptodate(folio)) {
+				size_t len = lend - folio_pos(folio);
+				folio_zero_range(folio, 0, len);
+			}
+
+			folio_unlock(folio);
+			folio_put(folio);
+		}
+	}
+
+	/*
+	 * Zero uptodate folios fully within the target range. Uptodate folios
+	 * beyond EOF are generally unexpected, but can exist if a larger
+	 * falloc'd and uptodate EOF folio is split.
+	 */
+	folio_batch_init(&fbatch);
+	while (index < end) {
+		if (!filemap_get_folios(mapping, &index, end - 1, &fbatch))
+			break;
+		for (i = 0; i < folio_batch_count(&fbatch); i++) {
+			folio = fbatch.folios[i];
+
+			folio_lock(folio);
+			if (folio_test_uptodate(folio) &&
+			    folio->mapping == mapping) {
+				folio_zero_segment(folio, 0, folio_size(folio));
+			}
+			folio_unlock(folio);
+		}
+		folio_batch_release(&fbatch);
+	}
+}
+
 /*
  * Remove range of pages and swap entries from page cache, and free them.
  * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
@@ -1331,6 +1403,8 @@ static int shmem_setattr(struct mnt_idmap *idmap,
 					oldsize, newsize);
 			if (error)
 				return error;
+			if (newsize > oldsize)
+				shmem_zero_eof(inode, newsize);
 			i_size_write(inode, newsize);
 			update_mtime = true;
 		} else {
@@ -3512,6 +3586,8 @@ static ssize_t shmem_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	ret = file_update_time(file);
 	if (ret)
 		goto unlock;
+	if (iocb->ki_pos > i_size_read(inode))
+		shmem_zero_eof(inode, iocb->ki_pos);
 	ret = generic_perform_write(iocb, from);
 unlock:
 	inode_unlock(inode);
@@ -3844,8 +3920,10 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 		cond_resched();
 	}
 
-	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
+	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) {
+		shmem_zero_eof(inode, offset + len);
 		i_size_write(inode, offset + len);
+	}
 undone:
 	spin_lock(&inode->i_lock);
 	inode->i_private = NULL;
-- 
2.51.1



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/3] tmpfs: zero post-eof uptodate folios on swapout
  2025-11-12 16:25 ` [PATCH v2 1/3] tmpfs: zero post-eof uptodate folios on swapout Brian Foster
@ 2025-11-18  2:33   ` Baolin Wang
  2025-11-18 14:39     ` Brian Foster
  0 siblings, 1 reply; 14+ messages in thread
From: Baolin Wang @ 2025-11-18  2:33 UTC (permalink / raw)
  To: Brian Foster, linux-mm; +Cc: Hugh Dickins



On 2025/11/13 00:25, Brian Foster wrote:
> As a first step to facilitate efficient post-eof zeroing in tmpfs,
> zero post-eof uptodate folios at swap out time. This ensures that
> post-eof ranges are zeroed "on disk" (i.e. analogous to traditional
> pagecache writeback) and facilitates zeroing on file size changes by
> allowing it to not have to swap in.
> 
> Note that shmem_writeout() already zeroes !uptodate folios so this
> introduces some duplicate logic. We'll clean this up in the next
> patch.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>   mm/shmem.c | 19 +++++++++++++++++--
>   1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 0a25ee095b86..5fb3c911894f 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1577,6 +1577,8 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
>   	struct inode *inode = mapping->host;
>   	struct shmem_inode_info *info = SHMEM_I(inode);
>   	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
> +	loff_t i_size = i_size_read(inode);
> +	pgoff_t end_index = DIV_ROUND_UP(i_size, PAGE_SIZE);
>   	pgoff_t index;
>   	int nr_pages;
>   	bool split = false;
> @@ -1596,8 +1598,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
>   	 * (unless fallocate has been used to preallocate beyond EOF).
>   	 */
>   	if (folio_test_large(folio)) {
> -		index = shmem_fallocend(inode,
> -			DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE));
> +		index = shmem_fallocend(inode, end_index);
>   		if ((index > folio->index && index < folio_next_index(folio)) ||
>   		    !IS_ENABLED(CONFIG_THP_SWAP))
>   			split = true;
> @@ -1647,6 +1648,20 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
>   		folio_mark_uptodate(folio);
>   	}
>   
> +	/*
> +	 * Ranges beyond EOF must be zeroed at writeout time. This mirrors
> +	 * traditional writeback behavior and facilitates zeroing on file size
> +	 * changes without having to swap back in.
> +	 */
> +	if (folio_next_index(folio) >= end_index) {
> +		size_t from = offset_in_folio(folio, i_size);
> +
> +		if (index >= end_index) {
> +			folio_zero_segment(folio, 0, folio_size(folio));
> +		} else if (from)
> +			folio_zero_segment(folio, from, folio_size(folio));
> +	}

As I mentioned before[1], if a large folio is beyond EOF, it will be 
split in shmem_writeout(), and those small folios beyond EOF will be 
dropped and freed in __folio_split(). Of course, there's another special 
case as Hugh mentioned: when there's a 'fallocend' beyond i_size (e.g., 
fallocate()), it will keep the pages allocated beyond EOF after the 
split. However, your 'end_index' here does not consider 'fallocend,' so 
it seems to me that this portion of the code doesn't actually take effect.

[1] 
https://lore.kernel.org/linux-mm/097c0b07-1f43-51c3-3591-aaa2015226c2@google.com/


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/3] tmpfs: zero post-eof uptodate folios on swapout
  2025-11-18  2:33   ` Baolin Wang
@ 2025-11-18 14:39     ` Brian Foster
  2025-11-19  3:53       ` Baolin Wang
  0 siblings, 1 reply; 14+ messages in thread
From: Brian Foster @ 2025-11-18 14:39 UTC (permalink / raw)
  To: Baolin Wang; +Cc: linux-mm, Hugh Dickins

On Tue, Nov 18, 2025 at 10:33:44AM +0800, Baolin Wang wrote:
> 
> 
> On 2025/11/13 00:25, Brian Foster wrote:
> > As a first step to facilitate efficient post-eof zeroing in tmpfs,
> > zero post-eof uptodate folios at swap out time. This ensures that
> > post-eof ranges are zeroed "on disk" (i.e. analogous to traditional
> > pagecache writeback) and facilitates zeroing on file size changes by
> > allowing it to not have to swap in.
> > 
> > Note that shmem_writeout() already zeroes !uptodate folios so this
> > introduces some duplicate logic. We'll clean this up in the next
> > patch.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >   mm/shmem.c | 19 +++++++++++++++++--
> >   1 file changed, 17 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 0a25ee095b86..5fb3c911894f 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -1577,6 +1577,8 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
> >   	struct inode *inode = mapping->host;
> >   	struct shmem_inode_info *info = SHMEM_I(inode);
> >   	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
> > +	loff_t i_size = i_size_read(inode);
> > +	pgoff_t end_index = DIV_ROUND_UP(i_size, PAGE_SIZE);
> >   	pgoff_t index;
> >   	int nr_pages;
> >   	bool split = false;
> > @@ -1596,8 +1598,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
> >   	 * (unless fallocate has been used to preallocate beyond EOF).
> >   	 */
> >   	if (folio_test_large(folio)) {
> > -		index = shmem_fallocend(inode,
> > -			DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE));
> > +		index = shmem_fallocend(inode, end_index);
> >   		if ((index > folio->index && index < folio_next_index(folio)) ||
> >   		    !IS_ENABLED(CONFIG_THP_SWAP))
> >   			split = true;
> > @@ -1647,6 +1648,20 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
> >   		folio_mark_uptodate(folio);
> >   	}
> > +	/*
> > +	 * Ranges beyond EOF must be zeroed at writeout time. This mirrors
> > +	 * traditional writeback behavior and facilitates zeroing on file size
> > +	 * changes without having to swap back in.
> > +	 */
> > +	if (folio_next_index(folio) >= end_index) {
> > +		size_t from = offset_in_folio(folio, i_size);
> > +
> > +		if (index >= end_index) {
> > +			folio_zero_segment(folio, 0, folio_size(folio));
> > +		} else if (from)
> > +			folio_zero_segment(folio, from, folio_size(folio));
> > +	}
> 
> As I mentioned before[1], if a large folio is beyond EOF, it will be split
> in shmem_writeout(), and those small folios beyond EOF will be dropped and
> freed in __folio_split(). Of course, there's another special case as Hugh
> mentioned: when there's a 'fallocend' beyond i_size (e.g., fallocate()), it
> will keep the pages allocated beyond EOF after the split. However, your
> 'end_index' here does not consider 'fallocend,' so it seems to me that this
> portion of the code doesn't actually take effect.
> 

Hi Boalin,

So I get that split post-eof folios can fall off depending on fallocate
status. I'm not sure what you mean by considering fallocend, however.
ISTM that fallocend contributes to the boundary where we decide to split
and/or preserve, but i_size is what is relevant for zeroing. It's not
clear to me if you're suggesting the logic is potentially spurious, or
this might not actually be zeroing correctly due to falloc interactions.
Can you clarify the concern please? Thanks.

Brian

> [1] https://lore.kernel.org/linux-mm/097c0b07-1f43-51c3-3591-aaa2015226c2@google.com/
> 



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/3] tmpfs: zero post-eof uptodate folios on swapout
  2025-11-18 14:39     ` Brian Foster
@ 2025-11-19  3:53       ` Baolin Wang
  2025-11-19 14:08         ` Brian Foster
  0 siblings, 1 reply; 14+ messages in thread
From: Baolin Wang @ 2025-11-19  3:53 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-mm, Hugh Dickins



On 2025/11/18 22:39, Brian Foster wrote:
> On Tue, Nov 18, 2025 at 10:33:44AM +0800, Baolin Wang wrote:
>>
>>
>> On 2025/11/13 00:25, Brian Foster wrote:
>>> As a first step to facilitate efficient post-eof zeroing in tmpfs,
>>> zero post-eof uptodate folios at swap out time. This ensures that
>>> post-eof ranges are zeroed "on disk" (i.e. analogous to traditional
>>> pagecache writeback) and facilitates zeroing on file size changes by
>>> allowing it to not have to swap in.
>>>
>>> Note that shmem_writeout() already zeroes !uptodate folios so this
>>> introduces some duplicate logic. We'll clean this up in the next
>>> patch.
>>>
>>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>>> ---
>>>    mm/shmem.c | 19 +++++++++++++++++--
>>>    1 file changed, 17 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index 0a25ee095b86..5fb3c911894f 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -1577,6 +1577,8 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
>>>    	struct inode *inode = mapping->host;
>>>    	struct shmem_inode_info *info = SHMEM_I(inode);
>>>    	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
>>> +	loff_t i_size = i_size_read(inode);
>>> +	pgoff_t end_index = DIV_ROUND_UP(i_size, PAGE_SIZE);
>>>    	pgoff_t index;
>>>    	int nr_pages;
>>>    	bool split = false;
>>> @@ -1596,8 +1598,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
>>>    	 * (unless fallocate has been used to preallocate beyond EOF).
>>>    	 */
>>>    	if (folio_test_large(folio)) {
>>> -		index = shmem_fallocend(inode,
>>> -			DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE));
>>> +		index = shmem_fallocend(inode, end_index);
>>>    		if ((index > folio->index && index < folio_next_index(folio)) ||
>>>    		    !IS_ENABLED(CONFIG_THP_SWAP))
>>>    			split = true;
>>> @@ -1647,6 +1648,20 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
>>>    		folio_mark_uptodate(folio);
>>>    	}
>>> +	/*
>>> +	 * Ranges beyond EOF must be zeroed at writeout time. This mirrors
>>> +	 * traditional writeback behavior and facilitates zeroing on file size
>>> +	 * changes without having to swap back in.
>>> +	 */
>>> +	if (folio_next_index(folio) >= end_index) {
>>> +		size_t from = offset_in_folio(folio, i_size);
>>> +
>>> +		if (index >= end_index) {
>>> +			folio_zero_segment(folio, 0, folio_size(folio));
>>> +		} else if (from)
>>> +			folio_zero_segment(folio, from, folio_size(folio));
>>> +	}
>>
>> As I mentioned before[1], if a large folio is beyond EOF, it will be split
>> in shmem_writeout(), and those small folios beyond EOF will be dropped and
>> freed in __folio_split(). Of course, there's another special case as Hugh
>> mentioned: when there's a 'fallocend' beyond i_size (e.g., fallocate()), it
>> will keep the pages allocated beyond EOF after the split. However, your
>> 'end_index' here does not consider 'fallocend,' so it seems to me that this
>> portion of the code doesn't actually take effect.
>>
> 
> Hi Boalin,

s/Boalin/Baolin :)

> 
> So I get that split post-eof folios can fall off depending on fallocate
> status. I'm not sure what you mean by considering fallocend, however.
> ISTM that fallocend contributes to the boundary where we decide to split
> and/or preserve, but i_size is what is relevant for zeroing. It's not
> clear to me if you're suggesting the logic is potentially spurious, or
> this might not actually be zeroing correctly due to falloc interactions.
> Can you clarify the concern please? Thanks.

Sorry for not being clear enough (for my quick response yesterday). 
After thinking more, I want to divide this into 3 cases to clearly 
explain the logic here:

1. Without fallocate(), if a large folio is beyond EOF (i.e. i_size), it 
will be split in shmem_writeout(), and those small folios beyond EOF 
will be dropped and freed in __folio_split(). So, your changes should 
also have no impact, because after the split, ‘folio_next_index(folio)’ 
is definitely <= ‘end_index’. So the logic is correct.

2. With fallocate(), If a large folio is beyond EOF (i.e. i_size) but 
smaller than 'fallocend', the folio will not be split. So, we should 
zero the post-EOF part. Because 'index' (now representing 'fallocend') 
is greater than 'end_index', you are zeroing the entire large folio, 
which does not seem correct to me.

if (index >= end_index) {
	folio_zero_segment(folio, 0, folio_size(folio));
} else if ...

I think you should only zero the range from 'from' to 
'folio_size(folio)' of this large folio in this case. Right?

3. With fallocate(), If a large folio is beyond EOF (i.e. i_size) and 
also beyond 'fallocend', the large folio will be split to small folios. 
If we continue attempting to write out these small folios beyond EOF, we 
need to zero the entire mall folios at this point. So, the logic looks 
correct (because 'index' > 'end_index').

Based on the above analysis, I believe the logic should be:

if (folio_next_index(folio) >= end_index) {
	size_t from = offset_in_folio(folio, i_size);

	if (!folio_test_large(folio) && index >= end_index)
		folio_zero_segment(folio, 0, folio_size(folio));
	else if (from)
		folio_zero_segment(folio, from, folio_size(folio));
}

The logic here is a bit complex, please correct me if I misunderstood you.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/3] tmpfs: zero post-eof uptodate folios on swapout
  2025-11-19  3:53       ` Baolin Wang
@ 2025-11-19 14:08         ` Brian Foster
  2025-11-20  1:57           ` Baolin Wang
  0 siblings, 1 reply; 14+ messages in thread
From: Brian Foster @ 2025-11-19 14:08 UTC (permalink / raw)
  To: Baolin Wang; +Cc: linux-mm, Hugh Dickins

On Wed, Nov 19, 2025 at 11:53:41AM +0800, Baolin Wang wrote:
> 
> 
> On 2025/11/18 22:39, Brian Foster wrote:
> > On Tue, Nov 18, 2025 at 10:33:44AM +0800, Baolin Wang wrote:
> > > 
> > > 
> > > On 2025/11/13 00:25, Brian Foster wrote:
> > > > As a first step to facilitate efficient post-eof zeroing in tmpfs,
> > > > zero post-eof uptodate folios at swap out time. This ensures that
> > > > post-eof ranges are zeroed "on disk" (i.e. analogous to traditional
> > > > pagecache writeback) and facilitates zeroing on file size changes by
> > > > allowing it to not have to swap in.
> > > > 
> > > > Note that shmem_writeout() already zeroes !uptodate folios so this
> > > > introduces some duplicate logic. We'll clean this up in the next
> > > > patch.
> > > > 
> > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > ---
> > > >    mm/shmem.c | 19 +++++++++++++++++--
> > > >    1 file changed, 17 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > > index 0a25ee095b86..5fb3c911894f 100644
> > > > --- a/mm/shmem.c
> > > > +++ b/mm/shmem.c
> > > > @@ -1577,6 +1577,8 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
> > > >    	struct inode *inode = mapping->host;
> > > >    	struct shmem_inode_info *info = SHMEM_I(inode);
> > > >    	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
> > > > +	loff_t i_size = i_size_read(inode);
> > > > +	pgoff_t end_index = DIV_ROUND_UP(i_size, PAGE_SIZE);
> > > >    	pgoff_t index;
> > > >    	int nr_pages;
> > > >    	bool split = false;
> > > > @@ -1596,8 +1598,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
> > > >    	 * (unless fallocate has been used to preallocate beyond EOF).
> > > >    	 */
> > > >    	if (folio_test_large(folio)) {
> > > > -		index = shmem_fallocend(inode,
> > > > -			DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE));
> > > > +		index = shmem_fallocend(inode, end_index);
> > > >    		if ((index > folio->index && index < folio_next_index(folio)) ||
> > > >    		    !IS_ENABLED(CONFIG_THP_SWAP))
> > > >    			split = true;
> > > > @@ -1647,6 +1648,20 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
> > > >    		folio_mark_uptodate(folio);
> > > >    	}
> > > > +	/*
> > > > +	 * Ranges beyond EOF must be zeroed at writeout time. This mirrors
> > > > +	 * traditional writeback behavior and facilitates zeroing on file size
> > > > +	 * changes without having to swap back in.
> > > > +	 */
> > > > +	if (folio_next_index(folio) >= end_index) {
> > > > +		size_t from = offset_in_folio(folio, i_size);
> > > > +
> > > > +		if (index >= end_index) {
> > > > +			folio_zero_segment(folio, 0, folio_size(folio));
> > > > +		} else if (from)
> > > > +			folio_zero_segment(folio, from, folio_size(folio));
> > > > +	}
> > > 
> > > As I mentioned before[1], if a large folio is beyond EOF, it will be split
> > > in shmem_writeout(), and those small folios beyond EOF will be dropped and
> > > freed in __folio_split(). Of course, there's another special case as Hugh
> > > mentioned: when there's a 'fallocend' beyond i_size (e.g., fallocate()), it
> > > will keep the pages allocated beyond EOF after the split. However, your
> > > 'end_index' here does not consider 'fallocend,' so it seems to me that this
> > > portion of the code doesn't actually take effect.
> > > 
> > 
> > Hi Boalin,
> 
> s/Boalin/Baolin :)
> 

Sorry, Baolin! ;)

> > 
> > So I get that split post-eof folios can fall off depending on fallocate
> > status. I'm not sure what you mean by considering fallocend, however.
> > ISTM that fallocend contributes to the boundary where we decide to split
> > and/or preserve, but i_size is what is relevant for zeroing. It's not
> > clear to me if you're suggesting the logic is potentially spurious, or
> > this might not actually be zeroing correctly due to falloc interactions.
> > Can you clarify the concern please? Thanks.
> 
> Sorry for not being clear enough (for my quick response yesterday). After
> thinking more, I want to divide this into 3 cases to clearly explain the
> logic here:
> 

No worries. Thanks for breaking it down. Much easier to discuss this
way.

> 1. Without fallocate(), if a large folio is beyond EOF (i.e. i_size), it
> will be split in shmem_writeout(), and those small folios beyond EOF will be
> dropped and freed in __folio_split(). So, your changes should also have no
> impact, because after the split, ‘folio_next_index(folio)’ is definitely <=
> ‘end_index’. So the logic is correct.
> 

Ack, but we still want to zero any post-eof portion of a small folio
straddling i_size...

> 2. With fallocate(), If a large folio is beyond EOF (i.e. i_size) but
> smaller than 'fallocend', the folio will not be split. So, we should zero
> the post-EOF part. Because 'index' (now representing 'fallocend') is greater
> than 'end_index', you are zeroing the entire large folio, which does not
> seem correct to me.
> 

Unless CONFIG_THP_SWAP is disabled (no idea how likely that is, seems
like an arch thing), in which case it seems we always split (and then
the split path will still use fallocend to determine whether to toss or
preserve).

But otherwise yes, partial zeroing should be the same for the large
folio across EOF case, it just happens to be a larger folio size...

> if (index >= end_index) {
> 	folio_zero_segment(folio, 0, folio_size(folio));
> } else if ...
> 
> I think you should only zero the range from 'from' to 'folio_size(folio)' of
> this large folio in this case. Right?
> 

However index is folio->index here, not fallocend. index is reassigned a
bit further down the function just after the block try_split: lands in:

        index = folio->index;
        nr_pages = folio_nr_pages(folio);

This wasn't introduced by this patch, FWIW, but I suppose we could make
the fallocend block use a local for clarity.

So given that, the logic is effectively if the folio starts at or beyond
the first page size index beyond EOF, zero the whole thing. That seems
pretty straightforward to me, so I'm not clear on why we'd need to
consider whether the folio is large or not at this point.

> 3. With fallocate(), If a large folio is beyond EOF (i.e. i_size) and also
> beyond 'fallocend', the large folio will be split to small folios. If we
> continue attempting to write out these small folios beyond EOF, we need to
> zero the entire mall folios at this point. So, the logic looks correct
> (because 'index' > 'end_index').
> 

Ack..

> Based on the above analysis, I believe the logic should be:
> 
> if (folio_next_index(folio) >= end_index) {
> 	size_t from = offset_in_folio(folio, i_size);
> 
> 	if (!folio_test_large(folio) && index >= end_index)
> 		folio_zero_segment(folio, 0, folio_size(folio));
> 	else if (from)
> 		folio_zero_segment(folio, from, folio_size(folio));
> }
> 
> The logic here is a bit complex, please correct me if I misunderstood you.
> 

Hmm.. so I'm not really sure about the large folio check. Have you
reviewed the next patch, by chance? It occurs to me that I probably
split these two up wrongly. I probably should have split off the
existing !uptodate zeroing into a separate hunk in patch 1 (i.e. as a
non functional change, refactoring patch) and then introduce the
functional change in patch 2. I'll try that for v3.

But in the meantime, this is the logic after patch 2:

	/*
	 * Ranges beyond EOF must be zeroed at writeout time. This mirrors
	 * traditional writeback behavior and facilitates zeroing on file size
	 * changes without having to swap back in.
	 */
	if (!folio_test_uptodate(folio) ||
	    folio_next_index(folio) >= end_index) {
		size_t from = offset_in_folio(folio, i_size);

		if (!folio_test_uptodate(folio) || index >= end_index) {
			folio_zero_segment(folio, 0, folio_size(folio));
			flush_dcache_folio(folio);
			folio_mark_uptodate(folio);
		} else if (from)
			folio_zero_segment(folio, from, folio_size(folio));
	}

So factoring out the preexisting uptodate logic, this looks mostly
equivalent to what you posted above with the exception of the large
folio check. I could be wrong, but it kind of sounds like that is maybe
due to confusion over the index value.. hm?

In any event, I'm trying to make this logic as simple and clear as
possible. The idea here is basically that any folio being written out
that is either !uptodate or at least partially beyond EOF needs zeroing.

The split logic earlier in the function simply dictates what folios make
it to this point (to be written out vs. tossed) or not, and otherwise we
don't really have to care about state wrt zeroing post-eof folio ranges
(particularly if any of that splitting logic happens to change in the
future). Let me know if I'm missing something. Thanks again.

Brian



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/3] tmpfs: zero post-eof uptodate folios on swapout
  2025-11-19 14:08         ` Brian Foster
@ 2025-11-20  1:57           ` Baolin Wang
  2025-11-20 14:12             ` Brian Foster
  0 siblings, 1 reply; 14+ messages in thread
From: Baolin Wang @ 2025-11-20  1:57 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-mm, Hugh Dickins



On 2025/11/19 22:08, Brian Foster wrote:
> On Wed, Nov 19, 2025 at 11:53:41AM +0800, Baolin Wang wrote:
>>
>>
>> On 2025/11/18 22:39, Brian Foster wrote:
>>> On Tue, Nov 18, 2025 at 10:33:44AM +0800, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2025/11/13 00:25, Brian Foster wrote:
>>>>> As a first step to facilitate efficient post-eof zeroing in tmpfs,
>>>>> zero post-eof uptodate folios at swap out time. This ensures that
>>>>> post-eof ranges are zeroed "on disk" (i.e. analogous to traditional
>>>>> pagecache writeback) and facilitates zeroing on file size changes by
>>>>> allowing it to not have to swap in.
>>>>>
>>>>> Note that shmem_writeout() already zeroes !uptodate folios so this
>>>>> introduces some duplicate logic. We'll clean this up in the next
>>>>> patch.
>>>>>
>>>>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>>>>> ---
>>>>>     mm/shmem.c | 19 +++++++++++++++++--
>>>>>     1 file changed, 17 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>>>> index 0a25ee095b86..5fb3c911894f 100644
>>>>> --- a/mm/shmem.c
>>>>> +++ b/mm/shmem.c
>>>>> @@ -1577,6 +1577,8 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
>>>>>     	struct inode *inode = mapping->host;
>>>>>     	struct shmem_inode_info *info = SHMEM_I(inode);
>>>>>     	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
>>>>> +	loff_t i_size = i_size_read(inode);
>>>>> +	pgoff_t end_index = DIV_ROUND_UP(i_size, PAGE_SIZE);
>>>>>     	pgoff_t index;
>>>>>     	int nr_pages;
>>>>>     	bool split = false;
>>>>> @@ -1596,8 +1598,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
>>>>>     	 * (unless fallocate has been used to preallocate beyond EOF).
>>>>>     	 */
>>>>>     	if (folio_test_large(folio)) {
>>>>> -		index = shmem_fallocend(inode,
>>>>> -			DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE));
>>>>> +		index = shmem_fallocend(inode, end_index);
>>>>>     		if ((index > folio->index && index < folio_next_index(folio)) ||
>>>>>     		    !IS_ENABLED(CONFIG_THP_SWAP))
>>>>>     			split = true;
>>>>> @@ -1647,6 +1648,20 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
>>>>>     		folio_mark_uptodate(folio);
>>>>>     	}
>>>>> +	/*
>>>>> +	 * Ranges beyond EOF must be zeroed at writeout time. This mirrors
>>>>> +	 * traditional writeback behavior and facilitates zeroing on file size
>>>>> +	 * changes without having to swap back in.
>>>>> +	 */
>>>>> +	if (folio_next_index(folio) >= end_index) {
>>>>> +		size_t from = offset_in_folio(folio, i_size);
>>>>> +
>>>>> +		if (index >= end_index) {
>>>>> +			folio_zero_segment(folio, 0, folio_size(folio));
>>>>> +		} else if (from)
>>>>> +			folio_zero_segment(folio, from, folio_size(folio));
>>>>> +	}
>>>>
>>>> As I mentioned before[1], if a large folio is beyond EOF, it will be split
>>>> in shmem_writeout(), and those small folios beyond EOF will be dropped and
>>>> freed in __folio_split(). Of course, there's another special case as Hugh
>>>> mentioned: when there's a 'fallocend' beyond i_size (e.g., fallocate()), it
>>>> will keep the pages allocated beyond EOF after the split. However, your
>>>> 'end_index' here does not consider 'fallocend,' so it seems to me that this
>>>> portion of the code doesn't actually take effect.
>>>>
>>>
>>> Hi Boalin,
>>
>> s/Boalin/Baolin :)
>>
> 
> Sorry, Baolin! ;)
> 
>>>
>>> So I get that split post-eof folios can fall off depending on fallocate
>>> status. I'm not sure what you mean by considering fallocend, however.
>>> ISTM that fallocend contributes to the boundary where we decide to split
>>> and/or preserve, but i_size is what is relevant for zeroing. It's not
>>> clear to me if you're suggesting the logic is potentially spurious, or
>>> this might not actually be zeroing correctly due to falloc interactions.
>>> Can you clarify the concern please? Thanks.
>>
>> Sorry for not being clear enough (for my quick response yesterday). After
>> thinking more, I want to divide this into 3 cases to clearly explain the
>> logic here:
>>
> 
> No worries. Thanks for breaking it down. Much easier to discuss this
> way.
> 
>> 1. Without fallocate(), if a large folio is beyond EOF (i.e. i_size), it
>> will be split in shmem_writeout(), and those small folios beyond EOF will be
>> dropped and freed in __folio_split(). So, your changes should also have no
>> impact, because after the split, ‘folio_next_index(folio)’ is definitely <=
>> ‘end_index’. So the logic is correct.
>>
> 
> Ack, but we still want to zero any post-eof portion of a small folio
> straddling i_size...

Yes.

>> 2. With fallocate(), If a large folio is beyond EOF (i.e. i_size) but
>> smaller than 'fallocend', the folio will not be split. So, we should zero
>> the post-EOF part. Because 'index' (now representing 'fallocend') is greater
>> than 'end_index', you are zeroing the entire large folio, which does not
>> seem correct to me.
>>
> 
> Unless CONFIG_THP_SWAP is disabled (no idea how likely that is, seems
> like an arch thing), in which case it seems we always split (and then
> the split path will still use fallocend to determine whether to toss or
> preserve).
> 
> But otherwise yes, partial zeroing should be the same for the large
> folio across EOF case, it just happens to be a larger folio size...
> 
>> if (index >= end_index) {
>> 	folio_zero_segment(folio, 0, folio_size(folio));
>> } else if ...
>>
>> I think you should only zero the range from 'from' to 'folio_size(folio)' of
>> this large folio in this case. Right?
>>
> 
> However index is folio->index here, not fallocend. index is reassigned a
> bit further down the function just after the block try_split: lands in:
> 
>          index = folio->index;
>          nr_pages = folio_nr_pages(folio);
> 
> This wasn't introduced by this patch, FWIW, but I suppose we could make
> the fallocend block use a local for clarity.

Thanks for correcting me. I mistakenly assumed that the 'index' 
represents 'fallocend' from your patch:

+		index = shmem_fallocend(inode, end_index);

> So given that, the logic is effectively if the folio starts at or beyond
> the first page size index beyond EOF, zero the whole thing. That seems
> pretty straightforward to me, so I'm not clear on why we'd need to
> consider whether the folio is large or not at this point.

Yes, you are right. After you corrected me, I understand that index 
refers to folio->index.

>> 3. With fallocate(), If a large folio is beyond EOF (i.e. i_size) and also
>> beyond 'fallocend', the large folio will be split to small folios. If we
>> continue attempting to write out these small folios beyond EOF, we need to
>> zero the entire mall folios at this point. So, the logic looks correct
>> (because 'index' > 'end_index').
>>
> 
> Ack..
> 
>> Based on the above analysis, I believe the logic should be:
>>
>> if (folio_next_index(folio) >= end_index) {
>> 	size_t from = offset_in_folio(folio, i_size);
>>
>> 	if (!folio_test_large(folio) && index >= end_index)
>> 		folio_zero_segment(folio, 0, folio_size(folio));
>> 	else if (from)
>> 		folio_zero_segment(folio, from, folio_size(folio));
>> }
>>
>> The logic here is a bit complex, please correct me if I misunderstood you.
>>
> 
> Hmm.. so I'm not really sure about the large folio check. Have you
> reviewed the next patch, by chance? It occurs to me that I probably
> split these two up wrongly. I probably should have split off the
> existing !uptodate zeroing into a separate hunk in patch 1 (i.e. as a
> non functional change, refactoring patch) and then introduce the
> functional change in patch 2. I'll try that for v3.
> 
> But in the meantime, this is the logic after patch 2:
> 
> 	/*
> 	 * Ranges beyond EOF must be zeroed at writeout time. This mirrors
> 	 * traditional writeback behavior and facilitates zeroing on file size
> 	 * changes without having to swap back in.
> 	 */
> 	if (!folio_test_uptodate(folio) ||
> 	    folio_next_index(folio) >= end_index) {
> 		size_t from = offset_in_folio(folio, i_size);
> 
> 		if (!folio_test_uptodate(folio) || index >= end_index) {
> 			folio_zero_segment(folio, 0, folio_size(folio));
> 			flush_dcache_folio(folio);
> 			folio_mark_uptodate(folio);
> 		} else if (from)
> 			folio_zero_segment(folio, from, folio_size(folio));
> 	}
> 
> So factoring out the preexisting uptodate logic, this looks mostly
> equivalent to what you posted above with the exception of the large
> folio check. I could be wrong, but it kind of sounds like that is maybe
> due to confusion over the index value.. hm?

Right. My example is wrong due to confusion over the index value. I 
think you are right.

> 
> In any event, I'm trying to make this logic as simple and clear as
> possible. The idea here is basically that any folio being written out
> that is either !uptodate or at least partially beyond EOF needs zeroing.
> 
> The split logic earlier in the function simply dictates what folios make
> it to this point (to be written out vs. tossed) or not, and otherwise we
> don't really have to care about state wrt zeroing post-eof folio ranges
> (particularly if any of that splitting logic happens to change in the
> future). Let me know if I'm missing something. Thanks again.

Thanks for your work and explanation. Now I think your patch is correct, 
and I will continue to review the following patches. Feel free to add:

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

One nit: using 'fallocend' instead of 'index' can avoid confusion:)

diff --git a/mm/shmem.c b/mm/shmem.c
index 371af9e322d5..7f7bdb7944cc 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1586,8 +1586,8 @@ int shmem_writeout(struct folio *folio, struct 
swap_iocb **plug,
          * (unless fallocate has been used to preallocate beyond EOF).
          */
         if (folio_test_large(folio)) {
-               index = shmem_fallocend(inode, end_index);
-               if ((index > folio->index && index < 
folio_next_index(folio)) ||
+               pgoff_t fallocend = shmem_fallocend(inode, end_index);
+               if ((fallocend > folio->index && fallocend < 
folio_next_index(folio)) ||
                     !IS_ENABLED(CONFIG_THP_SWAP))
                         split = true;
         }



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/3] tmpfs: combine !uptodate and post-eof zeroing logic at swapout
  2025-11-12 16:25 ` [PATCH v2 2/3] tmpfs: combine !uptodate and post-eof zeroing logic at swapout Brian Foster
@ 2025-11-20  2:56   ` Baolin Wang
  2025-11-20 14:14     ` Brian Foster
  0 siblings, 1 reply; 14+ messages in thread
From: Baolin Wang @ 2025-11-20  2:56 UTC (permalink / raw)
  To: Brian Foster, linux-mm; +Cc: Hugh Dickins



On 2025/11/13 00:25, Brian Foster wrote:
> shmem_writeout() zeroes folios that are !uptodate (before marking
> them uptodate) or that extend beyond EOF to preserve data integrity
> according to POSIX. This is handled in a couple different blocks.
> Fold the !uptodate zeroing into the post-eof block so we zero from
> one place.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>   mm/shmem.c | 40 +++++++++++++++++++---------------------
>   1 file changed, 19 insertions(+), 21 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 5fb3c911894f..7925ced8a05d 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1627,25 +1627,20 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
>   	 * good idea to continue anyway, once we're pushing into swap.  So
>   	 * reactivate the folio, and let shmem_fallocate() quit when too many.
>   	 */
> -	if (!folio_test_uptodate(folio)) {
> -		if (inode->i_private) {
> -			struct shmem_falloc *shmem_falloc;
> -			spin_lock(&inode->i_lock);
> -			shmem_falloc = inode->i_private;
> -			if (shmem_falloc &&
> -			    !shmem_falloc->waitq &&
> -			    index >= shmem_falloc->start &&
> -			    index < shmem_falloc->next)
> -				shmem_falloc->nr_unswapped += nr_pages;
> -			else
> -				shmem_falloc = NULL;
> -			spin_unlock(&inode->i_lock);
> -			if (shmem_falloc)
> -				goto redirty;
> -		}
> -		folio_zero_range(folio, 0, folio_size(folio));
> -		flush_dcache_folio(folio);
> -		folio_mark_uptodate(folio);
> +	if (!folio_test_uptodate(folio) && inode->i_private) {
> +		struct shmem_falloc *shmem_falloc;
> +		spin_lock(&inode->i_lock);
> +		shmem_falloc = inode->i_private;
> +		if (shmem_falloc &&
> +		    !shmem_falloc->waitq &&
> +		    index >= shmem_falloc->start &&
> +		    index < shmem_falloc->next)
> +			shmem_falloc->nr_unswapped += nr_pages;
> +		else
> +			shmem_falloc = NULL;
> +		spin_unlock(&inode->i_lock);
> +		if (shmem_falloc)
> +			goto redirty;
>   	}
>   
>   	/*
> @@ -1653,11 +1648,14 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
>   	 * traditional writeback behavior and facilitates zeroing on file size
>   	 * changes without having to swap back in.
>   	 */
> -	if (folio_next_index(folio) >= end_index) {
> +	if (!folio_test_uptodate(folio) ||
> +	    folio_next_index(folio) >= end_index) {
>   		size_t from = offset_in_folio(folio, i_size);
>   
> -		if (index >= end_index) {
> +		if (!folio_test_uptodate(folio) || index >= end_index) {
>   			folio_zero_segment(folio, 0, folio_size(folio));
> +			flush_dcache_folio(folio);
> +			folio_mark_uptodate(folio);
>   		} else if (from)
>   			folio_zero_segment(folio, from, folio_size(folio));
>   	}

Overall, this looks correct to me. However, I have some concerns about 
this cleanup, as it involves handling different logic for !uptodate 
folios of the fallocate() and EOF zeroing. I'm not sure if combining 
them together makes the code more readable, since, as discussed in patch 
1, there are multiple scenarios to consider for EOF zeroing. Let's see 
how Hugh views this cleanup.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 3/3] tmpfs: zero post-eof ranges on file extension
  2025-11-12 16:25 ` [PATCH v2 3/3] tmpfs: zero post-eof ranges on file extension Brian Foster
@ 2025-11-20  5:57   ` Baolin Wang
  2025-11-20 14:21     ` Brian Foster
  0 siblings, 1 reply; 14+ messages in thread
From: Baolin Wang @ 2025-11-20  5:57 UTC (permalink / raw)
  To: Brian Foster, linux-mm; +Cc: Hugh Dickins



On 2025/11/13 00:25, Brian Foster wrote:
> POSIX requires that "If the file size is increased, the extended area
> shall appear as if it were zero-filled". It is possible to use mmap to
> write past EOF and that data will become visible instead of zeroes. This
> behavior is reproduced by fstests test generic/363.
> 
> Most traditional filesystems zero any post-eof portion of a folio at
> writeback time or when the file size is extended by truncate or
> extending writes. This ensures that the previously post-eof range of the
> folio is zeroed before it is exposed to the file.
> 
> The tmpfs writeout path has been updated to zero post-eof folio
> ranges similar to traditional writeback. This ensures post-eof
> ranges are zeroed "on disk" and allows size extension zeroing to
> skip over swap entries as they are already appropriately zeroed.
> 
> To that end, introduce a new zeroing helper for proper zeroing on
> file extending operations. This looks up resident folios between the
> original and new eof and for those that are uptodate, zeroes them
> before the associated ranges are exposed to the file. This preserves
> POSIX semantics and allows generic/363 to pass on tmpfs.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>   mm/shmem.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 79 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 7925ced8a05d..a4aceb474377 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1101,6 +1101,78 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index)
>   	return folio;
>   }
>   
> +/*
> + * Zero a post-EOF range about to be exposed by size extension. Zero from the
> + * current i_size through lend, the latter of which typically refers to the
> + * start offset of an extending operation. Skip swap entries because associated
> + * folios were zeroed at swapout time.
> + */
> +static void shmem_zero_eof(struct inode *inode, loff_t lend)
> +{
> +	struct address_space *mapping = inode->i_mapping;
> +	loff_t lstart = i_size_read(inode);
> +	pgoff_t index = (lstart + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +	pgoff_t end = lend >> PAGE_SHIFT;
> +	struct folio_batch fbatch;
> +	struct folio *folio;
> +	int i;
> +	bool same_folio = (lstart >> PAGE_SHIFT) == (lend >> PAGE_SHIFT);
> +
> +	folio = filemap_lock_folio(mapping, lstart >> PAGE_SHIFT);
> +	if (!IS_ERR(folio)) {
> +		same_folio = lend < folio_next_pos(folio);
> +		index = folio_next_index(folio);
> +
> +		if (folio_test_uptodate(folio)) {
> +			size_t from = offset_in_folio(folio, lstart);
> +			size_t len = min_t(loff_t, folio_size(folio) - from,
> +					   lend - lstart);
> +
> +			folio_zero_range(folio, from, len);
> +		}
> +
> +		folio_unlock(folio);
> +		folio_put(folio);
> +	}
> +
> +	if (!same_folio) {
> +		folio = filemap_lock_folio(mapping, lend >> PAGE_SHIFT);
> +		if (!IS_ERR(folio)) {
> +			end = folio->index;
> +
> +			if (folio_test_uptodate(folio)) {
> +				size_t len = lend - folio_pos(folio);
> +				folio_zero_range(folio, 0, len);

I am curious why not zero the whole folio? Since the folio corresponding 
to the 'lend' is also beyond EOF, no? Otherwise look good to me.

> +			}
> +
> +			folio_unlock(folio);
> +			folio_put(folio);
> +		}
> +	}
> +
> +	/*
> +	 * Zero uptodate folios fully within the target range. Uptodate folios
> +	 * beyond EOF are generally unexpected, but can exist if a larger
> +	 * falloc'd and uptodate EOF folio is split.
> +	 */
> +	folio_batch_init(&fbatch);
> +	while (index < end) {
> +		if (!filemap_get_folios(mapping, &index, end - 1, &fbatch))
> +			break;
> +		for (i = 0; i < folio_batch_count(&fbatch); i++) {
> +			folio = fbatch.folios[i];
> +
> +			folio_lock(folio);
> +			if (folio_test_uptodate(folio) &&
> +			    folio->mapping == mapping) {
> +				folio_zero_segment(folio, 0, folio_size(folio));
> +			}
> +			folio_unlock(folio);
> +		}
> +		folio_batch_release(&fbatch);
> +	}
> +}
> +
>   /*
>    * Remove range of pages and swap entries from page cache, and free them.
>    * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
> @@ -1331,6 +1403,8 @@ static int shmem_setattr(struct mnt_idmap *idmap,
>   					oldsize, newsize);
>   			if (error)
>   				return error;
> +			if (newsize > oldsize)
> +				shmem_zero_eof(inode, newsize);
>   			i_size_write(inode, newsize);
>   			update_mtime = true;
>   		} else {
> @@ -3512,6 +3586,8 @@ static ssize_t shmem_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>   	ret = file_update_time(file);
>   	if (ret)
>   		goto unlock;
> +	if (iocb->ki_pos > i_size_read(inode))
> +		shmem_zero_eof(inode, iocb->ki_pos);
>   	ret = generic_perform_write(iocb, from);
>   unlock:
>   	inode_unlock(inode);
> @@ -3844,8 +3920,10 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>   		cond_resched();
>   	}
>   
> -	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
> +	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) {
> +		shmem_zero_eof(inode, offset + len);
>   		i_size_write(inode, offset + len);
> +	}
>   undone:
>   	spin_lock(&inode->i_lock);
>   	inode->i_private = NULL;



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 1/3] tmpfs: zero post-eof uptodate folios on swapout
  2025-11-20  1:57           ` Baolin Wang
@ 2025-11-20 14:12             ` Brian Foster
  0 siblings, 0 replies; 14+ messages in thread
From: Brian Foster @ 2025-11-20 14:12 UTC (permalink / raw)
  To: Baolin Wang; +Cc: linux-mm, Hugh Dickins

On Thu, Nov 20, 2025 at 09:57:40AM +0800, Baolin Wang wrote:
> 
> 
> On 2025/11/19 22:08, Brian Foster wrote:
> > On Wed, Nov 19, 2025 at 11:53:41AM +0800, Baolin Wang wrote:
> > > 
> > > 
> > > On 2025/11/18 22:39, Brian Foster wrote:
> > > > On Tue, Nov 18, 2025 at 10:33:44AM +0800, Baolin Wang wrote:
> > > > > 
> > > > > 
> > > > > On 2025/11/13 00:25, Brian Foster wrote:
> > > > > > As a first step to facilitate efficient post-eof zeroing in tmpfs,
> > > > > > zero post-eof uptodate folios at swap out time. This ensures that
> > > > > > post-eof ranges are zeroed "on disk" (i.e. analogous to traditional
> > > > > > pagecache writeback) and facilitates zeroing on file size changes by
> > > > > > allowing it to not have to swap in.
> > > > > > 
> > > > > > Note that shmem_writeout() already zeroes !uptodate folios so this
> > > > > > introduces some duplicate logic. We'll clean this up in the next
> > > > > > patch.
> > > > > > 
> > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > > > ---
> > > > > >     mm/shmem.c | 19 +++++++++++++++++--
> > > > > >     1 file changed, 17 insertions(+), 2 deletions(-)
> > > > > > 
> > > > > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > > > > index 0a25ee095b86..5fb3c911894f 100644
> > > > > > --- a/mm/shmem.c
> > > > > > +++ b/mm/shmem.c
> > > > > > @@ -1577,6 +1577,8 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
> > > > > >     	struct inode *inode = mapping->host;
> > > > > >     	struct shmem_inode_info *info = SHMEM_I(inode);
> > > > > >     	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
> > > > > > +	loff_t i_size = i_size_read(inode);
> > > > > > +	pgoff_t end_index = DIV_ROUND_UP(i_size, PAGE_SIZE);
> > > > > >     	pgoff_t index;
> > > > > >     	int nr_pages;
> > > > > >     	bool split = false;
> > > > > > @@ -1596,8 +1598,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
> > > > > >     	 * (unless fallocate has been used to preallocate beyond EOF).
> > > > > >     	 */
> > > > > >     	if (folio_test_large(folio)) {
> > > > > > -		index = shmem_fallocend(inode,
> > > > > > -			DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE));
> > > > > > +		index = shmem_fallocend(inode, end_index);
> > > > > >     		if ((index > folio->index && index < folio_next_index(folio)) ||
> > > > > >     		    !IS_ENABLED(CONFIG_THP_SWAP))
> > > > > >     			split = true;
> > > > > > @@ -1647,6 +1648,20 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
> > > > > >     		folio_mark_uptodate(folio);
> > > > > >     	}
> > > > > > +	/*
> > > > > > +	 * Ranges beyond EOF must be zeroed at writeout time. This mirrors
> > > > > > +	 * traditional writeback behavior and facilitates zeroing on file size
> > > > > > +	 * changes without having to swap back in.
> > > > > > +	 */
> > > > > > +	if (folio_next_index(folio) >= end_index) {
> > > > > > +		size_t from = offset_in_folio(folio, i_size);
> > > > > > +
> > > > > > +		if (index >= end_index) {
> > > > > > +			folio_zero_segment(folio, 0, folio_size(folio));
> > > > > > +		} else if (from)
> > > > > > +			folio_zero_segment(folio, from, folio_size(folio));
> > > > > > +	}
> > > > > 
> > > > > As I mentioned before[1], if a large folio is beyond EOF, it will be split
> > > > > in shmem_writeout(), and those small folios beyond EOF will be dropped and
> > > > > freed in __folio_split(). Of course, there's another special case as Hugh
> > > > > mentioned: when there's a 'fallocend' beyond i_size (e.g., fallocate()), it
> > > > > will keep the pages allocated beyond EOF after the split. However, your
> > > > > 'end_index' here does not consider 'fallocend,' so it seems to me that this
> > > > > portion of the code doesn't actually take effect.
> > > > > 
> > > > 
> > > > Hi Boalin,
> > > 
> > > s/Boalin/Baolin :)
> > > 
> > 
> > Sorry, Baolin! ;)
> > 
> > > > 
> > > > So I get that split post-eof folios can fall off depending on fallocate
> > > > status. I'm not sure what you mean by considering fallocend, however.
> > > > ISTM that fallocend contributes to the boundary where we decide to split
> > > > and/or preserve, but i_size is what is relevant for zeroing. It's not
> > > > clear to me if you're suggesting the logic is potentially spurious, or
> > > > this might not actually be zeroing correctly due to falloc interactions.
> > > > Can you clarify the concern please? Thanks.
> > > 
> > > Sorry for not being clear enough (for my quick response yesterday). After
> > > thinking more, I want to divide this into 3 cases to clearly explain the
> > > logic here:
> > > 
> > 
> > No worries. Thanks for breaking it down. Much easier to discuss this
> > way.
> > 
> > > 1. Without fallocate(), if a large folio is beyond EOF (i.e. i_size), it
> > > will be split in shmem_writeout(), and those small folios beyond EOF will be
> > > dropped and freed in __folio_split(). So, your changes should also have no
> > > impact, because after the split, ‘folio_next_index(folio)’ is definitely <=
> > > ‘end_index’. So the logic is correct.
> > > 
> > 
> > Ack, but we still want to zero any post-eof portion of a small folio
> > straddling i_size...
> 
> Yes.
> 
> > > 2. With fallocate(), If a large folio is beyond EOF (i.e. i_size) but
> > > smaller than 'fallocend', the folio will not be split. So, we should zero
> > > the post-EOF part. Because 'index' (now representing 'fallocend') is greater
> > > than 'end_index', you are zeroing the entire large folio, which does not
> > > seem correct to me.
> > > 
> > 
> > Unless CONFIG_THP_SWAP is disabled (no idea how likely that is, seems
> > like an arch thing), in which case it seems we always split (and then
> > the split path will still use fallocend to determine whether to toss or
> > preserve).
> > 
> > But otherwise yes, partial zeroing should be the same for the large
> > folio across EOF case, it just happens to be a larger folio size...
> > 
> > > if (index >= end_index) {
> > > 	folio_zero_segment(folio, 0, folio_size(folio));
> > > } else if ...
> > > 
> > > I think you should only zero the range from 'from' to 'folio_size(folio)' of
> > > this large folio in this case. Right?
> > > 
> > 
> > However index is folio->index here, not fallocend. index is reassigned a
> > bit further down the function just after the block try_split: lands in:
> > 
> >          index = folio->index;
> >          nr_pages = folio_nr_pages(folio);
> > 
> > This wasn't introduced by this patch, FWIW, but I suppose we could make
> > the fallocend block use a local for clarity.
> 
> Thanks for correcting me. I mistakenly assumed that the 'index' represents
> 'fallocend' from your patch:
> 
> +		index = shmem_fallocend(inode, end_index);
> 
> > So given that, the logic is effectively if the folio starts at or beyond
> > the first page size index beyond EOF, zero the whole thing. That seems
> > pretty straightforward to me, so I'm not clear on why we'd need to
> > consider whether the folio is large or not at this point.
> 
> Yes, you are right. After you corrected me, I understand that index refers
> to folio->index.
> 
> > > 3. With fallocate(), If a large folio is beyond EOF (i.e. i_size) and also
> > > beyond 'fallocend', the large folio will be split to small folios. If we
> > > continue attempting to write out these small folios beyond EOF, we need to
> > > zero the entire mall folios at this point. So, the logic looks correct
> > > (because 'index' > 'end_index').
> > > 
> > 
> > Ack..
> > 
> > > Based on the above analysis, I believe the logic should be:
> > > 
> > > if (folio_next_index(folio) >= end_index) {
> > > 	size_t from = offset_in_folio(folio, i_size);
> > > 
> > > 	if (!folio_test_large(folio) && index >= end_index)
> > > 		folio_zero_segment(folio, 0, folio_size(folio));
> > > 	else if (from)
> > > 		folio_zero_segment(folio, from, folio_size(folio));
> > > }
> > > 
> > > The logic here is a bit complex, please correct me if I misunderstood you.
> > > 
> > 
> > Hmm.. so I'm not really sure about the large folio check. Have you
> > reviewed the next patch, by chance? It occurs to me that I probably
> > split these two up wrongly. I probably should have split off the
> > existing !uptodate zeroing into a separate hunk in patch 1 (i.e. as a
> > non functional change, refactoring patch) and then introduce the
> > functional change in patch 2. I'll try that for v3.
> > 
> > But in the meantime, this is the logic after patch 2:
> > 
> > 	/*
> > 	 * Ranges beyond EOF must be zeroed at writeout time. This mirrors
> > 	 * traditional writeback behavior and facilitates zeroing on file size
> > 	 * changes without having to swap back in.
> > 	 */
> > 	if (!folio_test_uptodate(folio) ||
> > 	    folio_next_index(folio) >= end_index) {
> > 		size_t from = offset_in_folio(folio, i_size);
> > 
> > 		if (!folio_test_uptodate(folio) || index >= end_index) {
> > 			folio_zero_segment(folio, 0, folio_size(folio));
> > 			flush_dcache_folio(folio);
> > 			folio_mark_uptodate(folio);
> > 		} else if (from)
> > 			folio_zero_segment(folio, from, folio_size(folio));
> > 	}
> > 
> > So factoring out the preexisting uptodate logic, this looks mostly
> > equivalent to what you posted above with the exception of the large
> > folio check. I could be wrong, but it kind of sounds like that is maybe
> > due to confusion over the index value.. hm?
> 
> Right. My example is wrong due to confusion over the index value. I think
> you are right.
> 
> > 
> > In any event, I'm trying to make this logic as simple and clear as
> > possible. The idea here is basically that any folio being written out
> > that is either !uptodate or at least partially beyond EOF needs zeroing.
> > 
> > The split logic earlier in the function simply dictates what folios make
> > it to this point (to be written out vs. tossed) or not, and otherwise we
> > don't really have to care about state wrt zeroing post-eof folio ranges
> > (particularly if any of that splitting logic happens to change in the
> > future). Let me know if I'm missing something. Thanks again.
> 
> Thanks for your work and explanation. Now I think your patch is correct, and
> I will continue to review the following patches. Feel free to add:
> 
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> 
> One nit: using 'fallocend' instead of 'index' can avoid confusion:)
> 

Great, thanks! Agreed, I'll fold this in for v3..

Brian

> diff --git a/mm/shmem.c b/mm/shmem.c
> index 371af9e322d5..7f7bdb7944cc 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1586,8 +1586,8 @@ int shmem_writeout(struct folio *folio, struct
> swap_iocb **plug,
>          * (unless fallocate has been used to preallocate beyond EOF).
>          */
>         if (folio_test_large(folio)) {
> -               index = shmem_fallocend(inode, end_index);
> -               if ((index > folio->index && index <
> folio_next_index(folio)) ||
> +               pgoff_t fallocend = shmem_fallocend(inode, end_index);
> +               if ((fallocend > folio->index && fallocend <
> folio_next_index(folio)) ||
>                     !IS_ENABLED(CONFIG_THP_SWAP))
>                         split = true;
>         }
> 



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/3] tmpfs: combine !uptodate and post-eof zeroing logic at swapout
  2025-11-20  2:56   ` Baolin Wang
@ 2025-11-20 14:14     ` Brian Foster
  0 siblings, 0 replies; 14+ messages in thread
From: Brian Foster @ 2025-11-20 14:14 UTC (permalink / raw)
  To: Baolin Wang; +Cc: linux-mm, Hugh Dickins

On Thu, Nov 20, 2025 at 10:56:41AM +0800, Baolin Wang wrote:
> 
> 
> On 2025/11/13 00:25, Brian Foster wrote:
> > shmem_writeout() zeroes folios that are !uptodate (before marking
> > them uptodate) or that extend beyond EOF to preserve data integrity
> > according to POSIX. This is handled in a couple different blocks.
> > Fold the !uptodate zeroing into the post-eof block so we zero from
> > one place.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >   mm/shmem.c | 40 +++++++++++++++++++---------------------
> >   1 file changed, 19 insertions(+), 21 deletions(-)
> > 
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 5fb3c911894f..7925ced8a05d 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -1627,25 +1627,20 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
> >   	 * good idea to continue anyway, once we're pushing into swap.  So
> >   	 * reactivate the folio, and let shmem_fallocate() quit when too many.
> >   	 */
> > -	if (!folio_test_uptodate(folio)) {
> > -		if (inode->i_private) {
> > -			struct shmem_falloc *shmem_falloc;
> > -			spin_lock(&inode->i_lock);
> > -			shmem_falloc = inode->i_private;
> > -			if (shmem_falloc &&
> > -			    !shmem_falloc->waitq &&
> > -			    index >= shmem_falloc->start &&
> > -			    index < shmem_falloc->next)
> > -				shmem_falloc->nr_unswapped += nr_pages;
> > -			else
> > -				shmem_falloc = NULL;
> > -			spin_unlock(&inode->i_lock);
> > -			if (shmem_falloc)
> > -				goto redirty;
> > -		}
> > -		folio_zero_range(folio, 0, folio_size(folio));
> > -		flush_dcache_folio(folio);
> > -		folio_mark_uptodate(folio);
> > +	if (!folio_test_uptodate(folio) && inode->i_private) {
> > +		struct shmem_falloc *shmem_falloc;
> > +		spin_lock(&inode->i_lock);
> > +		shmem_falloc = inode->i_private;
> > +		if (shmem_falloc &&
> > +		    !shmem_falloc->waitq &&
> > +		    index >= shmem_falloc->start &&
> > +		    index < shmem_falloc->next)
> > +			shmem_falloc->nr_unswapped += nr_pages;
> > +		else
> > +			shmem_falloc = NULL;
> > +		spin_unlock(&inode->i_lock);
> > +		if (shmem_falloc)
> > +			goto redirty;
> >   	}
> >   	/*
> > @@ -1653,11 +1648,14 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
> >   	 * traditional writeback behavior and facilitates zeroing on file size
> >   	 * changes without having to swap back in.
> >   	 */
> > -	if (folio_next_index(folio) >= end_index) {
> > +	if (!folio_test_uptodate(folio) ||
> > +	    folio_next_index(folio) >= end_index) {
> >   		size_t from = offset_in_folio(folio, i_size);
> > -		if (index >= end_index) {
> > +		if (!folio_test_uptodate(folio) || index >= end_index) {
> >   			folio_zero_segment(folio, 0, folio_size(folio));
> > +			flush_dcache_folio(folio);
> > +			folio_mark_uptodate(folio);
> >   		} else if (from)
> >   			folio_zero_segment(folio, from, folio_size(folio));
> >   	}
> 
> Overall, this looks correct to me. However, I have some concerns about this
> cleanup, as it involves handling different logic for !uptodate folios of the
> fallocate() and EOF zeroing. I'm not sure if combining them together makes
> the code more readable, since, as discussed in patch 1, there are multiple
> scenarios to consider for EOF zeroing. Let's see how Hugh views this
> cleanup.
> 

Yeah, fair. FWIW I played around with reordering this a bit yesterday
and made another cleanup so the end result now looks like this:

	/*
	 * Ranges beyond EOF must be zeroed at writeout time. This mirrors
	 * traditional writeback behavior and facilitates zeroing on file size
	 * changes without having to swap back in.
	 */
	if (!folio_test_uptodate(folio) ||
	    folio_next_index(folio) >= end_index) {
		size_t from = offset_in_folio(folio, i_size);

		if (!folio_test_uptodate(folio) || index >= end_index)
			folio_zero_segment(folio, 0, folio_size(folio));
		else if (from)
			folio_zero_segment(folio, from, folio_size(folio));

		flush_dcache_folio(folio);
		folio_mark_uptodate(folio);
	}

I find it a little cleaner personally, but that is still clearly
condensing some of the logic between the two cases, so certainly
debatable.

I'm going to try to spin around a v3 quicker than anticipated to fix up
the wonky ordering thing, since I think the way I did it here adds
unnecessary confusion. Hopefully I can get to that before Hugh digs into
this one..

Brian



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 3/3] tmpfs: zero post-eof ranges on file extension
  2025-11-20  5:57   ` Baolin Wang
@ 2025-11-20 14:21     ` Brian Foster
  0 siblings, 0 replies; 14+ messages in thread
From: Brian Foster @ 2025-11-20 14:21 UTC (permalink / raw)
  To: Baolin Wang; +Cc: linux-mm, Hugh Dickins

On Thu, Nov 20, 2025 at 01:57:48PM +0800, Baolin Wang wrote:
> 
> 
> On 2025/11/13 00:25, Brian Foster wrote:
> > POSIX requires that "If the file size is increased, the extended area
> > shall appear as if it were zero-filled". It is possible to use mmap to
> > write past EOF and that data will become visible instead of zeroes. This
> > behavior is reproduced by fstests test generic/363.
> > 
> > Most traditional filesystems zero any post-eof portion of a folio at
> > writeback time or when the file size is extended by truncate or
> > extending writes. This ensures that the previously post-eof range of the
> > folio is zeroed before it is exposed to the file.
> > 
> > The tmpfs writeout path has been updated to zero post-eof folio
> > ranges similar to traditional writeback. This ensures post-eof
> > ranges are zeroed "on disk" and allows size extension zeroing to
> > skip over swap entries as they are already appropriately zeroed.
> > 
> > To that end, introduce a new zeroing helper for proper zeroing on
> > file extending operations. This looks up resident folios between the
> > original and new eof and for those that are uptodate, zeroes them
> > before the associated ranges are exposed to the file. This preserves
> > POSIX semantics and allows generic/363 to pass on tmpfs.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >   mm/shmem.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
> >   1 file changed, 79 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 7925ced8a05d..a4aceb474377 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -1101,6 +1101,78 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index)
> >   	return folio;
> >   }
> > +/*
> > + * Zero a post-EOF range about to be exposed by size extension. Zero from the
> > + * current i_size through lend, the latter of which typically refers to the
> > + * start offset of an extending operation. Skip swap entries because associated
> > + * folios were zeroed at swapout time.
> > + */
> > +static void shmem_zero_eof(struct inode *inode, loff_t lend)
> > +{
> > +	struct address_space *mapping = inode->i_mapping;
> > +	loff_t lstart = i_size_read(inode);
> > +	pgoff_t index = (lstart + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > +	pgoff_t end = lend >> PAGE_SHIFT;
> > +	struct folio_batch fbatch;
> > +	struct folio *folio;
> > +	int i;
> > +	bool same_folio = (lstart >> PAGE_SHIFT) == (lend >> PAGE_SHIFT);
> > +
> > +	folio = filemap_lock_folio(mapping, lstart >> PAGE_SHIFT);
> > +	if (!IS_ERR(folio)) {
> > +		same_folio = lend < folio_next_pos(folio);
> > +		index = folio_next_index(folio);
> > +
> > +		if (folio_test_uptodate(folio)) {
> > +			size_t from = offset_in_folio(folio, lstart);
> > +			size_t len = min_t(loff_t, folio_size(folio) - from,
> > +					   lend - lstart);
> > +
> > +			folio_zero_range(folio, from, len);
> > +		}
> > +
> > +		folio_unlock(folio);
> > +		folio_put(folio);
> > +	}
> > +
> > +	if (!same_folio) {
> > +		folio = filemap_lock_folio(mapping, lend >> PAGE_SHIFT);
> > +		if (!IS_ERR(folio)) {
> > +			end = folio->index;
> > +
> > +			if (folio_test_uptodate(folio)) {
> > +				size_t len = lend - folio_pos(folio);
> > +				folio_zero_range(folio, 0, len);
> 
> I am curious why not zero the whole folio? Since the folio corresponding to
> the 'lend' is also beyond EOF, no? Otherwise look good to me.
> 

In practice I suspect we could do something like that since the entire
range is presumed beyond EOF, but I suppose it's just semantics. The end
offset will typically correspond to something like the start of a write,
so that would be unnecessary zeroing in some cases.

I don't love the prospect of having a function with a byte granular end
offset that doesn't really honor the offset in the way one would expect.
I suspect that means I'd have to change this to have a pgoff end index
or something to make the zeroing granularity more clear, and ultimately
I'm not really sure that helps anything functionally.

So really I'd say this is mainly just because this is modeled after
typical fs zeroing behavior for these sorts of size extension
situations. It is a context specific implementation of course, but the
interface/behavior seemed the most natural and consistent to me.

Thanks again for the feedback on these patches..

Brian

> > +			}
> > +
> > +			folio_unlock(folio);
> > +			folio_put(folio);
> > +		}
> > +	}
> > +
> > +	/*
> > +	 * Zero uptodate folios fully within the target range. Uptodate folios
> > +	 * beyond EOF are generally unexpected, but can exist if a larger
> > +	 * falloc'd and uptodate EOF folio is split.
> > +	 */
> > +	folio_batch_init(&fbatch);
> > +	while (index < end) {
> > +		if (!filemap_get_folios(mapping, &index, end - 1, &fbatch))
> > +			break;
> > +		for (i = 0; i < folio_batch_count(&fbatch); i++) {
> > +			folio = fbatch.folios[i];
> > +
> > +			folio_lock(folio);
> > +			if (folio_test_uptodate(folio) &&
> > +			    folio->mapping == mapping) {
> > +				folio_zero_segment(folio, 0, folio_size(folio));
> > +			}
> > +			folio_unlock(folio);
> > +		}
> > +		folio_batch_release(&fbatch);
> > +	}
> > +}
> > +
> >   /*
> >    * Remove range of pages and swap entries from page cache, and free them.
> >    * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
> > @@ -1331,6 +1403,8 @@ static int shmem_setattr(struct mnt_idmap *idmap,
> >   					oldsize, newsize);
> >   			if (error)
> >   				return error;
> > +			if (newsize > oldsize)
> > +				shmem_zero_eof(inode, newsize);
> >   			i_size_write(inode, newsize);
> >   			update_mtime = true;
> >   		} else {
> > @@ -3512,6 +3586,8 @@ static ssize_t shmem_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >   	ret = file_update_time(file);
> >   	if (ret)
> >   		goto unlock;
> > +	if (iocb->ki_pos > i_size_read(inode))
> > +		shmem_zero_eof(inode, iocb->ki_pos);
> >   	ret = generic_perform_write(iocb, from);
> >   unlock:
> >   	inode_unlock(inode);
> > @@ -3844,8 +3920,10 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
> >   		cond_resched();
> >   	}
> > -	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
> > +	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) {
> > +		shmem_zero_eof(inode, offset + len);
> >   		i_size_write(inode, offset + len);
> > +	}
> >   undone:
> >   	spin_lock(&inode->i_lock);
> >   	inode->i_private = NULL;
> 



^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2025-11-20 14:21 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-12 16:25 [PATCH v2 0/3] tmpfs: zero post-eof ranges on file extension Brian Foster
2025-11-12 16:25 ` [PATCH v2 1/3] tmpfs: zero post-eof uptodate folios on swapout Brian Foster
2025-11-18  2:33   ` Baolin Wang
2025-11-18 14:39     ` Brian Foster
2025-11-19  3:53       ` Baolin Wang
2025-11-19 14:08         ` Brian Foster
2025-11-20  1:57           ` Baolin Wang
2025-11-20 14:12             ` Brian Foster
2025-11-12 16:25 ` [PATCH v2 2/3] tmpfs: combine !uptodate and post-eof zeroing logic at swapout Brian Foster
2025-11-20  2:56   ` Baolin Wang
2025-11-20 14:14     ` Brian Foster
2025-11-12 16:25 ` [PATCH v2 3/3] tmpfs: zero post-eof ranges on file extension Brian Foster
2025-11-20  5:57   ` Baolin Wang
2025-11-20 14:21     ` Brian Foster

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox