* [PATCH 0/2] ext4, mm: improve partial inode eof zeroing
@ 2024-09-19 16:07 Brian Foster
2024-09-19 16:07 ` [PATCH 1/2] ext4: partial zero eof block on unaligned inode size extension Brian Foster
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Brian Foster @ 2024-09-19 16:07 UTC (permalink / raw)
To: linux-ext4, linux-mm; +Cc: linux-fsdevel, willy
Hi all,
I've been poking around at testing zeroing behavior after a couple
recent enhancements to iomap_zero_range() and fsx[1]. Running [1] on
ext4 has uncovered a couple issues that I think share responsibility
between the fs and pagecache.
The details are in the commit logs, but patch 1 updates ext4 to do
partial eof block zeroing in more cases and patch 2 tweaks
pagecache_isize_extended() to do eof folio zeroing similar to what is
done during writeback (i.e., ext4_bio_write_folio(),
iomap_writepage_handle_eof(), etc.). These changes somewhat overlap, but the fs
changes handle the case of a block straddling eof (so we're writing to
disk in that case) and the pagecache changes handle the case of a folio
straddling eof that might be at least partially hole backed (i.e.
sub-page block sizes, so we're just clearing pagecache).
Aside from general review, my biggest questions WRT patch 1 are 1.
whether the journalling bits are handled correctly and 2. whether the
verity case is handled correctly. I recall seeing verity checks around
the code and I don't know enough about the feature to quite understand
why. FWIW, I have run fstests against this using various combinations of
block size and journalling modes without any regression so far. That
includes enabling generic/363 [1] for ext4, which afaict is now possible
with these two proposed changes.
WRT patch 2, I originally tested with unconditional zeroing and added
the dirty check after. This still survives testing, but I'm having
second thoughts on whether that is correct or introduces a small race
window between writeback and an i_size update. I guess there's also a
question of whether the fs or pagecache should be responsible for this,
but given writeback and truncate_setsize() behavior this seemed fairly
consistent to me.
Thoughts, reviews, flames appreciated.
Brian
[1] https://lore.kernel.org/fstests/20240828181534.41054-1-bfoster@redhat.com/
Brian Foster (2):
ext4: partial zero eof block on unaligned inode size extension
mm: zero range of eof folio exposed by inode size extension
fs/ext4/extents.c | 7 ++++++-
fs/ext4/inode.c | 51 +++++++++++++++++++++++++++++++++--------------
mm/truncate.c | 15 ++++++++++++++
3 files changed, 57 insertions(+), 16 deletions(-)
--
2.45.0
^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH 1/2] ext4: partial zero eof block on unaligned inode size extension
2024-09-19 16:07 [PATCH 0/2] ext4, mm: improve partial inode eof zeroing Brian Foster
@ 2024-09-19 16:07 ` Brian Foster
2024-09-19 16:07 ` [PATCH 2/2] mm: zero range of eof folio exposed by " Brian Foster
2024-11-07 15:12 ` [PATCH 0/2] ext4, mm: improve partial inode eof zeroing Theodore Ts'o
2 siblings, 0 replies; 6+ messages in thread
From: Brian Foster @ 2024-09-19 16:07 UTC (permalink / raw)
To: linux-ext4, linux-mm; +Cc: linux-fsdevel, willy
Using mapped writes, it's technically possible to expose stale
post-eof data on a truncate up operation. Consider the following
example:
$ xfs_io -fc "pwrite 0 2k" -c "mmap 0 4k" -c "mwrite 2k 2k" \
-c "truncate 8k" -c "pread -v 2k 16" <file>
...
00000800: 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 XXXXXXXXXXXXXXXX
...
This shows that the post-eof data written via mwrite lands within
EOF after a truncate up. While this is deliberate in the test case, the
behavior is somewhat unpredictable because writeback does post-eof
zeroing, and writeback can occur at any time in the background. For
example, an fsync inserted between the mwrite and truncate causes
the subsequent read to instead return zeroes. This basically means
that there is a race window in this situation between any subsequent
extending operation and writeback that dictates whether post-eof
data is exposed to the file or zeroed.
To prevent this problem, perform partial block zeroing as part of
the various inode size extending operations that are susceptible to
it. For truncate extension, zero around the original eof similar to
how truncate down does partial zeroing of the new eof. For extension
via writes and fallocate related operations, zero the newly exposed
range of the file to cover any partial zeroing that must occur at
the original and new eof blocks.
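In terms of the ranges involved, the truncate-up case only needs to zero
the unwritten tail of the old EOF block. A quick sketch of that
arithmetic (function name is illustrative, not a kernel helper):

```python
def eof_block_zero_range(old_size, bsize):
    """Bytes of the old EOF block exposed by a truncate up.

    Returns (start, end) to zero, or None when old_size is already
    block aligned and no partial block straddles the old EOF.
    """
    if old_size % bsize == 0:
        return None
    start = old_size
    end = (old_size // bsize + 1) * bsize  # end of the old EOF block
    return (start, end)

# e.g. a 1k block size fs with i_size = 2500 before the extension:
print(eof_block_zero_range(2500, 1024))  # (2500, 3072)
print(eof_block_zero_range(2048, 1024))  # None
```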
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
fs/ext4/extents.c | 7 ++++++-
fs/ext4/inode.c | 51 +++++++++++++++++++++++++++++++++--------------
2 files changed, 42 insertions(+), 16 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index e067f2dd0335..d43a23abf148 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4457,7 +4457,7 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
int depth = 0;
struct ext4_map_blocks map;
unsigned int credits;
- loff_t epos;
+ loff_t epos, old_size = i_size_read(inode);
BUG_ON(!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS));
map.m_lblk = offset;
@@ -4516,6 +4516,11 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
if (ext4_update_inode_size(inode, epos) & 0x1)
inode_set_mtime_to_ts(inode,
inode_get_ctime(inode));
+ if (epos > old_size) {
+ pagecache_isize_extended(inode, old_size, epos);
+ ext4_zero_partial_blocks(handle, inode,
+ old_size, epos - old_size);
+ }
}
ret2 = ext4_mark_inode_dirty(handle, inode);
ext4_update_inode_fsync_trans(handle, inode, 1);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 03374dc215d1..c8d5334cecca 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1327,8 +1327,10 @@ static int ext4_write_end(struct file *file,
folio_unlock(folio);
folio_put(folio);
- if (old_size < pos && !verity)
+ if (old_size < pos && !verity) {
pagecache_isize_extended(inode, old_size, pos);
+ ext4_zero_partial_blocks(handle, inode, old_size, pos - old_size);
+ }
/*
* Don't mark the inode dirty under folio lock. First, it unnecessarily
* makes the holding time of folio lock longer. Second, it forces lock
@@ -1443,8 +1445,10 @@ static int ext4_journalled_write_end(struct file *file,
folio_unlock(folio);
folio_put(folio);
- if (old_size < pos && !verity)
+ if (old_size < pos && !verity) {
pagecache_isize_extended(inode, old_size, pos);
+ ext4_zero_partial_blocks(handle, inode, old_size, pos - old_size);
+ }
if (size_changed) {
ret2 = ext4_mark_inode_dirty(handle, inode);
@@ -3015,7 +3019,8 @@ static int ext4_da_do_write_end(struct address_space *mapping,
struct inode *inode = mapping->host;
loff_t old_size = inode->i_size;
bool disksize_changed = false;
- loff_t new_i_size;
+ loff_t new_i_size, zero_len = 0;
+ handle_t *handle;
if (unlikely(!folio_buffers(folio))) {
folio_unlock(folio);
@@ -3059,18 +3064,21 @@ static int ext4_da_do_write_end(struct address_space *mapping,
folio_unlock(folio);
folio_put(folio);
- if (old_size < pos)
+ if (pos > old_size) {
pagecache_isize_extended(inode, old_size, pos);
+ zero_len = pos - old_size;
+ }
- if (disksize_changed) {
- handle_t *handle;
+ if (!disksize_changed && !zero_len)
+ return copied;
- handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
- if (IS_ERR(handle))
- return PTR_ERR(handle);
- ext4_mark_inode_dirty(handle, inode);
- ext4_journal_stop(handle);
- }
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+ if (zero_len)
+ ext4_zero_partial_blocks(handle, inode, old_size, zero_len);
+ ext4_mark_inode_dirty(handle, inode);
+ ext4_journal_stop(handle);
return copied;
}
@@ -5453,6 +5461,14 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
}
if (attr->ia_size != inode->i_size) {
+ /* attach jbd2 jinode for EOF folio tail zeroing */
+ if (attr->ia_size & (inode->i_sb->s_blocksize - 1) ||
+ oldsize & (inode->i_sb->s_blocksize - 1)) {
+ error = ext4_inode_attach_jinode(inode);
+ if (error)
+ goto err_out;
+ }
+
handle = ext4_journal_start(inode, EXT4_HT_INODE, 3);
if (IS_ERR(handle)) {
error = PTR_ERR(handle);
@@ -5463,12 +5479,17 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
orphan = 1;
}
/*
- * Update c/mtime on truncate up, ext4_truncate() will
- * update c/mtime in shrink case below
+ * Update c/mtime and tail zero the EOF folio on
+ * truncate up. ext4_truncate() handles the shrink case
+ * below.
*/
- if (!shrink)
+ if (!shrink) {
inode_set_mtime_to_ts(inode,
inode_set_ctime_current(inode));
+ if (oldsize & (inode->i_sb->s_blocksize - 1))
+ ext4_block_truncate_page(handle,
+ inode->i_mapping, oldsize);
+ }
if (shrink)
ext4_fc_track_range(handle, inode,
--
2.45.0
* [PATCH 2/2] mm: zero range of eof folio exposed by inode size extension
2024-09-19 16:07 [PATCH 0/2] ext4, mm: improve partial inode eof zeroing Brian Foster
2024-09-19 16:07 ` [PATCH 1/2] ext4: partial zero eof block on unaligned inode size extension Brian Foster
@ 2024-09-19 16:07 ` Brian Foster
2026-02-26 13:31 ` [PATCH] mm: fix pagecache_isize_extended() early-return bypass for large folio mappings Morduan Zang
2024-11-07 15:12 ` [PATCH 0/2] ext4, mm: improve partial inode eof zeroing Theodore Ts'o
2 siblings, 1 reply; 6+ messages in thread
From: Brian Foster @ 2024-09-19 16:07 UTC (permalink / raw)
To: linux-ext4, linux-mm; +Cc: linux-fsdevel, willy
On some filesystems, it is currently possible to create a transient
data inconsistency between pagecache and on-disk state. For example,
on a 1k block size ext4 filesystem:
$ xfs_io -fc "pwrite 0 2k" -c "mmap 0 4k" -c "mwrite 2k 2k" \
-c "truncate 8k" -c "fiemap -v" -c "pread -v 2k 16" <file>
...
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..3]: 17410..17413 4 0x1
1: [4..15]: hole 12
00000800: 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 XXXXXXXXXXXXXXXX
$ umount <mnt>; mount <dev> <mnt>
$ xfs_io -c "pread -v 2k 16" <file>
00000800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
This allocates and writes two 1k blocks, mmap writes to the post-eof
portion of the (4k) eof folio, extends the file, and then shows that
the post-eof data is not cleared before the file size is extended.
The result is pagecache with a clean and uptodate folio over a hole
that returns non-zero data. Once reclaimed, pagecache begins to
return valid data.
Some filesystems avoid this problem by flushing the EOF folio before
inode size extension. This triggers writeback time partial post-eof
zeroing. XFS explicitly zeroes newly exposed file ranges via
iomap_zero_range(), but this includes a hack to flush dirty but
hole-backed folios, which means writeback actually does the zeroing
in this particular case as well. bcachefs explicitly flushes the eof
folio on truncate extension to the same effect, but doesn't handle
the analogous write extension case (i.e., replace "truncate 8k" with
"pwrite 4k 4k" in the above example command to reproduce the same
problem on bcachefs). btrfs doesn't seem to support subpage block
sizes.
The two main options to avoid this behavior are to either flush or
do the appropriate zeroing during size extending operations. Zeroing
is only required when the size change exposes ranges of the file
that haven't been directly written, such as a write or truncate that
starts beyond the current eof. The pagecache_isize_extended() helper
is already used for this particular scenario. It currently cleans
any pte's for the eof folio to ensure preexisting mappings fault and
allow the filesystem to take action based on the updated inode size.
This is required to ensure the folio is fully backed by allocated
blocks, for example, but this also happens to be the same scenario in
which zeroing is required.
Update pagecache_isize_extended() to zero the post-eof range of the
eof folio if it is dirty at the time of the size change, since
writeback now won't have the chance. If non-dirty, the folio has
either not been written or the post-eof portion was zeroed by
writeback.
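The range zeroed within the EOF folio follows directly from the hunk
below; as a standalone sketch of that offset arithmetic (pure
illustration, not kernel code):

```python
def folio_zero_span(from_, to, folio_pos, folio_size):
    """Mirror the [offset, end) computation used for the dirty EOF folio."""
    offset = from_ - folio_pos                  # old i_size within folio
    end = min(to - folio_pos, folio_size)       # clamp to folio end
    return (offset, end)

# old i_size 2048, new i_size 8192, 4k folio starting at file offset 0:
print(folio_zero_span(2048, 8192, 0, 4096))  # (2048, 4096)
```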
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
mm/truncate.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/mm/truncate.c b/mm/truncate.c
index 0668cd340a46..6e7f3cfb982d 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -797,6 +797,21 @@ void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to)
*/
if (folio_mkclean(folio))
folio_mark_dirty(folio);
+
+ /*
+ * The post-eof range of the folio must be zeroed before it is exposed
+ * to the file. Writeback normally does this, but since i_size has been
+ * increased we handle it here.
+ */
+ if (folio_test_dirty(folio)) {
+ unsigned int offset, end;
+
+ offset = from - folio_pos(folio);
+ end = min_t(unsigned int, to - folio_pos(folio),
+ folio_size(folio));
+ folio_zero_segment(folio, offset, end);
+ }
+
folio_unlock(folio);
folio_put(folio);
}
--
2.45.0
* Re: [PATCH 0/2] ext4, mm: improve partial inode eof zeroing
2024-09-19 16:07 [PATCH 0/2] ext4, mm: improve partial inode eof zeroing Brian Foster
2024-09-19 16:07 ` [PATCH 1/2] ext4: partial zero eof block on unaligned inode size extension Brian Foster
2024-09-19 16:07 ` [PATCH 2/2] mm: zero range of eof folio exposed by " Brian Foster
@ 2024-11-07 15:12 ` Theodore Ts'o
2 siblings, 0 replies; 6+ messages in thread
From: Theodore Ts'o @ 2024-11-07 15:12 UTC (permalink / raw)
To: linux-ext4, linux-mm, Brian Foster
Cc: Theodore Ts'o, linux-fsdevel, willy
On Thu, 19 Sep 2024 12:07:39 -0400, Brian Foster wrote:
> I've been poking around at testing zeroing behavior after a couple
> recent enhancements to iomap_zero_range() and fsx[1]. Running [1] on
> ext4 has uncovered a couple issues that I think share responsibility
> between the fs and pagecache.
>
> The details are in the commit logs, but patch 1 updates ext4 to do
> partial eof block zeroing in more cases and patch 2 tweaks
> pagecache_isize_extended() to do eof folio zeroing similar to what is
> done during writeback (i.e., ext4_bio_write_folio(),
> iomap_writepage_handle_eof(), etc.). These changes somewhat overlap, but the fs
> changes handle the case of a block straddling eof (so we're writing to
> disk in that case) and the pagecache changes handle the case of a folio
> straddling eof that might be at least partially hole backed (i.e.
> sub-page block sizes, so we're just clearing pagecache).
>
> [...]
Applied, thanks!
[1/2] ext4: partial zero eof block on unaligned inode size extension
commit: 462a214e71f3fbc40d28f0a00fe6f0d4c4041c98
[2/2] mm: zero range of eof folio exposed by inode size extension
commit: faf7bba6b84981443773952289571e5ebeda1767
Best regards,
--
Theodore Ts'o <tytso@mit.edu>
* [PATCH] mm: fix pagecache_isize_extended() early-return bypass for large folio mappings
2024-09-19 16:07 ` [PATCH 2/2] mm: zero range of eof folio exposed by " Brian Foster
@ 2026-02-26 13:31 ` Morduan Zang
2026-02-26 20:22 ` Brian Foster
0 siblings, 1 reply; 6+ messages in thread
From: Morduan Zang @ 2026-02-26 13:31 UTC (permalink / raw)
To: bfoster; +Cc: linux-ext4, linux-fsdevel, linux-mm, willy, Morduan Zang
pagecache_isize_extended() has two early-return guards that were designed
for the traditional sub-page block-size case:
Guard 1: if (from >= to || bsize >= PAGE_SIZE)
return;
Guard 2: rounded_from = round_up(from, bsize);
if (to <= rounded_from || !(rounded_from & (PAGE_SIZE - 1)))
return;
Guard 1 was originally "bsize == PAGE_SIZE" and was widened to
"bsize >= PAGE_SIZE" by commit 2ebe90dab980 ("mm: convert
pagecache_isize_extended to use a folio"). The rationale is correct
for the traditional buffer_head path: when the block size equals the page
size, every folio covers exactly one block, so writeback's EOF handling
(e.g. iomap_writepage_handle_eof()) zeros the post-EOF tail of the folio
before writing it out, and no action is needed here.
Guard 2 covers the case where @from rounded up to the next block boundary
is already PAGE_SIZE-aligned, meaning no hole block straddles a page
boundary.
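The combined effect of the two guards can be sketched as a predicate
(PAGE_SIZE fixed at 4k here purely for illustration):

```python
PAGE_SIZE = 4096

def round_up(x, a):
    # ceiling x to a multiple of a
    return -(-x // a) * a

def guards_pass(from_, to, bsize):
    """True when pagecache_isize_extended() proceeds past both early returns."""
    if from_ >= to or bsize >= PAGE_SIZE:
        return False                                           # guard 1
    rounded_from = round_up(from_, bsize)
    if to <= rounded_from or not (rounded_from & (PAGE_SIZE - 1)):
        return False                                           # guard 2
    return True

# 1k blocks, extension from 2k to 8k: proceeds (a hole block may straddle EOF)
print(guards_pass(2048, 8192, 1024))  # True
# bsize == PAGE_SIZE: guard 1 fires, bypassing the zeroing added later
print(guards_pass(2048, 8192, 4096))  # False
```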
Both guards are correct for the traditional case. However, commit
52aecaee1c26 ("mm: zero range of eof folio exposed by inode size extension")
added post-EOF zeroing inside pagecache_isize_extended() to
handle dirty folios that will not go through writeback before the new
i_size becomes visible. That zeroing code is placed after both guards,
so it is unreachable whenever either guard fires. In particular, with
large folios enabled and bsize >= PAGE_SIZE, guard 1 fires
unconditionally, even though a dirty large folio can span multiple
pages and straddle EOF, leaving its post-EOF tail unzeroed.
The same stale-data window is also covered by xfstests generic/363,
which uses fsx with "-e 1" (EOF pollution mode) and exercises a broad
range of size-changing operations.
Fixes: 52aecaee1c26 ("mm: zero range of eof folio exposed by inode size extension")
Fixes: 2ebe90dab980 ("mm: convert pagecache_isize_extended to use a folio")
Signed-off-by: Morduan Zang <zhangdandan@uniontech.com>
---
mm/truncate.c | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)
diff --git a/mm/truncate.c b/mm/truncate.c
index 12467c1bd711..d3e473a206b3 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -847,13 +847,32 @@ void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to)
WARN_ON(to > inode->i_size);
- if (from >= to || bsize >= PAGE_SIZE)
+ if (from >= to)
return;
+
+ /*
+ * For filesystems with bsize >= PAGE_SIZE, the traditional buffer_head
+ * path handles post-EOF zeroing correctly at writeback time. However,
+ * with large folios enabled, a single folio can span multiple PAGE_SIZE
+ * blocks, so mmap writes beyond EOF within the same folio are not zeroed
+ * at writeback time before i_size is extended. We must handle this here.
+ */
+ if (bsize >= PAGE_SIZE) {
+ /*
+ * Only needed if the mapping supports large folios, since otherwise
+ * each folio is exactly one page and writeback handles EOF zeroing.
+ */
+ if (!mapping_large_folio_support(inode->i_mapping))
+ return;
+ goto find_folio;
+ }
+
/* Page straddling @from will not have any hole block created? */
rounded_from = round_up(from, bsize);
if (to <= rounded_from || !(rounded_from & (PAGE_SIZE - 1)))
return;
+find_folio:
folio = filemap_lock_folio(inode->i_mapping, from / PAGE_SIZE);
/* Folio not cached? Nothing to do */
if (IS_ERR(folio))
--
2.50.1
* Re: [PATCH] mm: fix pagecache_isize_extended() early-return bypass for large folio mappings
2026-02-26 13:31 ` [PATCH] mm: fix pagecache_isize_extended() early-return bypass for large folio mappings Morduan Zang
@ 2026-02-26 20:22 ` Brian Foster
0 siblings, 0 replies; 6+ messages in thread
From: Brian Foster @ 2026-02-26 20:22 UTC (permalink / raw)
To: Morduan Zang; +Cc: linux-ext4, linux-fsdevel, linux-mm, willy
On Thu, Feb 26, 2026 at 09:31:49PM +0800, Morduan Zang wrote:
> pagecache_isize_extended() has two early-return guards that were designed
> for the traditional sub-page block-size case:
>
> Guard 1: if (from >= to || bsize >= PAGE_SIZE)
> return;
>
> Guard 2: rounded_from = round_up(from, bsize);
> if (to <= rounded_from || !(rounded_from & (PAGE_SIZE - 1)))
> return;
>
> Guard 1 was originally "bsize == PAGE_SIZE" and was widened to
> "bsize >= PAGE_SIZE" by commit 2ebe90dab980 ("mm: convert
> pagecache_isize_extended to use a folio"). The rationale is correct
> for the traditional buffer_head path: when the block size equals the page
> size, every folio covers exactly one block, so writeback's EOF handling
> (e.g. iomap_writepage_handle_eof()) zeros the post-EOF tail of the folio
> before writing it out, and no action is needed here.
>
> Guard 2 covers the case where @from rounded up to the next block boundary
> is already PAGE_SIZE-aligned, meaning no hole block straddles a page
> boundary.
>
> Both guards are correct for the traditional case. However, commit
> 52aecaee1c26 ("mm: zero range of eof folio exposed by inode size extension")
> added post-EOF zeroing inside pagecache_isize_extended() to
> handle dirty folios that will not go through writeback before the new
> i_size becomes visible. That zeroing code is placed after both guards,
> so it is unreachable whenever either guard fires.
>
> The same stale-data window is also covered by xfstests generic/363
> which uses fsx with "-e 1" (EOF pollution mode) and exercises a broad
> range of size-changing operations.
>
Hi Morduan,
So looking back at the original cover letter for this, this bit was for
the case where we had a dirty folio in pagecache that might be partially
hole backed due to eof, therefore fs zeroing might not occur. Hence we
do the page zeroing here before exposing this range to the file (i.e.
that writeback would have done if the folio were clean).
I thought at the time this plus the ext4 patch covered the bases for
generic/363 on ext4. You refer to this test above but don't mention if
it fails. Do you reproduce a failure with that test, or is this
something discovered by inspection?
> Fixes: 52aecaee1c26 ("mm: zero range of eof folio exposed by inode size extension")
> Fixes: 2ebe90dab980 ("mm: convert pagecache_isize_extended to use a folio")
> Signed-off-by: Morduan Zang <zhangdandan@uniontech.com>
> ---
> mm/truncate.c | 21 ++++++++++++++++++++-
> 1 file changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 12467c1bd711..d3e473a206b3 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -847,13 +847,32 @@ void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to)
>
> WARN_ON(to > inode->i_size);
>
> - if (from >= to || bsize >= PAGE_SIZE)
> + if (from >= to)
> return;
> +
> + /*
> + * For filesystems with bsize >= PAGE_SIZE, the traditional buffer_head
> + * path handles post-EOF zeroing correctly at writeback time. However,
> + * with large folios enabled, a single folio can span multiple PAGE_SIZE
> + * blocks, so mmap writes beyond EOF within the same folio are not zeroed
> + * at writeback time before i_size is extended. We must handle this here.
> + */
> + if (bsize >= PAGE_SIZE) {
> + /*
> + * Only needed if the mapping supports large folios, since otherwise
> + * each folio is exactly one page and writeback handles EOF zeroing.
> + */
> + if (!mapping_large_folio_support(inode->i_mapping))
> + return;
Is there currently a case for bsize >= PAGE_SIZE &&
!mapping_large_folio_support()? I thought there was a WIP for
multi-block folios, but I wasn't sure if that actually worked anywhere.
> + goto find_folio;
> + }
> +
> /* Page straddling @from will not have any hole block created? */
> rounded_from = round_up(from, bsize);
> if (to <= rounded_from || !(rounded_from & (PAGE_SIZE - 1)))
> return;
>
If I understood this code correctly (and I very well may not), the
purpose of this is to basically filter out cases where a dirty eof folio
doesn't require a refault after the size update for the fs to fully
populate it with blocks. If that is the case, this makes me wonder if
perhaps this check should remain, but instead use folio_size() of the
eof folio (if one exists)..?
My understanding at one point was that we wouldn't have large eof folios
that included a page aligned offset beyond eof, but I also feel like
I've run into that once or twice when dealing with some other oddball fs
related issues, so I'm not really clear on what the expected behavior is
supposed to be there. Maybe it's a corner case (i.e. related to split
failure or some such)..? That is probably a question for Willy..
Brian
> +find_folio:
> folio = filemap_lock_folio(inode->i_mapping, from / PAGE_SIZE);
> /* Folio not cached? Nothing to do */
> if (IS_ERR(folio))
> --
> 2.50.1
>
>