Re: [fuse-devel] Debugging a stale kernel cache during file growth

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* Re: [fuse-devel] Debugging a stale kernel cache during file growth
       [not found] ` <CAOQ4uxjoS-PvnZ2poh0bx0C6ocTYwuEpfV0q5md15SjS620OMg@mail.gmail.com>
@ 2026-04-16 12:12   ` Miklos Szeredi
  2026-04-16 12:41     ` Nikolay Amiantov
  2026-04-16 22:54     ` Matthew Wilcox
  0 siblings, 2 replies; 6+ messages in thread
From: Miklos Szeredi @ 2026-04-16 12:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Nikolay Amiantov, fuse-devel, linux-fsdevel, Amir Goldstein,
	fuse-devel, linux-mm

On Thu, 16 Apr 2026 at 09:24, Amir Goldstein <amir73il@gmail.com> wrote:

> > I've recently encountered a weird issue with JuiceFS [1], a network FS
> > which uses FUSE. tl;dr: when a file was being slowly appended, a reader
> > of the same file on another host would periodically read a block of zero
> > bytes instead of the actual data.

Thanks for the report.

I wonder if we could clear PG_uptodate on the page which had its zero
bytes exposed by the i_size increase?

Willy?

Thanks,
Miklos


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [fuse-devel] Debugging a stale kernel cache during file growth
  2026-04-16 12:12   ` [fuse-devel] Debugging a stale kernel cache during file growth Miklos Szeredi
@ 2026-04-16 12:41     ` Nikolay Amiantov
  2026-04-16 12:49       ` Nikolay Amiantov
  2026-04-16 23:19       ` Matthew Wilcox
  2026-04-16 22:54     ` Matthew Wilcox
  1 sibling, 2 replies; 6+ messages in thread
From: Nikolay Amiantov @ 2026-04-16 12:41 UTC (permalink / raw)
  To: Miklos Szeredi, Matthew Wilcox
  Cc: fuse-devel, linux-fsdevel, Amir Goldstein, fuse-devel, linux-mm


[-- Attachment #1.1: Type: text/plain, Size: 1336 bytes --]

On 4/16/26 19:12, Miklos Szeredi wrote:
> I wonder if we could clear PG_uptodate on the page which had its zero
> bytes exposed by the i_size increase?

I've actually tried that first. The idea was to get or create a new page 
on the EOF boundary, lock it and poison it with an uptodate reset if we 
need to. But this resulted in an instantaneous EIO in my test. If I 
undestand correctly, this is because of another race condition:

* A fresh page gets created and read by FUSE; uptodate is true;
* The page is unlocked on return from `fuse_read_folio`;
* Simultaneously, we run `getattr`. The page gets locked, uptodate is 
reset, the page is unlocked;
* Now back from `fuse_read_folio`, `filemap_read_folio` gets this page, 
waits on `folio_wait_locked_killable` (waiting for the getattr to reset 
uptodate), and then checks `folio_test_uptodate`;
* The page is !uptodate, so an EIO is returned.

So it effectively results in inability to have a successful `read` when 
a `getattr` for a growing file happens simultaneously.

Finally, if I understand correctly, this also leaves a (much smaller) 
theoretical race condition in `filemap_read` between checking uptodate 
and getting the current inode size.

Attached is the patch with this attempt; please check that it does what 
you meant in case I misunderstood.

Cheers,
Nikolay.

[-- Attachment #1.2: Type: text/html, Size: 1901 bytes --]

[-- Attachment #2: 0001-fuse-fix-stale-page-cache-data-race-on-file-growth.patch --]
[-- Type: text/x-patch, Size: 1830 bytes --]

From 512194b982fd0edbc1dcaa50fafad75b1be26d42 Mon Sep 17 00:00:00 2001
From: Nikolay Amiantov <ab@fmap.me>
Date: Wed, 15 Apr 2026 07:28:19 +0000
Subject: [PATCH] fuse: fix stale page cache data race on file growth

---
 fs/fuse/inode.c | 36 ++++++++++++++++++++++++++++++++++--
 1 file changed, 34 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 735abf426a06..20741869ac2f 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -334,10 +334,42 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
 	 * extend local i_size without keeping userspace server in sync. So,
 	 * attr->size coming from server can be stale. We cannot trust it.
 	 */
-	if (!(cache_mask & STATX_SIZE))
-		i_size_write(inode, attr->size);
+	if (!(cache_mask & STATX_SIZE)) {
+		if (S_ISREG(inode->i_mode) && attr->size > oldsize) {
+			struct folio *folio;
+			pgoff_t index = oldsize >> PAGE_SHIFT;
+
+			spin_unlock(&fi->lock);
+			folio = __filemap_get_folio(inode->i_mapping, index,
+						    FGP_LOCK | FGP_CREAT,
+						    mapping_gfp_mask(inode->i_mapping));
+			if (!IS_ERR(folio)) {
+				spin_lock(&fi->lock);
+				if (!test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state)) {
+					folio_clear_uptodate(folio);
+					i_size_write(inode, attr->size);
+				}
+				spin_unlock(&fi->lock);
+
+				folio_unlock(folio);
+				folio_put(folio);
+				goto size_updated;
+			}
+			spin_lock(&fi->lock);
+			/*
+			 * Folio alloc failed (ENOMEM). Recheck in case a
+			 * write/truncate started while we dropped the lock.
+			 */
+			if (!test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state))
+				i_size_write(inode, attr->size);
+		} else {
+			i_size_write(inode, attr->size);
+		}
+	}
 	spin_unlock(&fi->lock);
 
+size_updated:
+
 	if (!cache_mask && S_ISREG(inode->i_mode)) {
 		bool inval = false;
 
-- 
2.47.0


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [fuse-devel] Debugging a stale kernel cache during file growth
  2026-04-16 12:41     ` Nikolay Amiantov
@ 2026-04-16 12:49       ` Nikolay Amiantov
  2026-04-16 23:19       ` Matthew Wilcox
  1 sibling, 0 replies; 6+ messages in thread
From: Nikolay Amiantov @ 2026-04-16 12:49 UTC (permalink / raw)
  To: Miklos Szeredi, Matthew Wilcox
  Cc: fuse-devel, linux-fsdevel, fuse-devel, linux-mm

On 4/16/26 19:41, Nikolay Amiantov via fuse-devel wrote:
> Finally, if I understand correctly, this also leaves a (much smaller) 
> theoretical race condition in `filemap_read` between checking uptodate 
> and getting the current inode size.
Correction: "would have resulted" in a race condition if we would be 
retrying to get a fresh folio instead of returning an EIO; I have 
assumed that's the case when I tried the patch.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [fuse-devel] Debugging a stale kernel cache during file growth
  2026-04-16 12:41     ` Nikolay Amiantov
  2026-04-16 12:49       ` Nikolay Amiantov
@ 2026-04-16 23:19       ` Matthew Wilcox
  2026-04-17  6:24         ` Nikolay Amiantov
  1 sibling, 1 reply; 6+ messages in thread
From: Matthew Wilcox @ 2026-04-16 23:19 UTC (permalink / raw)
  To: Nikolay Amiantov
  Cc: Miklos Szeredi, fuse-devel, linux-fsdevel, Amir Goldstein,
	fuse-devel, linux-mm

On Thu, Apr 16, 2026 at 07:41:37PM +0700, Nikolay Amiantov wrote:
> +++ b/fs/fuse/inode.c
> @@ -334,10 +334,42 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
>  	 * extend local i_size without keeping userspace server in sync. So,
>  	 * attr->size coming from server can be stale. We cannot trust it.
>  	 */
> -	if (!(cache_mask & STATX_SIZE))
> -		i_size_write(inode, attr->size);
> +	if (!(cache_mask & STATX_SIZE)) {
> +		if (S_ISREG(inode->i_mode) && attr->size > oldsize) {
> +			struct folio *folio;
> +			pgoff_t index = oldsize >> PAGE_SHIFT;
> +
> +			spin_unlock(&fi->lock);
> +			folio = __filemap_get_folio(inode->i_mapping, index,
> +						    FGP_LOCK | FGP_CREAT,
> +						    mapping_gfp_mask(inode->i_mapping));
> +			if (!IS_ERR(folio)) {
> +				spin_lock(&fi->lock);
> +				if (!test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state)) {
> +					folio_clear_uptodate(folio);
> +					i_size_write(inode, attr->size);
> +				}
> +				spin_unlock(&fi->lock);
> +
> +				folio_unlock(folio);
> +				folio_put(folio);
> +				goto size_updated;

Yes, at this point, you've left the folio in an error state.  I'm sure you
didn't mean to do that, but the VFS interprets unlocked && !uptodate as
"an error happened" (there is a minor exception to this involving failed
readahead, but let's set that aside).

What you could do, rather than unlock the folio here is to initiate a
read of the folio and allow the read to unlock the folio.  But I don't
think this is a good idea, I like the idea of invalidating the folio
much better.



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [fuse-devel] Debugging a stale kernel cache during file growth
  2026-04-16 23:19       ` Matthew Wilcox
@ 2026-04-17  6:24         ` Nikolay Amiantov
  0 siblings, 0 replies; 6+ messages in thread
From: Nikolay Amiantov @ 2026-04-17  6:24 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Miklos Szeredi, fuse-devel, linux-fsdevel, Amir Goldstein,
	fuse-devel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1270 bytes --]

On 4/17/26 06:19, Matthew Wilcox wrote:
> Yes, at this point, you've left the folio in an error state.  I'm sure you
> didn't mean to do that, but the VFS interprets unlocked && !uptodate as
> "an error happened" (there is a minor exception to this involving failed
> readahead, but let's set that aside).

Thanks, I see!

To save my reasoning somewhere: another way to do this would be 
NFS/CIFS-style, in a lazy way. They set a flag in `getattr` and 
invalidate later in `read()` instead. This could avoid relocking the 
spinlock; I still opted for invalidating inside `getattr` though since 
FUSE already has invalidation later in the same call, and the cost of 
relocking feels low to me in this case.

Any ideas on how to resolve the remaining race condition [1]? If I'm 
correct it affects any network FS, and can't be fixed without changing 
the common VFS code somehow. I'd like someone to confirm my conclusions 
though.

I'm way over my head here though willing to learn; if someone is willing 
to mentor me on designing the fix, I'd be happy to. My best uneducated 
guess is to introduce another flag for a page and check it *after* we 
get the inode size in `filemap_read()`; if it's set, retry reading.

1: https://github.com/abbradar/nfs_stale_cache_test

[-- Attachment #2: Type: text/html, Size: 1853 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [fuse-devel] Debugging a stale kernel cache during file growth
  2026-04-16 12:12   ` [fuse-devel] Debugging a stale kernel cache during file growth Miklos Szeredi
  2026-04-16 12:41     ` Nikolay Amiantov
@ 2026-04-16 22:54     ` Matthew Wilcox
  1 sibling, 0 replies; 6+ messages in thread
From: Matthew Wilcox @ 2026-04-16 22:54 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Nikolay Amiantov, fuse-devel, linux-fsdevel, Amir Goldstein,
	fuse-devel, linux-mm

On Thu, Apr 16, 2026 at 02:12:37PM +0200, Miklos Szeredi wrote:
> On Thu, 16 Apr 2026 at 09:24, Amir Goldstein <amir73il@gmail.com> wrote:
> 
> > > I've recently encountered a weird issue with JuiceFS [1], a network FS
> > > which uses FUSE. tl;dr: when a file was being slowly appended, a reader
> > > of the same file on another host would periodically read a block of zero
> > > bytes instead of the actual data.
> 
> Thanks for the report.
> 
> I wonder if we could clear PG_uptodate on the page which had its zero
> bytes exposed by the i_size increase?
> 
> Willy?

I think every filesystem which clear PG_uptodate is doing it wrong.
I know we have ~30 places which do it, and I haven't audited them all, 
but clearing the uptodate bit can lead to the VM throwing an absolute
fit if any of the pages in that folio are mapped.

I don't think it'll make much difference whether it's cleared or
invalidated from the page cache.  Either way we're re-reading all
the data in it, which would dominate the time saved by not doing a trip
through the page allocator.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2026-04-17  6:24 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <898a4e10-6193-4671-b3b1-7c7bc562a671@fmap.me>
     [not found] ` <CAOQ4uxjoS-PvnZ2poh0bx0C6ocTYwuEpfV0q5md15SjS620OMg@mail.gmail.com>
2026-04-16 12:12   ` [fuse-devel] Debugging a stale kernel cache during file growth Miklos Szeredi
2026-04-16 12:41     ` Nikolay Amiantov
2026-04-16 12:49       ` Nikolay Amiantov
2026-04-16 23:19       ` Matthew Wilcox
2026-04-17  6:24         ` Nikolay Amiantov
2026-04-16 22:54     ` Matthew Wilcox

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox