* Re: [fuse-devel] Debugging a stale kernel cache during file growth [not found] ` <CAOQ4uxjoS-PvnZ2poh0bx0C6ocTYwuEpfV0q5md15SjS620OMg@mail.gmail.com> @ 2026-04-16 12:12 ` Miklos Szeredi 2026-04-16 12:41 ` Nikolay Amiantov 2026-04-16 22:54 ` Matthew Wilcox 0 siblings, 2 replies; 6+ messages in thread From: Miklos Szeredi @ 2026-04-16 12:12 UTC (permalink / raw) To: Matthew Wilcox Cc: Nikolay Amiantov, fuse-devel, linux-fsdevel, Amir Goldstein, fuse-devel, linux-mm On Thu, 16 Apr 2026 at 09:24, Amir Goldstein <amir73il@gmail.com> wrote: > > I've recently encountered a weird issue with JuiceFS [1], a network FS > > which uses FUSE. tl;dr: when a file was being slowly appended, a reader > > of the same file on another host would periodically read a block of zero > > bytes instead of the actual data. Thanks for the report. I wonder if we could clear PG_uptodate on the page which had its zero bytes exposed by the i_size increase? Willy? Thanks, Miklos ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [fuse-devel] Debugging a stale kernel cache during file growth 2026-04-16 12:12 ` [fuse-devel] Debugging a stale kernel cache during file growth Miklos Szeredi @ 2026-04-16 12:41 ` Nikolay Amiantov 2026-04-16 12:49 ` Nikolay Amiantov 2026-04-16 23:19 ` Matthew Wilcox 2026-04-16 22:54 ` Matthew Wilcox 1 sibling, 2 replies; 6+ messages in thread From: Nikolay Amiantov @ 2026-04-16 12:41 UTC (permalink / raw) To: Miklos Szeredi, Matthew Wilcox Cc: fuse-devel, linux-fsdevel, Amir Goldstein, fuse-devel, linux-mm [-- Attachment #1.1: Type: text/plain, Size: 1336 bytes --] On 4/16/26 19:12, Miklos Szeredi wrote: > I wonder if we could clear PG_uptodate on the page which had its zero > bytes exposed by the i_size increase? I've actually tried that first. The idea was to get or create a new page on the EOF boundary, lock it and poison it with an uptodate reset if we need to. But this resulted in an instantaneous EIO in my test. If I undestand correctly, this is because of another race condition: * A fresh page gets created and read by FUSE; uptodate is true; * The page is unlocked on return from `fuse_read_folio`; * Simultaneously, we run `getattr`. The page gets locked, uptodate is reset, the page is unlocked; * Now back from `fuse_read_folio`, `filemap_read_folio` gets this page, waits on `folio_wait_locked_killable` (waiting for the getattr to reset uptodate), and then checks `folio_test_uptodate`; * The page is !uptodate, so an EIO is returned. So it effectively results in inability to have a successful `read` when a `getattr` for a growing file happens simultaneously. Finally, if I understand correctly, this also leaves a (much smaller) theoretical race condition in `filemap_read` between checking uptodate and getting the current inode size. Attached is the patch with this attempt; please check that it does what you meant in case I misunderstood. Cheers, Nikolay. [-- Attachment #1.2: Type: text/html, Size: 1901 bytes --] [-- Attachment #2: 0001-fuse-fix-stale-page-cache-data-race-on-file-growth.patch --] [-- Type: text/x-patch, Size: 1830 bytes --] From 512194b982fd0edbc1dcaa50fafad75b1be26d42 Mon Sep 17 00:00:00 2001 From: Nikolay Amiantov <ab@fmap.me> Date: Wed, 15 Apr 2026 07:28:19 +0000 Subject: [PATCH] fuse: fix stale page cache data race on file growth --- fs/fuse/inode.c | 36 ++++++++++++++++++++++++++++++++++-- 1 file changed, 34 insertions(+), 2 deletions(-) diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 735abf426a06..20741869ac2f 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -334,10 +334,42 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr, * extend local i_size without keeping userspace server in sync. So, * attr->size coming from server can be stale. We cannot trust it. */ - if (!(cache_mask & STATX_SIZE)) - i_size_write(inode, attr->size); + if (!(cache_mask & STATX_SIZE)) { + if (S_ISREG(inode->i_mode) && attr->size > oldsize) { + struct folio *folio; + pgoff_t index = oldsize >> PAGE_SHIFT; + + spin_unlock(&fi->lock); + folio = __filemap_get_folio(inode->i_mapping, index, + FGP_LOCK | FGP_CREAT, + mapping_gfp_mask(inode->i_mapping)); + if (!IS_ERR(folio)) { + spin_lock(&fi->lock); + if (!test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state)) { + folio_clear_uptodate(folio); + i_size_write(inode, attr->size); + } + spin_unlock(&fi->lock); + + folio_unlock(folio); + folio_put(folio); + goto size_updated; + } + spin_lock(&fi->lock); + /* + * Folio alloc failed (ENOMEM). Recheck in case a + * write/truncate started while we dropped the lock. + */ + if (!test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state)) + i_size_write(inode, attr->size); + } else { + i_size_write(inode, attr->size); + } + } spin_unlock(&fi->lock); +size_updated: + if (!cache_mask && S_ISREG(inode->i_mode)) { bool inval = false; -- 2.47.0 ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [fuse-devel] Debugging a stale kernel cache during file growth 2026-04-16 12:41 ` Nikolay Amiantov @ 2026-04-16 12:49 ` Nikolay Amiantov 2026-04-16 23:19 ` Matthew Wilcox 1 sibling, 0 replies; 6+ messages in thread From: Nikolay Amiantov @ 2026-04-16 12:49 UTC (permalink / raw) To: Miklos Szeredi, Matthew Wilcox Cc: fuse-devel, linux-fsdevel, fuse-devel, linux-mm On 4/16/26 19:41, Nikolay Amiantov via fuse-devel wrote: > Finally, if I understand correctly, this also leaves a (much smaller) > theoretical race condition in `filemap_read` between checking uptodate > and getting the current inode size. Correction: "would have resulted" in a race condition if we would be retrying to get a fresh folio instead of returning an EIO; I have assumed that's the case when I tried the patch. ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [fuse-devel] Debugging a stale kernel cache during file growth 2026-04-16 12:41 ` Nikolay Amiantov 2026-04-16 12:49 ` Nikolay Amiantov @ 2026-04-16 23:19 ` Matthew Wilcox 2026-04-17 6:24 ` Nikolay Amiantov 1 sibling, 1 reply; 6+ messages in thread From: Matthew Wilcox @ 2026-04-16 23:19 UTC (permalink / raw) To: Nikolay Amiantov Cc: Miklos Szeredi, fuse-devel, linux-fsdevel, Amir Goldstein, fuse-devel, linux-mm On Thu, Apr 16, 2026 at 07:41:37PM +0700, Nikolay Amiantov wrote: > +++ b/fs/fuse/inode.c > @@ -334,10 +334,42 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr, > * extend local i_size without keeping userspace server in sync. So, > * attr->size coming from server can be stale. We cannot trust it. > */ > - if (!(cache_mask & STATX_SIZE)) > - i_size_write(inode, attr->size); > + if (!(cache_mask & STATX_SIZE)) { > + if (S_ISREG(inode->i_mode) && attr->size > oldsize) { > + struct folio *folio; > + pgoff_t index = oldsize >> PAGE_SHIFT; > + > + spin_unlock(&fi->lock); > + folio = __filemap_get_folio(inode->i_mapping, index, > + FGP_LOCK | FGP_CREAT, > + mapping_gfp_mask(inode->i_mapping)); > + if (!IS_ERR(folio)) { > + spin_lock(&fi->lock); > + if (!test_bit(FUSE_I_SIZE_UNSTABLE, &fi->state)) { > + folio_clear_uptodate(folio); > + i_size_write(inode, attr->size); > + } > + spin_unlock(&fi->lock); > + > + folio_unlock(folio); > + folio_put(folio); > + goto size_updated; Yes, at this point, you've left the folio in an error state. I'm sure you didn't mean to do that, but the VFS interprets unlocked && !uptodate as "an error happened" (there is a minor exception to this involving failed readahead, but let's set that aside). What you could do, rather than unlock the folio here is to initiate a read of the folio and allow the read to unlock the folio. But I don't think this is a good idea, I like the idea of invalidating the folio much better. ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [fuse-devel] Debugging a stale kernel cache during file growth 2026-04-16 23:19 ` Matthew Wilcox @ 2026-04-17 6:24 ` Nikolay Amiantov 0 siblings, 0 replies; 6+ messages in thread From: Nikolay Amiantov @ 2026-04-17 6:24 UTC (permalink / raw) To: Matthew Wilcox Cc: Miklos Szeredi, fuse-devel, linux-fsdevel, Amir Goldstein, fuse-devel, linux-mm [-- Attachment #1: Type: text/plain, Size: 1270 bytes --] On 4/17/26 06:19, Matthew Wilcox wrote: > Yes, at this point, you've left the folio in an error state. I'm sure you > didn't mean to do that, but the VFS interprets unlocked && !uptodate as > "an error happened" (there is a minor exception to this involving failed > readahead, but let's set that aside). Thanks, I see! To save my reasoning somewhere: another way to do this would be NFS/CIFS-style, in a lazy way. They set a flag in `getattr` and invalidate later in `read()` instead. This could avoid relocking the spinlock; I still opted for invalidating inside `getattr` though since FUSE already has invalidation later in the same call, and the cost of relocking feels low to me in this case. Any ideas on how to resolve the remaining race condition [1]? If I'm correct it affects any network FS, and can't be fixed without changing the common VFS code somehow. I'd like someone to confirm my conclusions though. I'm way over my head here though willing to learn; if someone is willing to mentor me on designing the fix, I'd be happy to. My best uneducated guess is to introduce another flag for a page and check it *after* we get the inode size in `filemap_read()`; if it's set, retry reading. 1: https://github.com/abbradar/nfs_stale_cache_test [-- Attachment #2: Type: text/html, Size: 1853 bytes --] ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [fuse-devel] Debugging a stale kernel cache during file growth 2026-04-16 12:12 ` [fuse-devel] Debugging a stale kernel cache during file growth Miklos Szeredi 2026-04-16 12:41 ` Nikolay Amiantov @ 2026-04-16 22:54 ` Matthew Wilcox 1 sibling, 0 replies; 6+ messages in thread From: Matthew Wilcox @ 2026-04-16 22:54 UTC (permalink / raw) To: Miklos Szeredi Cc: Nikolay Amiantov, fuse-devel, linux-fsdevel, Amir Goldstein, fuse-devel, linux-mm On Thu, Apr 16, 2026 at 02:12:37PM +0200, Miklos Szeredi wrote: > On Thu, 16 Apr 2026 at 09:24, Amir Goldstein <amir73il@gmail.com> wrote: > > > > I've recently encountered a weird issue with JuiceFS [1], a network FS > > > which uses FUSE. tl;dr: when a file was being slowly appended, a reader > > > of the same file on another host would periodically read a block of zero > > > bytes instead of the actual data. > > Thanks for the report. > > I wonder if we could clear PG_uptodate on the page which had its zero > bytes exposed by the i_size increase? > > Willy? I think every filesystem which clear PG_uptodate is doing it wrong. I know we have ~30 places which do it, and I haven't audited them all, but clearing the uptodate bit can lead to the VM throwing an absolute fit if any of the pages in that folio are mapped. I don't think it'll make much difference whether it's cleared or invalidated from the page cache. Either way we're re-reading all the data in it, which would dominate the time saved by not doing a trip through the page allocator. ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2026-04-17 6:24 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <898a4e10-6193-4671-b3b1-7c7bc562a671@fmap.me>
[not found] ` <CAOQ4uxjoS-PvnZ2poh0bx0C6ocTYwuEpfV0q5md15SjS620OMg@mail.gmail.com>
2026-04-16 12:12 ` [fuse-devel] Debugging a stale kernel cache during file growth Miklos Szeredi
2026-04-16 12:41 ` Nikolay Amiantov
2026-04-16 12:49 ` Nikolay Amiantov
2026-04-16 23:19 ` Matthew Wilcox
2026-04-17 6:24 ` Nikolay Amiantov
2026-04-16 22:54 ` Matthew Wilcox
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox