linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Joanne Koong <joannelkoong@gmail.com>
To: Jan Kara <jack@suse.cz>
Cc: akpm@linux-foundation.org, david@redhat.com, miklos@szeredi.hu,
	 linux-mm@kvack.org, athul.krishna.kr@protonmail.com,
	j.neuschaefer@gmx.net,  carnil@debian.org,
	linux-fsdevel@vger.kernel.org, stable@vger.kernel.org
Subject: Re: [PATCH v2 1/1] fs/writeback: skip AS_NO_DATA_INTEGRITY mappings in wait_sb_inodes()
Date: Tue, 6 Jan 2026 15:30:05 -0800	[thread overview]
Message-ID: <CAJnrk1aYpcDpm8MpN5Emb8qNOn34-qEiARLH0RudySKFtEZVpA@mail.gmail.com> (raw)
In-Reply-To: <ypyumqgv5p7dnxmq34q33keb6kzqnp66r33gtbm4pglgdmhma6@3oleltql2qgp>

On Tue, Jan 6, 2026 at 1:34 AM Jan Kara <jack@suse.cz> wrote:
>
Hi Jan,

> [Thanks to Andrew for CCing me on patch commit]

Sorry, I didn't mean to exclude you. I hadn't realized the
fs-writeback.c file had maintainers/reviewers listed for it. I'll make
sure to cc you next time.

>
> On Sun 14-12-25 19:00:43, Joanne Koong wrote:
> > Skip waiting on writeback for inodes that belong to mappings that do not
> > have data integrity guarantees (denoted by the AS_NO_DATA_INTEGRITY
> > mapping flag).
> >
> > This restores fuse back to prior behavior where syncs are no-ops. This
> > is needed because otherwise, if a system is running a faulty fuse
> > server that does not reply to issued write requests, this will cause
> > wait_sb_inodes() to wait forever.
> >
> > Fixes: 0c58a97f919c ("fuse: remove tmp folio for writebacks and internal rb tree")
> > Reported-by: Athul Krishna <athul.krishna.kr@protonmail.com>
> > Reported-by: J. Neuschäfer <j.neuschaefer@gmx.net>
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>
> OK, but the difference 0c58a97f919c introduced goes much further than just
> wait_sb_inodes(). Before 0c58a97f919c also filemap_fdatawait() (and all the
> other variants waiting for folio_writeback() to clear) returned immediately
> because folio writeback was done as soon as we've copied the content into
> the temporary page. Now they will block waiting for the server to finish
> the IO. So e.g. fsync() will block waiting for the server in
> file_write_and_wait_range() now, instead of blocking in fuse_fsync_common()
> -> fuse_simple_request(). Similarly e.g. truncate(2) will now block waiting
> for the server so that folio_writeback can be cleared.
>
> So I understand your patch fixes the regression with suspend blocking but I
> don't have a high confidence we are not just starting a whack-a-mole game
> catching all the places that previously hiddenly depended on
> folio_writeback getting cleared without any involvement of untrusted fuse
> server and now this changed. So do we have some higher-level idea what is /
> is not guaranteed with stuck fuse server?

The implications of 0c58a97f919c (eg clearing folio writeback only
when the server has completed writeback instead of clearing writeback
and returning immediately) had some analysis and discussion in this
prior thread [1]. Copying/pasting a snippet from the cover letter:

"With removing the temp page, writeback state is now only cleared on the dirty
page after the server has written it back to disk. This may take an
indeterminate amount of time. As well, there is also the possibility of
malicious or well-intentioned but buggy servers where writeback may in the
worst case scenario, never complete. This means that any
folio_wait_writeback() on a dirty page belonging to a FUSE filesystem needs to
be carefully audited.

In particular, these are the cases that need to be accounted for:
* potentially deadlocking in reclaim, as mentioned above
* potentially stalling sync(2)
* potentially stalling page migration / compaction

This patchset adds a new mapping flag, AS_WRITEBACK_INDETERMINATE, which
filesystems may set on its inode mappings to indicate that writeback
operations may take an indeterminate amount of time to complete. FUSE will set
this flag on its mappings. This patchset adds checks to the critical parts of
reclaim, sync, and page migration logic where writeback may be waited on.

Please note the following:
* For sync(2), waiting on writeback will be skipped for FUSE, but this has no
  effect on existing behavior. Dirty FUSE pages are already not guaranteed to
  be written to disk by the time sync(2) returns (eg writeback is cleared on
  the dirty page but the server may not have written out the temp page to disk
  yet). If the caller wishes to ensure the data has actually been synced to
  disk, they should use fsync(2)/fdatasync(2) instead.
* AS_WRITEBACK_INDETERMINATE does not indicate that the folios should never be
  waited on when in writeback. There are some cases where the wait is
  desirable. For example, for the sync_file_range() syscall, it is fine to
  wait on the writeback since the caller passes in a fd for the operation."

That was from v6 of the patchset and some things were changed between
that and the final version landed in v8 [2] (most notably, changing
AS_WRITEBACK_INDETERMINATE to AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM and
dropping the sync + page migration skips), but I think that analysis
of what cases need to be accounted for / audited remains the same. I
don't think there are any places beyond those 3 listed above that have
a core intrinsic dependency on folio writeback being cleared cleanly
(eg without any involvement of an untrusted fuse server).

For the fsync() and truncate() examples you mentioned, I don't think
it's an issue that these now wait for the server to finish the I/O and
hang if the server doesn't. I think it's actually more correct
behavior than what we had with temp pages, eg imo these actually ought
to wait for the writeback to have been completed by the server. If the
server is malicious / buggy and fsync/truncate hangs, I think that's
fine given that fsync/truncate is initiated by the user on a specific
file descriptor (as opposed to the generic sync()) (and imo it should
hang if it can't actually be executed correctly because the server is
malfunctioning).

As for why this sync user regression has surfaced and now needs to be
addressed, I don't think it's because there's a whack-a-mole game
where we're ad-hoc having to patch up places we didn't realize could
be broken by folio writeback potentially hanging. The original
patchset [1] contained patches that addressed the sync and compaction
case (eg maintaining the original behavior that the temp pages had),
so I don't think this is something that was missed. These patches were
dropped because in the discussion in [1], they seemed pointless to
mitigate / guard against when there already exists other ways
migration/sync could be stalled by a malicious/buggy fuse server. What
I missed was that it's more common than I had thought for
well-intentioned servers to not correctly implement writeback
handling, and that even if it's userspace's "fault", it's still
considered a kernel regression if buggy code previously sufficed but
now doesn't.

Thanks,
Joanne

[1] https://lore.kernel.org/linux-fsdevel/20241122232359.429647-1-joannelkoong@gmail.com/T/#u
[2] https://lore.kernel.org/linux-fsdevel/CAJfpegveOFoL-XzDKQZZ4U6UF_AetNwTUDbfmf7rdJasRFm3xA@mail.gmail.com/T/#m56255519bf9af421ae07014208ccd68a96e72d52

>
>                                                                 Honza
>
> > ---
> >  fs/fs-writeback.c       |  3 ++-
> >  fs/fuse/file.c          |  4 +++-
> >  include/linux/pagemap.h | 11 +++++++++++
> >  3 files changed, 16 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 6800886c4d10..ab2e279ed3c2 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -2751,7 +2751,8 @@ static void wait_sb_inodes(struct super_block *sb)
> >                * do not have the mapping lock. Skip it here, wb completion
> >                * will remove it.
> >                */
> > -             if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
> > +             if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK) ||
> > +                 mapping_no_data_integrity(mapping))
> >                       continue;
> >
> >               spin_unlock_irq(&sb->s_inode_wblist_lock);
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index 01bc894e9c2b..3b2a171e652f 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -3200,8 +3200,10 @@ void fuse_init_file_inode(struct inode *inode, unsigned int flags)
> >
> >       inode->i_fop = &fuse_file_operations;
> >       inode->i_data.a_ops = &fuse_file_aops;
> > -     if (fc->writeback_cache)
> > +     if (fc->writeback_cache) {
> >               mapping_set_writeback_may_deadlock_on_reclaim(&inode->i_data);
> > +             mapping_set_no_data_integrity(&inode->i_data);
> > +     }
> >
> >       INIT_LIST_HEAD(&fi->write_files);
> >       INIT_LIST_HEAD(&fi->queued_writes);
> > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > index 31a848485ad9..ec442af3f886 100644
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -210,6 +210,7 @@ enum mapping_flags {
> >       AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM = 9,
> >       AS_KERNEL_FILE = 10,    /* mapping for a fake kernel file that shouldn't
> >                                  account usage to user cgroups */
> > +     AS_NO_DATA_INTEGRITY = 11, /* no data integrity guarantees */
> >       /* Bits 16-25 are used for FOLIO_ORDER */
> >       AS_FOLIO_ORDER_BITS = 5,
> >       AS_FOLIO_ORDER_MIN = 16,
> > @@ -345,6 +346,16 @@ static inline bool mapping_writeback_may_deadlock_on_reclaim(const struct addres
> >       return test_bit(AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM, &mapping->flags);
> >  }
> >
> > +static inline void mapping_set_no_data_integrity(struct address_space *mapping)
> > +{
> > +     set_bit(AS_NO_DATA_INTEGRITY, &mapping->flags);
> > +}
> > +
> > +static inline bool mapping_no_data_integrity(const struct address_space *mapping)
> > +{
> > +     return test_bit(AS_NO_DATA_INTEGRITY, &mapping->flags);
> > +}
> > +
> >  static inline gfp_t mapping_gfp_mask(const struct address_space *mapping)
> >  {
> >       return mapping->gfp_mask;
> > --
> > 2.47.3
> >
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


  parent reply	other threads:[~2026-01-06 23:30 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-15  3:00 [PATCH v2 0/1] " Joanne Koong
2025-12-15  3:00 ` [PATCH v2 1/1] " Joanne Koong
2025-12-15 17:09   ` Bernd Schubert
2025-12-16  7:07     ` Joanne Koong
2025-12-16 18:13   ` J. Neuschäfer
2026-01-02 17:42   ` Joanne Koong
2026-01-03 18:03   ` Andrew Morton
2026-01-04 18:54     ` David Hildenbrand (Red Hat)
2026-01-05 19:55       ` Joanne Koong
2026-01-06  9:33   ` Jan Kara
2026-01-06 10:05     ` David Hildenbrand (Red Hat)
2026-01-06 13:13       ` Miklos Szeredi
2026-01-06 13:55         ` Jan Kara
2026-01-06 14:33         ` David Hildenbrand (Red Hat)
2026-01-06 15:21           ` Miklos Szeredi
2026-01-06 15:41             ` David Hildenbrand (Red Hat)
2026-01-06 16:05               ` Miklos Szeredi
2026-01-06 17:54                 ` David Hildenbrand (Red Hat)
2026-01-06 23:30     ` Joanne Koong [this message]
2026-01-07 10:12       ` Jan Kara
2026-01-07 23:20         ` Joanne Koong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAJnrk1aYpcDpm8MpN5Emb8qNOn34-qEiARLH0RudySKFtEZVpA@mail.gmail.com \
    --to=joannelkoong@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=athul.krishna.kr@protonmail.com \
    --cc=carnil@debian.org \
    --cc=david@redhat.com \
    --cc=j.neuschaefer@gmx.net \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=miklos@szeredi.hu \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox