From: Joanne Koong <joannelkoong@gmail.com>
To: Jingbo Xu <jefflexu@linux.alibaba.com>
Cc: miklos@szeredi.hu, akpm@linux-foundation.org,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
shakeel.butt@linux.dev, david@redhat.com,
bernd.schubert@fastmail.fm, ziy@nvidia.com, jlayton@kernel.org,
kernel-team@meta.com, Miklos Szeredi <mszeredi@redhat.com>
Subject: Re: [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree
Date: Thu, 10 Apr 2025 09:11:28 -0700 [thread overview]
Message-ID: <CAJnrk1a_YL-Dg4HeVWnmwUVH2tCN2MYu30kiV5KSv4mkezWOZg@mail.gmail.com> (raw)
In-Reply-To: <9a3cfb55-faae-4551-9bef-b9650432848a@linux.alibaba.com>
On Thu, Apr 10, 2025 at 8:11 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
> On 4/10/25 11:07 PM, Joanne Koong wrote:
> > On Wed, Apr 9, 2025 at 7:12 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >>
> >>
> >>
> >> On 4/10/25 7:47 AM, Joanne Koong wrote:
> >>> On Tue, Apr 8, 2025 at 7:43 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >>>>
> >>>> Hi Joanne,
> >>>>
> >>>> On 4/5/25 2:14 AM, Joanne Koong wrote:
> >>>>> In the current FUSE writeback design (see commit 3be5a52b30aa
> >>>>> ("fuse: support writable mmap")), a temp page is allocated for every
> >>>>> dirty page to be written back, the contents of the dirty page are copied over
> >>>>> to the temp page, and the temp page gets handed to the server to write back.
> >>>>>
> >>>>> This is done so that writeback may be immediately cleared on the dirty page,
> >>>>> and this in turn is done in order to mitigate the following deadlock scenario
> >>>>> that may arise if reclaim waits on writeback on the dirty page to complete:
> >>>>> * single-threaded FUSE server is in the middle of handling a request
> >>>>> that needs a memory allocation
> >>>>> * memory allocation triggers direct reclaim
> >>>>> * direct reclaim waits on a folio under writeback
> >>>>> * the FUSE server can't write back the folio since it's stuck in
> >>>>> direct reclaim
> >>>>>
> >>>>> With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates
> >>>>> the situations described above, FUSE writeback does not need to use
> >>>>> temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings.
> >>>>>
> >>>>> This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings
> >>>>> and removes the temporary pages + extra copying and the internal rb
> >>>>> tree.
> >>>>>
> >>>>> fio benchmarks --
> >>>>> (using averages observed from 10 runs, throwing away outliers)
> >>>>>
> >>>>> Setup:
> >>>>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
> >>>>> ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
> >>>>>
> >>>>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> >>>>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
> >>>>>
> >>>>> bs = 1k 4k 1M
> >>>>> Before 351 MiB/s 1818 MiB/s 1851 MiB/s
> >>>>> After 341 MiB/s 2246 MiB/s 2685 MiB/s
> >>>>> % diff -3% 23% 45%
> >>>>>
> >>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> >>>>> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
> >>>>> Acked-by: Miklos Szeredi <mszeredi@redhat.com>
> >>>>
> >>>
> >>> Hi Jingbo,
> >>>
> >>> Thanks for sharing your analysis for this.
> >>>
> >>>> Overall this patch LGTM.
> >>>>
> >>>> Apart from that, IMO the fi->writectr and fi->queued_writes mechanism is
> >>>> also unneeded then, at least the DIRECT IO routine (i.e.
> >>>
> >>> I took a look at fi->writectr and fi->queued_writes and my
> >>> understanding is that we do still need this. For example, for
> >>> truncates (I'm looking at fuse_do_setattr()), I think we still need to
> >>> prevent concurrent writeback or else the setattr request and the
> >>> writeback request could race which would result in a mismatch between
> >>> the file's reported size and the actual data written to disk.
> >>
> >> I haven't looked into the truncate routine yet. I will see it later.
> >>
> >>>
> >>>> fuse_direct_io()) doesn't need fuse_sync_writes() anymore. That is
> >>>> because after removing the temp page, the DIRECT IO routine has already
> >>>> been waiting for all inflight WRITE requests, see
> >>>>
> >>>> # DIRECT read
> >>>> generic_file_read_iter
> >>>> kiocb_write_and_wait
> >>>> filemap_write_and_wait_range
> >>>
> >>> Where do you see generic_file_read_iter() getting called for direct io reads?
> >>
> >> # DIRECT read
> >> fuse_file_read_iter
> >> fuse_cache_read_iter
> >> generic_file_read_iter
> >> kiocb_write_and_wait
> >> filemap_write_and_wait_range
> >> a_ops->direct_IO(),i.e. fuse_direct_IO()
> >>
> >
> > Oh I see, I thought files opened with O_DIRECT automatically call the
> > .direct_IO handler for reads/writes but you're right, it first goes
> > through .read_iter / .write_iter handlers, and the .direct_IO handler
> > only gets invoked through generic_file_read_iter() /
> > generic_file_direct_write() in mm/filemap.c
> >
> > There's two paths for direct io in FUSE:
> > a) fuse server sets fi->direct_io = true when a file is opened, which
> > will set the FOPEN_DIRECT_IO bit in ff->open_flags on the kernel side
> > b) fuse server doesn't set fi->direct_io = true, but the client opens
> > the file with O_DIRECT
> >
> > We only go through the stack trace you listed above for the b) case.
> > For the a) case, we'll hit
> >
> > if (ff->open_flags & FOPEN_DIRECT_IO)
> > return fuse_direct_read_iter(iocb, to);
> >
> > and
> >
> > if (ff->open_flags & FOPEN_DIRECT_IO)
> > return fuse_direct_write_iter(iocb, from);
> >
> > which will invoke fuse_direct_IO() / fuse_direct_io() without going
> > through the kiocb_write_and_wait() -> filemap_write_and_wait_range() /
> > kiocb_invalidate_pages() -> filemap_write_and_wait_range() you listed
> > above.
> >
> > So for the a) case I think we'd still need the fuse_sync_writes() in
> > case there's still pending writeback.
> >
> > Do you agree with this analysis or am I missing something here?
>
> Yeah, that's true. But instead of calling fuse_sync_writes(), we can
> call filemap_wait_range() or something similar here.
>
Agreed. Actually, the more I look at this, the more I think we can
replace all fuse_sync_writes() and get rid of it entirely. Right now,
fuse_sync_writes() is called in:
fuse_fsync():
/*
* Start writeback against all dirty pages of the inode, then
* wait for all outstanding writes, before sending the FSYNC
* request.
*/
err = file_write_and_wait_range(file, start, end);
if (err)
goto out;
fuse_sync_writes(inode);
/*
* Due to implementation of fuse writeback
* file_write_and_wait_range() does not catch errors.
* We have to do this directly after fuse_sync_writes()
*/
err = file_check_and_advance_wb_err(file);
if (err)
goto out;
We can get rid of the fuse_sync_writes() and
file_check_and_advance_wb_err() entirely since now without temp pages,
the file_write_and_wait_range() call actually ensures that writeback
is completed
fuse_writeback_range():
static int fuse_writeback_range(struct inode *inode, loff_t
start, loff_t end)
{
int err =
filemap_write_and_wait_range(inode->i_mapping, start, LLONG_MAX);
if (!err)
fuse_sync_writes(inode);
return err;
}
We can replace fuse_writeback_range() entirely with
filemap_write_and_wait_range().
fuse_direct_io():
if (fopen_direct_io && fc->direct_io_allow_mmap) {
res = filemap_write_and_wait_range(mapping, pos, pos +
count - 1);
if (res) {
fuse_io_free(ia);
return res;
}
}
if (!cuse && filemap_range_has_writeback(mapping, pos, (pos +
count - 1))) {
if (!write)
inode_lock(inode);
fuse_sync_writes(inode);
if (!write)
inode_unlock(inode);
}
I think this can just replaced with
if (fopen_direct_io && (fc->direct_io_allow_mmap || !cuse)) {
res = filemap_write_and_wait_range(mapping,
pos, pos + count - 1);
if (res) {
fuse_io_free(ia);
return res;
}
}
since for the !fopen_direct_io case, it will already go through
filemap_write_and_wait_range(), as you mentioned in your previous
message. I think this also fixes a bug (?) in the original code - in
the fopen_direct_io && !fc->direct_io_allow_mmap case, I think we
still need to write out dirty pages first, which we don't currently
do.
What do you think?
Thanks for all your careful review on this patchset throughout all of
its iterations.
>
>
> >> filp_close()
> >> filp_flush()
> >> filp->f_op->flush()
> >> fuse_flush()
> >> write_inode_now
> >> writeback_single_inode(WB_SYNC_ALL)
> >> do_writepages
> >> # flush dirty page
> >> filemap_fdatawait
> >> # wait for WRITE completion
> >
> > Nice. I missed that write_inode_now() will invoke filemap_fdatawait().
> > This seems pretty straightforward. I'll remove the fuse_sync_writes()
> > call in fuse_flush() when I send out v8.
> >
> > The direct io one above is less straight-forward. I won't add that to
> > v8 but that can be done in a separate future patch when we figure that
> > out.
>
> Thanks for keep working on this. Appreciated.
>
> --
> Thanks,
> Jingbo
next prev parent reply other threads:[~2025-04-10 16:11 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-04 18:14 [PATCH v7 0/3] fuse: remove temp page copies in writeback Joanne Koong
2025-04-04 18:14 ` [PATCH v7 1/3] mm: add AS_WRITEBACK_INDETERMINATE mapping flag Joanne Koong
2025-04-04 19:13 ` David Hildenbrand
2025-04-04 20:09 ` Joanne Koong
2025-04-04 20:13 ` David Hildenbrand
2025-04-09 22:05 ` Shakeel Butt
2025-04-09 23:48 ` Joanne Koong
2025-04-04 18:14 ` [PATCH v7 2/3] mm: skip reclaiming folios in legacy memcg writeback indeterminate contexts Joanne Koong
2025-04-04 18:14 ` [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong
2025-04-09 2:43 ` Jingbo Xu
2025-04-09 23:47 ` Joanne Koong
2025-04-10 2:12 ` Jingbo Xu
2025-04-10 15:07 ` Joanne Koong
2025-04-10 15:11 ` Jingbo Xu
2025-04-10 16:11 ` Joanne Koong [this message]
2025-04-14 20:24 ` Joanne Koong
2025-04-15 7:49 ` Jingbo Xu
2025-04-15 15:59 ` Joanne Koong
2025-04-16 1:40 ` Jingbo Xu
2025-04-16 16:43 ` Joanne Koong
2025-04-16 18:05 ` Bernd Schubert
2025-04-14 16:21 ` [PATCH v7 0/3] fuse: remove temp page copies in writeback Jeff Layton
2025-04-14 20:28 ` Joanne Koong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAJnrk1a_YL-Dg4HeVWnmwUVH2tCN2MYu30kiV5KSv4mkezWOZg@mail.gmail.com \
--to=joannelkoong@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=bernd.schubert@fastmail.fm \
--cc=david@redhat.com \
--cc=jefflexu@linux.alibaba.com \
--cc=jlayton@kernel.org \
--cc=kernel-team@meta.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=miklos@szeredi.hu \
--cc=mszeredi@redhat.com \
--cc=shakeel.butt@linux.dev \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox