From: Jingbo Xu <jefflexu@linux.alibaba.com>
To: Joanne Koong <joannelkoong@gmail.com>
Cc: miklos@szeredi.hu, akpm@linux-foundation.org,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
shakeel.butt@linux.dev, david@redhat.com,
bernd.schubert@fastmail.fm, ziy@nvidia.com, jlayton@kernel.org,
kernel-team@meta.com, Miklos Szeredi <mszeredi@redhat.com>
Subject: Re: [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree
Date: Tue, 15 Apr 2025 15:49:06 +0800 [thread overview]
Message-ID: <a738a765-a5e1-44bc-b1cd-e1a42d73e255@linux.alibaba.com> (raw)
In-Reply-To: <CAJnrk1a_YL-Dg4HeVWnmwUVH2tCN2MYu30kiV5KSv4mkezWOZg@mail.gmail.com>
Hi Joanne,
Sorry for the late reply...
On 4/11/25 12:11 AM, Joanne Koong wrote:
> On Thu, Apr 10, 2025 at 8:11 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>
>> On 4/10/25 11:07 PM, Joanne Koong wrote:
>>> On Wed, Apr 9, 2025 at 7:12 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>
>>>>
>>>>
>>>> On 4/10/25 7:47 AM, Joanne Koong wrote:
>>>>> On Tue, Apr 8, 2025 at 7:43 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>>
>>>>>> Hi Joanne,
>>>>>>
>>>>>> On 4/5/25 2:14 AM, Joanne Koong wrote:
>>>>>>> In the current FUSE writeback design (see commit 3be5a52b30aa
>>>>>>> ("fuse: support writable mmap")), a temp page is allocated for every
>>>>>>> dirty page to be written back, the contents of the dirty page are copied over
>>>>>>> to the temp page, and the temp page gets handed to the server to write back.
>>>>>>>
>>>>>>> This is done so that writeback may be immediately cleared on the dirty page,
>>>>>>> and this in turn is done in order to mitigate the following deadlock scenario
>>>>>>> that may arise if reclaim waits on writeback on the dirty page to complete:
>>>>>>> * single-threaded FUSE server is in the middle of handling a request
>>>>>>> that needs a memory allocation
>>>>>>> * memory allocation triggers direct reclaim
>>>>>>> * direct reclaim waits on a folio under writeback
>>>>>>> * the FUSE server can't write back the folio since it's stuck in
>>>>>>> direct reclaim
>>>>>>>
>>>>>>> With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates
>>>>>>> the situations described above, FUSE writeback does not need to use
>>>>>>> temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings.
>>>>>>>
>>>>>>> This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings
>>>>>>> and removes the temporary pages + extra copying and the internal rb
>>>>>>> tree.
>>>>>>>
>>>>>>> fio benchmarks --
>>>>>>> (using averages observed from 10 runs, throwing away outliers)
>>>>>>>
>>>>>>> Setup:
>>>>>>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
>>>>>>> ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
>>>>>>>
>>>>>>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
>>>>>>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
>>>>>>>
>>>>>>> bs = 1k 4k 1M
>>>>>>> Before 351 MiB/s 1818 MiB/s 1851 MiB/s
>>>>>>> After 341 MiB/s 2246 MiB/s 2685 MiB/s
>>>>>>> % diff -3% 23% 45%
>>>>>>>
>>>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>>>>>> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
>>>>>>> Acked-by: Miklos Szeredi <mszeredi@redhat.com>
>>>>>>
>>>>>
>>>>> Hi Jingbo,
>>>>>
>>>>> Thanks for sharing your analysis for this.
>>>>>
>>>>>> Overall this patch LGTM.
>>>>>>
>>>>>> Apart from that, IMO the fi->writectr and fi->queued_writes mechanism is
>>>>>> also unneeded then, at least the DIRECT IO routine (i.e.
>>>>>
>>>>> I took a look at fi->writectr and fi->queued_writes and my
>>>>> understanding is that we do still need this. For example, for
>>>>> truncates (I'm looking at fuse_do_setattr()), I think we still need to
>>>>> prevent concurrent writeback or else the setattr request and the
>>>>> writeback request could race which would result in a mismatch between
>>>>> the file's reported size and the actual data written to disk.
>>>>
>>>> I haven't looked into the truncate routine yet. I will see it later.
>>>>
>>>>>
>>>>>> fuse_direct_io()) doesn't need fuse_sync_writes() anymore. That is
>>>>>> because after removing the temp page, the DIRECT IO routine has already
>>>>>> been waiting for all inflight WRITE requests, see
>>>>>>
>>>>>> # DIRECT read
>>>>>> generic_file_read_iter
>>>>>> kiocb_write_and_wait
>>>>>> filemap_write_and_wait_range
>>>>>
>>>>> Where do you see generic_file_read_iter() getting called for direct io reads?
>>>>
>>>> # DIRECT read
>>>> fuse_file_read_iter
>>>> fuse_cache_read_iter
>>>> generic_file_read_iter
>>>> kiocb_write_and_wait
>>>> filemap_write_and_wait_range
>>>> a_ops->direct_IO(),i.e. fuse_direct_IO()
>>>>
>>>
>>> Oh I see, I thought files opened with O_DIRECT automatically call the
>>> .direct_IO handler for reads/writes but you're right, it first goes
>>> through .read_iter / .write_iter handlers, and the .direct_IO handler
>>> only gets invoked through generic_file_read_iter() /
>>> generic_file_direct_write() in mm/filemap.c
>>>
>>> There's two paths for direct io in FUSE:
>>> a) fuse server sets fi->direct_io = true when a file is opened, which
>>> will set the FOPEN_DIRECT_IO bit in ff->open_flags on the kernel side
>>> b) fuse server doesn't set fi->direct_io = true, but the client opens
>>> the file with O_DIRECT
>>>
>>> We only go through the stack trace you listed above for the b) case.
>>> For the a) case, we'll hit
>>>
>>> if (ff->open_flags & FOPEN_DIRECT_IO)
>>> return fuse_direct_read_iter(iocb, to);
>>>
>>> and
>>>
>>> if (ff->open_flags & FOPEN_DIRECT_IO)
>>> return fuse_direct_write_iter(iocb, from);
>>>
>>> which will invoke fuse_direct_IO() / fuse_direct_io() without going
>>> through the kiocb_write_and_wait() -> filemap_write_and_wait_range() /
>>> kiocb_invalidate_pages() -> filemap_write_and_wait_range() you listed
>>> above.
>>>
>>> So for the a) case I think we'd still need the fuse_sync_writes() in
>>> case there's still pending writeback.
>>>
>>> Do you agree with this analysis or am I missing something here?
>>
>> Yeah, that's true. But instead of calling fuse_sync_writes(), we can
>> call filemap_wait_range() or something similar here.
>>
>
> Agreed. Actually, the more I look at this, the more I think we can
> replace all fuse_sync_writes() and get rid of it entirely.
I have seen your latest reply that this cleaning up won't be included in
this series, which is okay.
> fuse_sync_writes() is called in:
>
> fuse_fsync():
> /*
> * Start writeback against all dirty pages of the inode, then
> * wait for all outstanding writes, before sending the FSYNC
> * request.
> */
> err = file_write_and_wait_range(file, start, end);
> if (err)
> goto out;
>
> fuse_sync_writes(inode);
>
> /*
> * Due to implementation of fuse writeback
> * file_write_and_wait_range() does not catch errors.
> * We have to do this directly after fuse_sync_writes()
> */
> err = file_check_and_advance_wb_err(file);
> if (err)
> goto out;
>
>
> We can get rid of the fuse_sync_writes() and
> file_check_and_advance_wb_err() entirely since now without temp pages,
> the file_write_and_wait_range() call actually ensures that writeback
> is completed
>
>
>
> fuse_writeback_range():
> static int fuse_writeback_range(struct inode *inode, loff_t
> start, loff_t end)
> {
> int err =
> filemap_write_and_wait_range(inode->i_mapping, start, LLONG_MAX);
>
> if (!err)
> fuse_sync_writes(inode);
>
> return err;
> }
>
>
> We can replace fuse_writeback_range() entirely with
> filemap_write_and_wait_range().
>
>
>
> fuse_direct_io():
> if (fopen_direct_io && fc->direct_io_allow_mmap) {
> res = filemap_write_and_wait_range(mapping, pos, pos +
> count - 1);
> if (res) {
> fuse_io_free(ia);
> return res;
> }
> }
> if (!cuse && filemap_range_has_writeback(mapping, pos, (pos +
> count - 1))) {
> if (!write)
> inode_lock(inode);
> fuse_sync_writes(inode);
> if (!write)
> inode_unlock(inode);
> }
>
>
> I think this can just replaced with
> if (fopen_direct_io && (fc->direct_io_allow_mmap || !cuse)) {
> res = filemap_write_and_wait_range(mapping,
> pos, pos + count - 1);
> if (res) {
> fuse_io_free(ia);
> return res;
> }
> }
Alright. But I would prefer doing this filemap_write_and_wait_range() in
fuse_direct_write_iter() rather than fuse_direct_io() if possible.
> since for the !fopen_direct_io case, it will already go through
> filemap_write_and_wait_range(), as you mentioned in your previous
> message. I think this also fixes a bug (?) in the original code - in
> the fopen_direct_io && !fc->direct_io_allow_mmap case, I think we
> still need to write out dirty pages first, which we don't currently
> do.
Nope. In case of fopen_direct_io && !fc->direct_io_allow_mmap, there
won't be any page cache at all, right?
--
Thanks,
Jingbo
next prev parent reply other threads:[~2025-04-15 7:49 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-04 18:14 [PATCH v7 0/3] fuse: remove temp page copies in writeback Joanne Koong
2025-04-04 18:14 ` [PATCH v7 1/3] mm: add AS_WRITEBACK_INDETERMINATE mapping flag Joanne Koong
2025-04-04 19:13 ` David Hildenbrand
2025-04-04 20:09 ` Joanne Koong
2025-04-04 20:13 ` David Hildenbrand
2025-04-09 22:05 ` Shakeel Butt
2025-04-09 23:48 ` Joanne Koong
2025-04-04 18:14 ` [PATCH v7 2/3] mm: skip reclaiming folios in legacy memcg writeback indeterminate contexts Joanne Koong
2025-04-04 18:14 ` [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong
2025-04-09 2:43 ` Jingbo Xu
2025-04-09 23:47 ` Joanne Koong
2025-04-10 2:12 ` Jingbo Xu
2025-04-10 15:07 ` Joanne Koong
2025-04-10 15:11 ` Jingbo Xu
2025-04-10 16:11 ` Joanne Koong
2025-04-14 20:24 ` Joanne Koong
2025-04-15 7:49 ` Jingbo Xu [this message]
2025-04-15 15:59 ` Joanne Koong
2025-04-16 1:40 ` Jingbo Xu
2025-04-16 16:43 ` Joanne Koong
2025-04-16 18:05 ` Bernd Schubert
2025-04-14 16:21 ` [PATCH v7 0/3] fuse: remove temp page copies in writeback Jeff Layton
2025-04-14 20:28 ` Joanne Koong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=a738a765-a5e1-44bc-b1cd-e1a42d73e255@linux.alibaba.com \
--to=jefflexu@linux.alibaba.com \
--cc=akpm@linux-foundation.org \
--cc=bernd.schubert@fastmail.fm \
--cc=david@redhat.com \
--cc=jlayton@kernel.org \
--cc=joannelkoong@gmail.com \
--cc=kernel-team@meta.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=miklos@szeredi.hu \
--cc=mszeredi@redhat.com \
--cc=shakeel.butt@linux.dev \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox