linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Jingbo Xu <jefflexu@linux.alibaba.com>
To: Joanne Koong <joannelkoong@gmail.com>
Cc: miklos@szeredi.hu, akpm@linux-foundation.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	shakeel.butt@linux.dev, david@redhat.com,
	bernd.schubert@fastmail.fm, ziy@nvidia.com, jlayton@kernel.org,
	kernel-team@meta.com, Miklos Szeredi <mszeredi@redhat.com>
Subject: Re: [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree
Date: Thu, 10 Apr 2025 10:12:32 +0800	[thread overview]
Message-ID: <7e9b1a40-4708-42a8-b8fc-44fa50227e5b@linux.alibaba.com> (raw)
In-Reply-To: <CAJnrk1bTGFXy+ZTchC7p4OYUnbfKZ7TtVkCsrsv87Mg1r8KkGA@mail.gmail.com>



On 4/10/25 7:47 AM, Joanne Koong wrote:
>   On Tue, Apr 8, 2025 at 7:43 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>
>> Hi Joanne,
>>
>> On 4/5/25 2:14 AM, Joanne Koong wrote:
>>> In the current FUSE writeback design (see commit 3be5a52b30aa
>>> ("fuse: support writable mmap")), a temp page is allocated for every
>>> dirty page to be written back, the contents of the dirty page are copied over
>>> to the temp page, and the temp page gets handed to the server to write back.
>>>
>>> This is done so that writeback may be immediately cleared on the dirty page,
>>> and this in turn is done in order to mitigate the following deadlock scenario
>>> that may arise if reclaim waits on writeback on the dirty page to complete:
>>> * single-threaded FUSE server is in the middle of handling a request
>>>   that needs a memory allocation
>>> * memory allocation triggers direct reclaim
>>> * direct reclaim waits on a folio under writeback
>>> * the FUSE server can't write back the folio since it's stuck in
>>>   direct reclaim
>>>
>>> With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates
>>> the situations described above, FUSE writeback does not need to use
>>> temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings.
>>>
>>> This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings
>>> and removes the temporary pages + extra copying and the internal rb
>>> tree.
>>>
>>> fio benchmarks --
>>> (using averages observed from 10 runs, throwing away outliers)
>>>
>>> Setup:
>>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
>>>  ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
>>>
>>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
>>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
>>>
>>>         bs =  1k          4k            1M
>>> Before  351 MiB/s     1818 MiB/s     1851 MiB/s
>>> After   341 MiB/s     2246 MiB/s     2685 MiB/s
>>> % diff        -3%          23%         45%
>>>
>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
>>> Acked-by: Miklos Szeredi <mszeredi@redhat.com>
>>
> 
> Hi Jingbo,
> 
> Thanks for sharing your analysis for this.
> 
>> Overall this patch LGTM.
>>
>> Apart from that, IMO the fi->writectr and fi->queued_writes mechanism is
>> also unneeded then, at least the DIRECT IO routine (i.e.
> 
> I took a look at fi->writectr and fi->queued_writes and my
> understanding is that we do still need this. For example, for
> truncates (I'm looking at fuse_do_setattr()), I think we still need to
> prevent concurrent writeback or else the setattr request and the
> writeback request could race which would result in a mismatch between
> the file's reported size and the actual data written to disk.

I haven't looked into the truncate routine yet.  I will see it later.

> 
>> fuse_direct_io()) doesn't need fuse_sync_writes() anymore.  That is
>> because after removing the temp page, the DIRECT IO routine has already
>> been waiting for all inflight WRITE requests, see
>>
>> # DIRECT read
>> generic_file_read_iter
>>   kiocb_write_and_wait
>>     filemap_write_and_wait_range
> 
> Where do you see generic_file_read_iter() getting called for direct io reads?

# DIRECT read
fuse_file_read_iter
  fuse_cache_read_iter
    generic_file_read_iter
      kiocb_write_and_wait
       filemap_write_and_wait_range
      a_ops->direct_IO(),i.e. fuse_direct_IO()


> Similarly, where do you see generic_file_write_iter() getting called
> for direct io writes?

# DIRECT read
fuse_file_write_iter
  fuse_cache_write_iter
    generic_file_write_iter
      generic_file_direct_write
        kiocb_invalidate_pages
         filemap_invalidate_pages
           filemap_write_and_wait_range
      a_ops->direct_IO(),i.e. fuse_direct_IO()


> Where do you see fi->writectr / fi->queued-writes preventing this
> race?

IMO overall fi->writectr / fi->queued-writes are introduced to prevent
DIRECT IO and writeback from sending duplicate (inflight) WRITE requests
for the same page.

For the DIRECT write routine:

# non-FOPEN_DIRECT_IO DIRECT write
fuse_cache_write_iter
  fuse_direct_IO
    fuse_direct_io
      fuse_sync_writes


# FOPEN_DIRECT_IO DIRECT write
fuse_direct_write_iter
  fuse_direct_IO
    fuse_direct_io
      fuse_sync_writes


For the writeback routine:
fuse_writepages()
  fuse_writepages_fill
    fuse_writepages_send
      # buffer the WRITE request in queued_writes list
      fuse_flush_writepages
	# flush WRITE only when fi->writectr >= 0
	


> It looks to me like in the existing code, this race condition
> you described of direct write invalidating the page cache, then
> another buffer write reads the page cache and dirties it, then
> writeback is called on that, and the 2 write requests racing, could
> still happen?
> 
> 
>> However it seems that the writeback
>> won't wait for previous inflight DIRECT WRITE requests, so I'm not much
>> sure about that.  Maybe other folks could offer more insights...
> 
> My understanding is that these lines
> 
> if (!cuse && filemap_range_has_writeback(...)) {
>    ...
>    fuse_sync_writes(inode);
>    ...
> }
> 
> in fuse_direct_io() is what waits on previous inflight direct write
> requests to complete before the direct io happens.

Right.

> 
> 
>>
>> Also fuse_sync_writes() is not needed in fuse_flush() anymore, with
>> which I'm pretty sure.
> 
> Why don't we still need this for fuse_flush()?
> 
> If a caller calls close(), this will call
> 
> filp_close()
>   filp_flush()
>       filp->f_op->flush()
>           fuse_flush()
> 
> it seems like we should still be waiting for all writebacks to finish
> before sending the fuse server the fuse_flush request, no?
> 

filp_close()
   filp_flush()
       filp->f_op->flush()
           fuse_flush()
 	     write_inode_now
		writeback_single_inode(WB_SYNC_ALL)
		  do_writepages
		    # flush dirty page
		  filemap_fdatawait
		    # wait for WRITE completion

>>
>>> ---
>>>  fs/fuse/file.c   | 360 ++++-------------------------------------------
>>>  fs/fuse/fuse_i.h |   3 -
>>>  2 files changed, 28 insertions(+), 335 deletions(-)
>>>
>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>>> index 754378dd9f71..91ada0208863 100644
>>> --- a/fs/fuse/file.c
>>> +++ b/fs/fuse/file.c
>>> @@ -415,89 +415,11 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id)
>>>
>>>  struct fuse_writepage_args {
>>>       struct fuse_io_args ia;
>>> -     struct rb_node writepages_entry;
>>>       struct list_head queue_entry;
>>> -     struct fuse_writepage_args *next;
>>>       struct inode *inode;
>>>       struct fuse_sync_bucket *bucket;
>>>  };
>>>
>>> -static struct fuse_writepage_args *fuse_find_writeback(struct fuse_inode *fi,
>>> -                                         pgoff_t idx_from, pgoff_t idx_to)
>>> -{
>>> -     struct rb_node *n;
>>> -
>>> -     n = fi->writepages.rb_node;
>>> -
>>> -     while (n) {
>>> -             struct fuse_writepage_args *wpa;
>>> -             pgoff_t curr_index;
>>> -
>>
>> --
>> Thanks,
>> Jingbo

-- 
Thanks,
Jingbo


  reply	other threads:[~2025-04-10  2:12 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-04 18:14 [PATCH v7 0/3] fuse: remove temp page copies in writeback Joanne Koong
2025-04-04 18:14 ` [PATCH v7 1/3] mm: add AS_WRITEBACK_INDETERMINATE mapping flag Joanne Koong
2025-04-04 19:13   ` David Hildenbrand
2025-04-04 20:09     ` Joanne Koong
2025-04-04 20:13       ` David Hildenbrand
2025-04-09 22:05         ` Shakeel Butt
2025-04-09 23:48           ` Joanne Koong
2025-04-04 18:14 ` [PATCH v7 2/3] mm: skip reclaiming folios in legacy memcg writeback indeterminate contexts Joanne Koong
2025-04-04 18:14 ` [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong
2025-04-09  2:43   ` Jingbo Xu
2025-04-09 23:47     ` Joanne Koong
2025-04-10  2:12       ` Jingbo Xu [this message]
2025-04-10 15:07         ` Joanne Koong
2025-04-10 15:11           ` Jingbo Xu
2025-04-10 16:11             ` Joanne Koong
2025-04-14 20:24               ` Joanne Koong
2025-04-15  7:49               ` Jingbo Xu
2025-04-15 15:59                 ` Joanne Koong
2025-04-16  1:40                   ` Jingbo Xu
2025-04-16 16:43                     ` Joanne Koong
2025-04-16 18:05                       ` Bernd Schubert
2025-04-14 16:21 ` [PATCH v7 0/3] fuse: remove temp page copies in writeback Jeff Layton
2025-04-14 20:28   ` Joanne Koong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7e9b1a40-4708-42a8-b8fc-44fa50227e5b@linux.alibaba.com \
    --to=jefflexu@linux.alibaba.com \
    --cc=akpm@linux-foundation.org \
    --cc=bernd.schubert@fastmail.fm \
    --cc=david@redhat.com \
    --cc=jlayton@kernel.org \
    --cc=joannelkoong@gmail.com \
    --cc=kernel-team@meta.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=miklos@szeredi.hu \
    --cc=mszeredi@redhat.com \
    --cc=shakeel.butt@linux.dev \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox