linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Bernd Schubert <bernd.schubert@fastmail.fm>
To: Joanne Koong <joannelkoong@gmail.com>,
	Jingbo Xu <jefflexu@linux.alibaba.com>
Cc: miklos@szeredi.hu, akpm@linux-foundation.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	shakeel.butt@linux.dev, david@redhat.com, ziy@nvidia.com,
	jlayton@kernel.org, kernel-team@meta.com,
	Miklos Szeredi <mszeredi@redhat.com>
Subject: Re: [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree
Date: Wed, 16 Apr 2025 20:05:32 +0200	[thread overview]
Message-ID: <70521e78-cca2-40e2-b3d3-da2ef74ca625@fastmail.fm> (raw)
In-Reply-To: <CAJnrk1buLrVuOxX-Q8QBZkqrM7fF_EaoOCmsxM0nyf1HndkYtA@mail.gmail.com>



On 4/16/25 18:43, Joanne Koong wrote:
> On Tue, Apr 15, 2025 at 6:40 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>
>> On 4/15/25 11:59 PM, Joanne Koong wrote:
>>> On Tue, Apr 15, 2025 at 12:49 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>
>>>> Hi Joanne,
>>>>
>>>> Sorry for the late reply...
>>>
>>> Hi Jingbo,
>>>
>>> No worries at all.
>>>>
>>>>
>>>> On 4/11/25 12:11 AM, Joanne Koong wrote:
>>>>> On Thu, Apr 10, 2025 at 8:11 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>>
>>>>>> On 4/10/25 11:07 PM, Joanne Koong wrote:
>>>>>>> On Wed, Apr 9, 2025 at 7:12 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 4/10/25 7:47 AM, Joanne Koong wrote:
>>>>>>>>>    On Tue, Apr 8, 2025 at 7:43 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Joanne,
>>>>>>>>>>
>>>>>>>>>> On 4/5/25 2:14 AM, Joanne Koong wrote:
>>>>>>>>>>> In the current FUSE writeback design (see commit 3be5a52b30aa
>>>>>>>>>>> ("fuse: support writable mmap")), a temp page is allocated for every
>>>>>>>>>>> dirty page to be written back, the contents of the dirty page are copied over
>>>>>>>>>>> to the temp page, and the temp page gets handed to the server to write back.
>>>>>>>>>>>
>>>>>>>>>>> This is done so that writeback may be immediately cleared on the dirty page,
>>>>>>>>>>> and this in turn is done in order to mitigate the following deadlock scenario
>>>>>>>>>>> that may arise if reclaim waits on writeback on the dirty page to complete:
>>>>>>>>>>> * single-threaded FUSE server is in the middle of handling a request
>>>>>>>>>>>    that needs a memory allocation
>>>>>>>>>>> * memory allocation triggers direct reclaim
>>>>>>>>>>> * direct reclaim waits on a folio under writeback
>>>>>>>>>>> * the FUSE server can't write back the folio since it's stuck in
>>>>>>>>>>>    direct reclaim
>>>>>>>>>>>
>>>>>>>>>>> With a recent change that added AS_WRITEBACK_INDETERMINATE and mitigates
>>>>>>>>>>> the situations described above, FUSE writeback does not need to use
>>>>>>>>>>> temp pages if it sets AS_WRITEBACK_INDETERMINATE on its inode mappings.
>>>>>>>>>>>
>>>>>>>>>>> This commit sets AS_WRITEBACK_INDETERMINATE on the inode mappings
>>>>>>>>>>> and removes the temporary pages + extra copying and the internal rb
>>>>>>>>>>> tree.
>>>>>>>>>>>
>>>>>>>>>>> fio benchmarks --
>>>>>>>>>>> (using averages observed from 10 runs, throwing away outliers)
>>>>>>>>>>>
>>>>>>>>>>> Setup:
>>>>>>>>>>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
>>>>>>>>>>>   ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
>>>>>>>>>>>
>>>>>>>>>>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
>>>>>>>>>>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
>>>>>>>>>>>
>>>>>>>>>>>          bs =  1k          4k            1M
>>>>>>>>>>> Before  351 MiB/s     1818 MiB/s     1851 MiB/s
>>>>>>>>>>> After   341 MiB/s     2246 MiB/s     2685 MiB/s
>>>>>>>>>>> % diff        -3%          23%         45%
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>>>>>>>>>> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
>>>>>>>>>>> Acked-by: Miklos Szeredi <mszeredi@redhat.com>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Jingbo,
>>>>>>>>>
>>>>>>>>> Thanks for sharing your analysis for this.
>>>>>>>>>
>>>>>>>>>> Overall this patch LGTM.
>>>>>>>>>>
>>>>>>>>>> Apart from that, IMO the fi->writectr and fi->queued_writes mechanism is
>>>>>>>>>> also unneeded then, at least the DIRECT IO routine (i.e.
>>>>>>>>>
>>>>>>>>> I took a look at fi->writectr and fi->queued_writes and my
>>>>>>>>> understanding is that we do still need this. For example, for
>>>>>>>>> truncates (I'm looking at fuse_do_setattr()), I think we still need to
>>>>>>>>> prevent concurrent writeback or else the setattr request and the
>>>>>>>>> writeback request could race which would result in a mismatch between
>>>>>>>>> the file's reported size and the actual data written to disk.
>>>>>>>>
>>>>>>>> I haven't looked into the truncate routine yet.  I will see it later.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> fuse_direct_io()) doesn't need fuse_sync_writes() anymore.  That is
>>>>>>>>>> because after removing the temp page, the DIRECT IO routine has already
>>>>>>>>>> been waiting for all inflight WRITE requests, see
>>>>>>>>>>
>>>>>>>>>> # DIRECT read
>>>>>>>>>> generic_file_read_iter
>>>>>>>>>>    kiocb_write_and_wait
>>>>>>>>>>      filemap_write_and_wait_range
>>>>>>>>>
>>>>>>>>> Where do you see generic_file_read_iter() getting called for direct io reads?
>>>>>>>>
>>>>>>>> # DIRECT read
>>>>>>>> fuse_file_read_iter
>>>>>>>>    fuse_cache_read_iter
>>>>>>>>      generic_file_read_iter
>>>>>>>>        kiocb_write_and_wait
>>>>>>>>         filemap_write_and_wait_range
>>>>>>>>        a_ops->direct_IO(),i.e. fuse_direct_IO()
>>>>>>>>
>>>>>>>
>>>>>>> Oh I see, I thought files opened with O_DIRECT automatically call the
>>>>>>> .direct_IO handler for reads/writes but you're right, it first goes
>>>>>>> through .read_iter / .write_iter handlers, and the .direct_IO handler
>>>>>>> only gets invoked through generic_file_read_iter() /
>>>>>>> generic_file_direct_write() in mm/filemap.c
>>>>>>>
>>>>>>> There's two paths for direct io in FUSE:
>>>>>>> a) fuse server sets fi->direct_io = true when a file is opened, which
>>>>>>> will set the FOPEN_DIRECT_IO bit in ff->open_flags on the kernel side
>>>>>>> b) fuse server doesn't set fi->direct_io = true, but the client opens
>>>>>>> the file with O_DIRECT
>>>>>>>
>>>>>>> We only go through the stack trace you listed above for the b) case.
>>>>>>> For the a) case, we'll hit
>>>>>>>
>>>>>>>          if (ff->open_flags & FOPEN_DIRECT_IO)
>>>>>>>                  return fuse_direct_read_iter(iocb, to);
>>>>>>>
>>>>>>> and
>>>>>>>
>>>>>>>          if (ff->open_flags & FOPEN_DIRECT_IO)
>>>>>>>                  return fuse_direct_write_iter(iocb, from);
>>>>>>>
>>>>>>> which will invoke fuse_direct_IO() / fuse_direct_io() without going
>>>>>>> through the kiocb_write_and_wait() -> filemap_write_and_wait_range() /
>>>>>>> kiocb_invalidate_pages() -> filemap_write_and_wait_range() you listed
>>>>>>> above.
>>>>>>>
>>>>>>> So for the a) case I think we'd still need the fuse_sync_writes() in
>>>>>>> case there's still pending writeback.
>>>>>>>
>>>>>>> Do you agree with this analysis or am I missing something here?
>>>>>>
>>>>>> Yeah, that's true.  But instead of calling fuse_sync_writes(), we can
>>>>>> call filemap_wait_range() or something similar here.
>>>>>>
>>>>>
>>>>> Agreed. Actually, the more I look at this, the more I think we can
>>>>> replace all fuse_sync_writes() and get rid of it entirely.
>>>>
>>>>
>>>> I have seen your latest reply that this cleaning up won't be included in
>>>> this series, which is okay.
>>>>
>>>>
>>>>> fuse_sync_writes() is called in:
>>>>>
>>>>> fuse_fsync():
>>>>>          /*
>>>>>           * Start writeback against all dirty pages of the inode, then
>>>>>           * wait for all outstanding writes, before sending the FSYNC
>>>>>           * request.
>>>>>           */
>>>>>          err = file_write_and_wait_range(file, start, end);
>>>>>          if (err)
>>>>>                  goto out;
>>>>>
>>>>>          fuse_sync_writes(inode);
>>>>>
>>>>>          /*
>>>>>           * Due to implementation of fuse writeback
>>>>>           * file_write_and_wait_range() does not catch errors.
>>>>>           * We have to do this directly after fuse_sync_writes()
>>>>>           */
>>>>>          err = file_check_and_advance_wb_err(file);
>>>>>          if (err)
>>>>>                  goto out;
>>>>>
>>>>>
>>>>>        We can get rid of the fuse_sync_writes() and
>>>>> file_check_and_advance_wb_err() entirely since now without temp pages,
>>>>> the file_write_and_wait_range() call actually ensures that writeback
>>>>> is completed
>>>>>
>>>>>
>>>>>
>>>>> fuse_writeback_range():
>>>>>          static int fuse_writeback_range(struct inode *inode, loff_t
>>>>> start, loff_t end)
>>>>>          {
>>>>>                  int err =
>>>>> filemap_write_and_wait_range(inode->i_mapping, start, LLONG_MAX);
>>>>>
>>>>>                  if (!err)
>>>>>                          fuse_sync_writes(inode);
>>>>>
>>>>>                  return err;
>>>>>          }
>>>>>
>>>>>
>>>>>        We can replace fuse_writeback_range() entirely with
>>>>> filemap_write_and_wait_range().
>>>>>
>>>>>
>>>>>
>>>>> fuse_direct_io():
>>>>>          if (fopen_direct_io && fc->direct_io_allow_mmap) {
>>>>>                  res = filemap_write_and_wait_range(mapping, pos, pos +
>>>>> count - 1);
>>>>>                  if (res) {
>>>>>                          fuse_io_free(ia);
>>>>>                          return res;
>>>>>                  }
>>>>>          }
>>>>>          if (!cuse && filemap_range_has_writeback(mapping, pos, (pos +
>>>>> count - 1))) {
>>>>>                  if (!write)
>>>>>                          inode_lock(inode);
>>>>>                  fuse_sync_writes(inode);
>>>>>                  if (!write)
>>>>>                          inode_unlock(inode);
>>>>>          }
>>>>>
>>>>>
>>>>>         I think this can just replaced with
>>>>>                  if (fopen_direct_io && (fc->direct_io_allow_mmap || !cuse)) {
>>>>>                          res = filemap_write_and_wait_range(mapping,
>>>>> pos, pos + count - 1);
>>>>>                          if (res) {
>>>>>                                  fuse_io_free(ia);
>>>>>                                  return res;
>>>>>                          }
>>>>>                  }
>>>>
>>>> Alright. But I would prefer doing this filemap_write_and_wait_range() in
>>>> fuse_direct_write_iter() rather than fuse_direct_io() if possible.
>>>>
>>>>>         since for the !fopen_direct_io case, it will already go through
>>>>> filemap_write_and_wait_range(), as you mentioned in your previous
>>>>> message. I think this also fixes a bug (?) in the original code - in
>>>>> the fopen_direct_io && !fc->direct_io_allow_mmap case, I think we
>>>>> still need to write out dirty pages first, which we don't currently
>>>>> do.
>>>>
>>>> Nope.  In case of fopen_direct_io && !fc->direct_io_allow_mmap, there
>>>> won't be any page cache at all, right?
>>>>
>>>
>>> Isn't there still a page cache if the file was previously opened
>>> without direct io and then the client opens another handle to that
>>> file with direct io? In that case, the pages could still be dirty in
>>> the page cache and would need to be written back first, no?
>>
>> Do you mean that when the inode is firstly opened, FOPEN_DIRECT_IO is
>> not set by the FUSE server, while it is secondly opened, the flag is set?
>>
>> Though the behavior of the FUSE daemon is quite confusing in this case,
>> it is completely possible in real life.  So yes we'd better add
>> filemap_write_and_wait_range() unconditionally in fopen_direct_io case.
>>
> 
> I think this behavior on the server side is pretty common. From what
> I've seen on most servers, the server when handling the open sets
> fi->direct_io depending on if the client opens with O_DIRECT, eg
> 
>          if (fi->flags & O_DIRECT)
>                  fi->direct_io = 1;

One should do that, actually, to get a shared lock on the inode.
With the additional

fi.parallel_direct_writes = 1;

> 
> If a client opens a file without O_DIRECT and then opens the same file
> with O_DIRECT, then we run into this case. Though I'm not sure how
> common it generally is for clients to do this.
> 


I guess the common case is

application1 does mmap
application2 does normal read/write

fuse-server might set  fi->direct_io = 1 on all opens, with the
additional

fuse_set_feature_flag(connp, FUSE_CAP_DIRECT_IO_ALLOW_MMAP);

Probably will only come to it tomorrow, but will review,
especially to check about cached/uncached io-modes.



Thanks,
Bernd


  reply	other threads:[~2025-04-16 18:05 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-04 18:14 [PATCH v7 0/3] fuse: remove temp page copies in writeback Joanne Koong
2025-04-04 18:14 ` [PATCH v7 1/3] mm: add AS_WRITEBACK_INDETERMINATE mapping flag Joanne Koong
2025-04-04 19:13   ` David Hildenbrand
2025-04-04 20:09     ` Joanne Koong
2025-04-04 20:13       ` David Hildenbrand
2025-04-09 22:05         ` Shakeel Butt
2025-04-09 23:48           ` Joanne Koong
2025-04-04 18:14 ` [PATCH v7 2/3] mm: skip reclaiming folios in legacy memcg writeback indeterminate contexts Joanne Koong
2025-04-04 18:14 ` [PATCH v7 3/3] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong
2025-04-09  2:43   ` Jingbo Xu
2025-04-09 23:47     ` Joanne Koong
2025-04-10  2:12       ` Jingbo Xu
2025-04-10 15:07         ` Joanne Koong
2025-04-10 15:11           ` Jingbo Xu
2025-04-10 16:11             ` Joanne Koong
2025-04-14 20:24               ` Joanne Koong
2025-04-15  7:49               ` Jingbo Xu
2025-04-15 15:59                 ` Joanne Koong
2025-04-16  1:40                   ` Jingbo Xu
2025-04-16 16:43                     ` Joanne Koong
2025-04-16 18:05                       ` Bernd Schubert [this message]
2025-04-14 16:21 ` [PATCH v7 0/3] fuse: remove temp page copies in writeback Jeff Layton
2025-04-14 20:28   ` Joanne Koong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=70521e78-cca2-40e2-b3d3-da2ef74ca625@fastmail.fm \
    --to=bernd.schubert@fastmail.fm \
    --cc=akpm@linux-foundation.org \
    --cc=david@redhat.com \
    --cc=jefflexu@linux.alibaba.com \
    --cc=jlayton@kernel.org \
    --cc=joannelkoong@gmail.com \
    --cc=kernel-team@meta.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=miklos@szeredi.hu \
    --cc=mszeredi@redhat.com \
    --cc=shakeel.butt@linux.dev \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox