linux-mm.kvack.org archive mirror
From: Joanne Koong <joannelkoong@gmail.com>
To: Jingbo Xu <jefflexu@linux.alibaba.com>
Cc: miklos@szeredi.hu, linux-fsdevel@vger.kernel.org,
	shakeel.butt@linux.dev, josef@toxicpanda.com,
	linux-mm@kvack.org, bernd.schubert@fastmail.fm,
	kernel-team@meta.com
Subject: Re: [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree
Date: Thu, 14 Nov 2024 10:19:37 -0800
Message-ID: <CAJnrk1Y+CZq5uL72kp1vXxF4Vf1kf+Nk_oGmYFHA8b-uw2gfgQ@mail.gmail.com>
In-Reply-To: <47661fe5-8616-4133-8d9c-faeb1ab96962@linux.alibaba.com>

On Wed, Nov 13, 2024 at 5:47 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
>
>
> On 11/14/24 8:39 AM, Joanne Koong wrote:
> > On Tue, Nov 12, 2024 at 1:25 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >>
> >> Hi Joanne,
> >>
> >> On 11/8/24 7:56 AM, Joanne Koong wrote:
> >>> Currently, we allocate and copy data to a temporary folio when
> >>> handling writeback in order to mitigate the following deadlock scenario
> >>> that may arise if reclaim waits on writeback to complete:
> >>> * single-threaded FUSE server is in the middle of handling a request
> >>>   that needs a memory allocation
> >>> * memory allocation triggers direct reclaim
> >>> * direct reclaim waits on a folio under writeback
> >>> * the FUSE server can't write back the folio since it's stuck in
> >>>   direct reclaim
> >>>
> >>> To work around this, we allocate a temporary folio and copy over the
> >>> original folio to the temporary folio so that writeback can be
> >>> immediately cleared on the original folio. This additionally requires us
> >>> to maintain an internal rb tree to keep track of writeback state on the
> >>> temporary folios.
> >>>
> >>> A recent change prevents reclaim logic from waiting on writeback for
> >>> folios whose mappings have the AS_WRITEBACK_MAY_BLOCK flag set.
> >>> This commit sets AS_WRITEBACK_MAY_BLOCK on FUSE inode mappings (which
> >>> will prevent FUSE folios from running into the reclaim deadlock described
> >>> above) and removes the temporary folio + extra copying and the internal
> >>> rb tree.
> >>>
> >>> fio benchmarks --
> >>> (using averages observed from 10 runs, throwing away outliers)
> >>>
> >>> Setup:
> >>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
> >>>  ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
> >>>
> >>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> >>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
> >>>
> >>>         bs =  1k          4k            1M
> >>> Before  351 MiB/s     1818 MiB/s     1851 MiB/s
> >>> After   341 MiB/s     2246 MiB/s     2685 MiB/s
> >>> % diff        -3%          23%         45%
> >>>
> >>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> >>
> >> I think some places that check or wait for writeback should be
> >> reconsidered to see whether they are still needed after we drop the
> >> temp page design.
> >>
> >> As they are inherited from the original implementation, they are at
> >> least harmless.  I think they could remain in this patch and be
> >> cleaned up later if really needed.
> >>
> >
> > Thank you for the thorough inspection!
> >
> >>
> >>> @@ -891,7 +813,7 @@ static int fuse_do_readfolio(struct file *file, struct folio *folio)
> >>>        * have writeback that extends beyond the lifetime of the folio.  So
> >>>        * make sure we read a properly synced folio.
> >>>        */
> >>> -     fuse_wait_on_folio_writeback(inode, folio);
> >>> +     folio_wait_writeback(folio);
> >>
> >> I doubt that wait-on-writeback is needed here, as the page cache now
> >> won't be freed until the writeback IO completes.
> >>
> >> The routines that attempt to free page cache, e.g. invalidate_inode_pages2()
> >> (generally called by distributed filesystems when the file content has
> >> been modified remotely) or truncate_inode_pages() (called from
> >> truncate(2) or inode eviction), will wait for writeback completion (if
> >> any) before freeing the page.
> >>
> >> Thus I don't think there's any possibility that .read_folio() or
> >> .readahead() will be called when the writeback has not completed.
> >>
> >
> > Great point. I'll remove this line and the comment above it too.
> >
> >>
> >>> @@ -1172,7 +1093,7 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
> >>>       int err;
> >>>
> >>>       for (i = 0; i < ap->num_folios; i++)
> >>> -             fuse_wait_on_folio_writeback(inode, ap->folios[i]);
> >>> +             folio_wait_writeback(ap->folios[i]);
> >>
> >> Ditto.
>
> Actually, this is a typo; I originally meant that waiting for
> writeback in fuse_send_readpages() is unneeded, as the page cache
> won't be freed until the writeback IO completes.
>
> > -     wait_event(fi->page_waitq, !fuse_range_is_writeback(inode, first, last));
> > +     filemap_fdatawait_range(inode->i_mapping, first, last);
>

Gotcha and agreed. I'll remove this wait from readahead().

>
> In fact the above waiting for writeback in fuse_send_write_pages() is
> needed.  The reason is as follows:
>
> >>
> >
> > Why did we need this fuse_wait_on_folio_writeback() even when we had
> > the temp pages? If I'm understanding it correctly,
> > fuse_send_write_pages() is only called for the writethrough case (by
> > fuse_perform_write()), so these folios would never even be under
> > writeback, no?
>
> I think an mmap write could dirty the page, and writeback could then
> be triggered on it.
>

Ohhh I see, thanks for the explanation.
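
Writing out the interleaving for my own notes (this is just my
understanding of the race, not something from the patch):

```c
/*
 * Why fuse_send_write_pages() still needs folio_wait_writeback(),
 * as I understand it (hypothetical interleaving):
 *
 *   1. mmap write dirties folio F
 *   2. writeback starts on F; F is under writeback while its data is
 *      being sent to the server
 *   3. write(2) -> fuse_perform_write() -> fuse_send_write_pages()
 *      modifies F and wants to send its new contents
 *   4. without waiting on step 2's writeback, two IOs for F could be
 *      in flight at once and complete out of order, leaving stale
 *      data on the server
 */
```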

> >
> >>
> >>
> >>>  static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio,
> >>> -                                       struct folio *tmp_folio, uint32_t folio_index)
> >>> +                                       uint32_t folio_index)
> >>>  {
> >>>       struct inode *inode = folio->mapping->host;
> >>>       struct fuse_args_pages *ap = &wpa->ia.ap;
> >>>
> >>> -     folio_copy(tmp_folio, folio);
> >>> -
> >>> -     ap->folios[folio_index] = tmp_folio;
> >>> +     folio_get(folio);
> >>
> >> I still think this folio_get() here is harmless but redundant.
> >>
> >> Ditto: the page cache won't be freed before writeback completes.
> >>
> >> Besides, other .writepages() implementations, e.g. iomap_writepages(),
> >> also don't take a refcount when constructing the writeback IO.
> >>
> >
> > Point taken. I'll remove this then, since other .writepages() also
> > don't obtain a refcount.
> >
> >>
> >>> @@ -2481,7 +2200,7 @@ static int fuse_write_begin(struct file *file, struct address_space *mapping,
> >>>       if (IS_ERR(folio))
> >>>               goto error;
> >>>
> >>> -     fuse_wait_on_page_writeback(mapping->host, folio->index);
> >>> +     folio_wait_writeback(folio);
> >>
> >> I also doubt that wait_on_writeback() is needed here, as there now
> >> won't be duplicate writeback IOs for the same page.
> >
> > What prevents there from being a duplicate writeback write for the
> > same page? This is the path I'm looking at:
> >
> > ksys_write()
> >   vfs_write()
> >     new_sync_write()
> >        op->write_iter()
> >           fuse_file_write_iter()
> >             fuse_cache_write_iter()
> >                generic_file_write_iter()
> >                    __generic_file_write_iter()
> >                        generic_perform_write()
> >                            op->write_begin()
> >                                fuse_write_begin()
> >
> > but I'm not seeing where there's anything that prevents a duplicate
> > write from happening.
>
> I mean there won't be a duplicate *writeback*, rather than a duplicate
> *write*, for the same page.  You can still write to the page cache and
> dirty the page while writeback for that page is ongoing; we just need
> to ensure that even when the page is dirtied again, no duplicate
> writeback IO is issued for it before the previous writeback IO has
> completed.
>

I think we still need this folio_wait_writeback(), since
fuse_write_begin() calls fuse_do_readfolio() and we're removing the
folio_wait_writeback() from fuse_do_readfolio(); otherwise we could
read back stale data if the writeback hasn't gone through yet.
I think we could probably move the folio_wait_writeback() here in
fuse_write_begin() to right before the fuse_do_readfolio() call and
skip waiting on writeback if we hit the "success" gotos.
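
To make that concrete, here's a rough sketch of what I'm picturing
(simplified from fs/fuse/file.c; helper names and error paths are
abbreviated, so treat this as pseudocode rather than the actual patch):

```c
static int fuse_write_begin(struct file *file, struct address_space *mapping,
			    loff_t pos, unsigned len, struct folio **foliop,
			    void **fsdata)
{
	pgoff_t index = pos >> PAGE_SHIFT;
	struct folio *folio;
	loff_t fsize;
	int err;

	folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
				    mapping_gfp_mask(mapping));
	if (IS_ERR(folio))
		return PTR_ERR(folio);

	/* Fully overwritten or already uptodate: no read, so no wait. */
	if (folio_test_uptodate(folio) || len == folio_size(folio))
		goto success;

	fsize = i_size_read(mapping->host);
	if (fsize <= (pos & PAGE_MASK)) {
		/* Writing past EOF: zero the head instead of reading. */
		size_t off = offset_in_folio(folio, pos);

		if (off)
			folio_zero_segment(folio, 0, off);
		goto success;
	}

	/*
	 * Only here, right before reading the folio back in, do we need
	 * to wait for in-flight writeback, so the read doesn't return
	 * stale data.
	 */
	folio_wait_writeback(folio);
	err = fuse_do_readfolio(file, folio);
	if (err)
		goto cleanup;
success:
	*foliop = folio;
	return 0;
cleanup:
	folio_unlock(folio);
	folio_put(folio);
	return err;
}
```

The only point of the sketch is the placement of folio_wait_writeback():
both "success" paths skip it entirely.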


Thanks,
Joanne
>
>
> --
> Thanks,
> Jingbo

