linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Joanne Koong <joannelkoong@gmail.com>
To: Jingbo Xu <jefflexu@linux.alibaba.com>
Cc: miklos@szeredi.hu, linux-fsdevel@vger.kernel.org,
	shakeel.butt@linux.dev,  josef@toxicpanda.com,
	linux-mm@kvack.org, bernd.schubert@fastmail.fm,
	 kernel-team@meta.com
Subject: Re: [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree
Date: Fri, 15 Nov 2024 10:29:56 -0800	[thread overview]
Message-ID: <CAJnrk1Z05CP6dTu0Wym020OWf=cyt2BRSP7vNshRmDVqUHL0Yg@mail.gmail.com> (raw)
In-Reply-To: <8cd0a96c-7d8f-4a38-afc6-2c48bc701da8@linux.alibaba.com>

On Thu, Nov 14, 2024 at 6:18 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
>
>
> On 11/15/24 2:19 AM, Joanne Koong wrote:
> > On Wed, Nov 13, 2024 at 5:47 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >>
> >>
> >>
> >> On 11/14/24 8:39 AM, Joanne Koong wrote:
> >>> On Tue, Nov 12, 2024 at 1:25 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >>>>
> >>>> Hi Joanne,
> >>>>
> >>>> On 11/8/24 7:56 AM, Joanne Koong wrote:
> >>>>> Currently, we allocate and copy data to a temporary folio when
> >>>>> handling writeback in order to mitigate the following deadlock scenario
> >>>>> that may arise if reclaim waits on writeback to complete:
> >>>>> * single-threaded FUSE server is in the middle of handling a request
> >>>>>   that needs a memory allocation
> >>>>> * memory allocation triggers direct reclaim
> >>>>> * direct reclaim waits on a folio under writeback
> >>>>> * the FUSE server can't write back the folio since it's stuck in
> >>>>>   direct reclaim
> >>>>>
> >>>>> To work around this, we allocate a temporary folio and copy over the
> >>>>> original folio to the temporary folio so that writeback can be
> >>>>> immediately cleared on the original folio. This additionally requires us
> >>>>> to maintain an internal rb tree to keep track of writeback state on the
> >>>>> temporary folios.
> >>>>>
> >>>>> A recent change prevents reclaim logic from waiting on writeback for
> >>>>> folios whose mappings have the AS_WRITEBACK_MAY_BLOCK flag set in it.
> >>>>> This commit sets AS_WRITEBACK_MAY_BLOCK on FUSE inode mappings (which
> >>>>> will prevent FUSE folios from running into the reclaim deadlock described
> >>>>> above) and removes the temporary folio + extra copying and the internal
> >>>>> rb tree.
> >>>>>
> >>>>> fio benchmarks --
> >>>>> (using averages observed from 10 runs, throwing away outliers)
> >>>>>
> >>>>> Setup:
> >>>>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
> >>>>>  ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
> >>>>>
> >>>>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> >>>>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
> >>>>>
> >>>>>         bs =  1k          4k            1M
> >>>>> Before  351 MiB/s     1818 MiB/s     1851 MiB/s
> >>>>> After   341 MiB/s     2246 MiB/s     2685 MiB/s
> >>>>> % diff        -3%          23%         45%
> >>>>>
> >>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> >>>>
> >>>> I think there are some places checking or waiting for writeback could be
> >>>> reconsidered if they are still needed or not after we dropping the temp
> >>>> page design.
> >>>>
> >>>> As they are inherited from the original implementation, at least they
> >>>> are harmless.  I think they could be remained in this patch, and could
> >>>> be cleaned up later if really needed.
> >>>>
> >>>
> >>> Thank you for the thorough inspection!
> >>>
> >>>>
> >>>>> @@ -891,7 +813,7 @@ static int fuse_do_readfolio(struct file *file, struct folio *folio)
> >>>>>        * have writeback that extends beyond the lifetime of the folio.  So
> >>>>>        * make sure we read a properly synced folio.
> >>>>>        */
> >>>>> -     fuse_wait_on_folio_writeback(inode, folio);
> >>>>> +     folio_wait_writeback(folio);
> >>>>
> >>>> I doubt if wait-on-writeback is needed here, as now page cache won't be
> >>>> freed until the writeback IO completes.
> >>>>
> >>>> The routine attempts to free page cache, e.g. invalidate_inode_pages2()
> >>>> (generally called by distributed filesystems when the file content has
> >>>> been modified from remote) or truncate_inode_pages() (called from
> >>>> truncate(2) or inode eviction) will wait for writeback completion (if
> >>>> any) before freeing page.
> >>>>
> >>>> Thus I don't think there's any possibility that .read_folio() or
> >>>> .readahead() will be called when the writeback has not completed.
> >>>>
> >>>
> >>> Great point. I'll remove this line and the comment above it too.
> >>>
> >>>>
> >>>>> @@ -1172,7 +1093,7 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
> >>>>>       int err;
> >>>>>
> >>>>>       for (i = 0; i < ap->num_folios; i++)
> >>>>> -             fuse_wait_on_folio_writeback(inode, ap->folios[i]);
> >>>>> +             folio_wait_writeback(ap->folios[i]);
> >>>>
> >>>> Ditto.
> >>
> >> Actually this is a typo and I originally meant that waiting for
> >> writeback in fuse_send_readpages() is unneeded as page cache won't be
> >> freed until the writeback IO completes.
> >>
> >>> -     wait_event(fi->page_waitq, !fuse_range_is_writeback(inode, first, last));
> >>> +     filemap_fdatawait_range(inode->i_mapping, first, last);
> >>
> >
> > Gotcha and agreed. I'll remove this wait from readahead().
> >
> >>
> >> In fact the above waiting for writeback in fuse_send_write_pages() is
> >> needed.  The reason is as follows:
> >>
> >>>>
> >>>
> >>> Why did we need this fuse_wait_on_folio_writeback() even when we had
> >>> the temp pages? If I'm understanding it correctly,
> >>> fuse_send_write_pages() is only called for the writethrough case (by
> >>> fuse_perform_write()), so these folios would never even be under
> >>> writeback, no?
> >>
> >> I think mmap write could make the page dirty and the writeback could be
> >> triggered then.
> >>
> >
> > Ohhh I see, thanks for the explanation.
> >
> >>>
> >>>>
> >>>>
> >>>>>  static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio,
> >>>>> -                                       struct folio *tmp_folio, uint32_t folio_index)
> >>>>> +                                       uint32_t folio_index)
> >>>>>  {
> >>>>>       struct inode *inode = folio->mapping->host;
> >>>>>       struct fuse_args_pages *ap = &wpa->ia.ap;
> >>>>>
> >>>>> -     folio_copy(tmp_folio, folio);
> >>>>> -
> >>>>> -     ap->folios[folio_index] = tmp_folio;
> >>>>> +     folio_get(folio);
> >>>>
> >>>> I still think this folio_get() here is harmless but redundant.
> >>>>
> >>>> Ditto page cache won't be freed before writeback completes.
> >>>>
> >>>> Besides, other .writepages() implementaions e.g. iomap_writepages() also
> >>>> doen't get the refcount when constructing the writeback IO.
> >>>>
> >>>
> >>> Point taken. I'll remove this then, since other .writepages() also
> >>> don't obtain a refcount.
> >>>
> >>>>
> >>>>> @@ -2481,7 +2200,7 @@ static int fuse_write_begin(struct file *file, struct address_space *mapping,
> >>>>>       if (IS_ERR(folio))
> >>>>>               goto error;
> >>>>>
> >>>>> -     fuse_wait_on_page_writeback(mapping->host, folio->index);
> >>>>> +     folio_wait_writeback(folio);
> >>>>
> >>>> I also doubt if wait_on_writeback() is needed here, as now there won't
> >>>> be duplicate writeback IOs for the same page.
> >>>
> >>> What prevents there from being a duplicate writeback write for the
> >>> same page? This is the path I'm looking at:
> >>>
> >>> ksys_write()
> >>>   vfs_write()
> >>>     new_sync_write()
> >>>        op->write_iter()
> >>>           fuse_file_write_iter()
> >>>             fuse_cache_write_iter()
> >>>                generic_file_write_iter()
> >>>                    __generic_file_write_iter()
> >>>                        generic_perform_write()
> >>>                            op->write_begin()
> >>>                                fuse_write_begin()
> >>>
> >>> but I'm not seeing where there's anything that prevents a duplicate
> >>> write from happening.
> >>
> >> I mean there won't be duplicate *writeback* rather than *write* for the
> >> same page.  You could write the page cache and make it dirty at the time
> >> when the writeback for the same page is still on going, as long as we
> >> can ensure that even when the page is dirtied again, there won't be a
> >> duplicate writeback IO for the same page when the previous writeback IO
> >> has not completed yet.
> >>
> >
> > I think we still need this folio_wait_writeback() since we're calling
> > fuse_do_readfolio() and removing the folio_wait_writeback() from
> > fuse_do_readfolio(). else we could read back stale data if the
> > writeback hasn't gone through yet.
> > I think we could probably move the folio_wait_writeback() here in
> > fuse_write_begin() to be right before the fuse_do_readfolio() call and
> > skip waiting on writeback if we hit the "success" gotos.
>
> IIUC if cache hit when fuse_write_begin() is called, the page is marked
> with PG_update which indicates that (.readpage()) the IO request reading
> data from disk to page is already done.  Thus fuse_write_begin() will do
> nothing and return directly.
>
> >       if (folio_test_uptodate(folio) || len >= folio_size(folio))
> >               goto success;
>
> Otherwise fuse_do_readfolio() is called to read data when cache miss,
> i.e. the page doesn't exist in the page cache before or it gets
> invalidated previously.  Since we can ensure that no page can be
> invalidated until the writeback IO completes, we can ensure that there's
> no in-flight writeback IO when fuse_do_readfolio() is called.
>
> Other folks can also correct me if I get wrong, since I also attempts to
> figure out all these details about the page cache management recently.
>

You're right, this writeback wait is unneeded. I'll remove this for v5.

Thanks,
Joanne
>
> --
> Thanks,
> Jingbo


      reply	other threads:[~2024-11-15 18:30 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-11-07 23:56 [PATCH v4 0/6] fuse: remove temp page copies in writeback Joanne Koong
2024-11-07 23:56 ` [PATCH v4 1/6] mm: add AS_WRITEBACK_MAY_BLOCK mapping flag Joanne Koong
2024-11-09  0:10   ` Shakeel Butt
2024-11-11 21:11     ` Joanne Koong
2024-11-15 19:33       ` Joanne Koong
2024-11-15 20:17         ` Joanne Koong
2024-11-07 23:56 ` [PATCH v4 2/6] mm: skip reclaiming folios in legacy memcg writeback contexts that may block Joanne Koong
2024-11-09  0:16   ` Shakeel Butt
2024-11-07 23:56 ` [PATCH v4 3/6] fs/writeback: in wait_sb_inodes(), skip wait for AS_WRITEBACK_MAY_BLOCK mappings Joanne Koong
2024-11-07 23:56 ` [PATCH v4 4/6] mm/memory-hotplug: add finite retries in offline_pages() if migration fails Joanne Koong
2024-11-08 17:33   ` SeongJae Park
2024-11-08 18:56     ` David Hildenbrand
2024-11-08 19:00       ` David Hildenbrand
2024-11-08 21:27         ` Shakeel Butt
2024-11-08 21:42           ` Joanne Koong
2024-11-08 22:16             ` Shakeel Butt
2024-11-08 22:20               ` Joanne Koong
2024-11-08 21:59     ` Joanne Koong
2024-11-07 23:56 ` [PATCH v4 5/6] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_MAY_BLOCK mappings Joanne Koong
2024-11-07 23:56 ` [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong
2024-11-08  8:48   ` Jingbo Xu
2024-11-08 22:33     ` Joanne Koong
2024-11-11  8:32   ` Jingbo Xu
2024-11-11 21:30     ` Joanne Koong
2024-11-12  2:31       ` Jingbo Xu
2024-11-13 19:11         ` Joanne Koong
2024-11-12  9:25   ` Jingbo Xu
2024-11-14  0:39     ` Joanne Koong
2024-11-14  1:46       ` Jingbo Xu
2024-11-14 18:19         ` Joanne Koong
2024-11-15  2:18           ` Jingbo Xu
2024-11-15 18:29             ` Joanne Koong [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAJnrk1Z05CP6dTu0Wym020OWf=cyt2BRSP7vNshRmDVqUHL0Yg@mail.gmail.com' \
    --to=joannelkoong@gmail.com \
    --cc=bernd.schubert@fastmail.fm \
    --cc=jefflexu@linux.alibaba.com \
    --cc=josef@toxicpanda.com \
    --cc=kernel-team@meta.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=miklos@szeredi.hu \
    --cc=shakeel.butt@linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox