From: Joanne Koong <joannelkoong@gmail.com>
To: Jingbo Xu <jefflexu@linux.alibaba.com>
Cc: miklos@szeredi.hu, linux-fsdevel@vger.kernel.org,
shakeel.butt@linux.dev, josef@toxicpanda.com,
linux-mm@kvack.org, bernd.schubert@fastmail.fm,
kernel-team@meta.com
Subject: Re: [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree
Date: Fri, 15 Nov 2024 10:29:56 -0800 [thread overview]
Message-ID: <CAJnrk1Z05CP6dTu0Wym020OWf=cyt2BRSP7vNshRmDVqUHL0Yg@mail.gmail.com> (raw)
In-Reply-To: <8cd0a96c-7d8f-4a38-afc6-2c48bc701da8@linux.alibaba.com>
On Thu, Nov 14, 2024 at 6:18 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
>
>
> On 11/15/24 2:19 AM, Joanne Koong wrote:
> > On Wed, Nov 13, 2024 at 5:47 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >>
> >>
> >>
> >> On 11/14/24 8:39 AM, Joanne Koong wrote:
> >>> On Tue, Nov 12, 2024 at 1:25 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >>>>
> >>>> Hi Joanne,
> >>>>
> >>>> On 11/8/24 7:56 AM, Joanne Koong wrote:
> >>>>> Currently, we allocate and copy data to a temporary folio when
> >>>>> handling writeback in order to mitigate the following deadlock scenario
> >>>>> that may arise if reclaim waits on writeback to complete:
> >>>>> * single-threaded FUSE server is in the middle of handling a request
> >>>>> that needs a memory allocation
> >>>>> * memory allocation triggers direct reclaim
> >>>>> * direct reclaim waits on a folio under writeback
> >>>>> * the FUSE server can't write back the folio since it's stuck in
> >>>>> direct reclaim
> >>>>>
> >>>>> To work around this, we allocate a temporary folio and copy over the
> >>>>> original folio to the temporary folio so that writeback can be
> >>>>> immediately cleared on the original folio. This additionally requires us
> >>>>> to maintain an internal rb tree to keep track of writeback state on the
> >>>>> temporary folios.
> >>>>>
> >>>>> A recent change prevents reclaim logic from waiting on writeback for
> >>>>> folios whose mappings have the AS_WRITEBACK_MAY_BLOCK flag set.
> >>>>> This commit sets AS_WRITEBACK_MAY_BLOCK on FUSE inode mappings (which
> >>>>> will prevent FUSE folios from running into the reclaim deadlock described
> >>>>> above) and removes the temporary folio + extra copying and the internal
> >>>>> rb tree.
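
For anyone skimming the thread, a rough sketch of the mechanism. The
mapping_set_writeback_may_block() helper name below is just my shorthand
(patches 1-2 define the actual flag and the reclaim-side check, which may
be spelled differently):

	/* mark the FUSE inode mapping, e.g. from fuse_init_inode(): */
	static inline void mapping_set_writeback_may_block(struct address_space *mapping)
	{
		set_bit(AS_WRITEBACK_MAY_BLOCK, &mapping->flags);
	}
	...
	mapping_set_writeback_may_block(inode->i_mapping);

	/* reclaim then skips the wait for such mappings, roughly: */
	if (folio_test_writeback(folio) &&
	    !test_bit(AS_WRITEBACK_MAY_BLOCK, &folio->mapping->flags))
		folio_wait_writeback(folio);
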
> >>>>>
> >>>>> fio benchmarks --
> >>>>> (using averages observed from 10 runs, throwing away outliers)
> >>>>>
> >>>>> Setup:
> >>>>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
> >>>>> ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
> >>>>>
> >>>>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> >>>>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
> >>>>>
> >>>>> bs =        1k           4k           1M
> >>>>> Before      351 MiB/s    1818 MiB/s   1851 MiB/s
> >>>>> After       341 MiB/s    2246 MiB/s   2685 MiB/s
> >>>>> % diff      -3%          23%          45%
> >>>>>
> >>>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> >>>>
> >>>> I think there are some places that check or wait for writeback which
> >>>> could be reconsidered (whether they are still needed or not) now that
> >>>> we are dropping the temp page design.
> >>>>
> >>>> As they are inherited from the original implementation, they are at
> >>>> least harmless. I think they could remain in this patch and be cleaned
> >>>> up later if really needed.
> >>>>
> >>>
> >>> Thank you for the thorough inspection!
> >>>
> >>>>
> >>>>> @@ -891,7 +813,7 @@ static int fuse_do_readfolio(struct file *file, struct folio *folio)
> >>>>> * have writeback that extends beyond the lifetime of the folio. So
> >>>>> * make sure we read a properly synced folio.
> >>>>> */
> >>>>> - fuse_wait_on_folio_writeback(inode, folio);
> >>>>> + folio_wait_writeback(folio);
> >>>>
> >>>> I doubt whether the wait-on-writeback is needed here, as the page cache
> >>>> now won't be freed until the writeback IO completes.
> >>>>
> >>>> The routines that attempt to free page cache, e.g. invalidate_inode_pages2()
> >>>> (generally called by distributed filesystems when the file content has
> >>>> been modified remotely) or truncate_inode_pages() (called from
> >>>> truncate(2) or inode eviction), will wait for writeback completion (if
> >>>> any) before freeing the page.
> >>>>
> >>>> Thus I don't think there's any possibility that .read_folio() or
> >>>> .readahead() will be called when the writeback has not completed.
> >>>>
> >>>
> >>> Great point. I'll remove this line and the comment above it too.
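
For the archive, the invariant being described, sketched from the mm side
(heavily simplified, not the exact code in mm/truncate.c):

	/*
	 * truncate_inode_pages_range() / invalidate_inode_pages2_range(),
	 * boiled down: a folio only leaves the page cache after any
	 * in-flight writeback on it completes, so .read_folio() and
	 * .readahead() can never be handed a folio still under writeback.
	 */
	folio_lock(folio);
	folio_wait_writeback(folio);
	truncate_inode_folio(mapping, folio);
	folio_unlock(folio);
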
> >>>
> >>>>
> >>>>> @@ -1172,7 +1093,7 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
> >>>>> int err;
> >>>>>
> >>>>> for (i = 0; i < ap->num_folios; i++)
> >>>>> - fuse_wait_on_folio_writeback(inode, ap->folios[i]);
> >>>>> + folio_wait_writeback(ap->folios[i]);
> >>>>
> >>>> Ditto.
> >>
> >> Actually this is a typo and I originally meant that waiting for
> >> writeback in fuse_send_readpages() is unneeded as page cache won't be
> >> freed until the writeback IO completes.
> >>
> >>> - wait_event(fi->page_waitq, !fuse_range_is_writeback(inode, first, last));
> >>> + filemap_fdatawait_range(inode->i_mapping, first, last);
> >>
> >
> > Gotcha and agreed. I'll remove this wait from readahead().
> >
> >>
> >> In fact the above waiting for writeback in fuse_send_write_pages() is
> >> needed. The reason is as follows:
> >>
> >>>>
> >>>
> >>> Why did we need this fuse_wait_on_folio_writeback() even when we had
> >>> the temp pages? If I'm understanding it correctly,
> >>> fuse_send_write_pages() is only called for the writethrough case (by
> >>> fuse_perform_write()), so these folios would never even be under
> >>> writeback, no?
> >>
> >> I think an mmap write could make the page dirty, and writeback could
> >> then be triggered for it.
> >>
> >
> > Ohhh I see, thanks for the explanation.
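
To spell the ordering out for myself -- the wait in fuse_send_write_pages()
stays because of this interleaving (the timeline is my own annotation, the
code line is from the hunk above):

	/*
	 * 1. an mmap write dirties the folio
	 * 2. background writeback starts: the folio is under writeback and
	 *    a WRITE request for it is in flight to the server
	 * 3. write(2) -> fuse_perform_write() -> fuse_send_write_pages()
	 *    wants to copy new data into the same folio
	 *
	 * Waiting here keeps the synchronous WRITE from racing with the
	 * in-flight writeback WRITE for the same pages:
	 */
	for (i = 0; i < ap->num_folios; i++)
		folio_wait_writeback(ap->folios[i]);
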
> >
> >>>
> >>>>
> >>>>
> >>>>> static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa, struct folio *folio,
> >>>>> - struct folio *tmp_folio, uint32_t folio_index)
> >>>>> + uint32_t folio_index)
> >>>>> {
> >>>>> struct inode *inode = folio->mapping->host;
> >>>>> struct fuse_args_pages *ap = &wpa->ia.ap;
> >>>>>
> >>>>> - folio_copy(tmp_folio, folio);
> >>>>> -
> >>>>> - ap->folios[folio_index] = tmp_folio;
> >>>>> + folio_get(folio);
> >>>>
> >>>> I still think this folio_get() here is harmless but redundant.
> >>>>
> >>>> Ditto: the page cache won't be freed before writeback completes.
> >>>>
> >>>> Besides, other .writepages() implementations, e.g. iomap_writepages(),
> >>>> also don't take a refcount when constructing the writeback IO.
> >>>>
> >>>
> >>> Point taken. I'll remove this then, since other .writepages() also
> >>> don't obtain a refcount.
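
So the fill helper ends up referencing the page-cache folio directly -- a
sketch based on the hunk above, with the rest of the function elided:

	static void fuse_writepage_args_page_fill(struct fuse_writepage_args *wpa,
						  struct folio *folio,
						  uint32_t folio_index)
	{
		struct fuse_args_pages *ap = &wpa->ia.ap;

		/*
		 * No folio_get(): like iomap_writepages(), rely on the page
		 * cache keeping the folio alive until writeback completes.
		 */
		ap->folios[folio_index] = folio;
		...
	}
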
> >>>
> >>>>
> >>>>> @@ -2481,7 +2200,7 @@ static int fuse_write_begin(struct file *file, struct address_space *mapping,
> >>>>> if (IS_ERR(folio))
> >>>>> goto error;
> >>>>>
> >>>>> - fuse_wait_on_page_writeback(mapping->host, folio->index);
> >>>>> + folio_wait_writeback(folio);
> >>>>
> >>>> I also doubt whether the wait on writeback is needed here, as there
> >>>> now won't be duplicate writeback IOs for the same page.
> >>>
> >>> What prevents there from being a duplicate writeback write for the
> >>> same page? This is the path I'm looking at:
> >>>
> >>> ksys_write()
> >>> vfs_write()
> >>> new_sync_write()
> >>> op->write_iter()
> >>> fuse_file_write_iter()
> >>> fuse_cache_write_iter()
> >>> generic_file_write_iter()
> >>> __generic_file_write_iter()
> >>> generic_perform_write()
> >>> op->write_begin()
> >>> fuse_write_begin()
> >>>
> >>> but I'm not seeing where there's anything that prevents a duplicate
> >>> write from happening.
> >>
> >> I mean there won't be a duplicate *writeback*, as opposed to a duplicate
> >> *write*, for the same page. You can still write to the page cache and
> >> dirty the page while writeback for that page is ongoing; what we need to
> >> ensure is that even when the page is dirtied again, no duplicate
> >> writeback IO is issued for it while the previous writeback IO has not
> >> yet completed.
> >>
> >
> > I think we still need this folio_wait_writeback(), since we're calling
> > fuse_do_readfolio() here while removing the folio_wait_writeback() from
> > fuse_do_readfolio(). Otherwise we could read back stale data if the
> > writeback hasn't gone through yet.
> > I think we could probably move the folio_wait_writeback() here in
> > fuse_write_begin() to right before the fuse_do_readfolio() call and
> > skip waiting on writeback if we hit the "success" gotos.
>
> IIUC, on a cache hit when fuse_write_begin() is called, the page is marked
> with PG_uptodate, which indicates that the IO request (.read_folio())
> reading data from disk into the page has already completed. Thus
> fuse_write_begin() will do nothing and return directly.
>
> >	if (folio_test_uptodate(folio) || len >= folio_size(folio))
> >		goto success;
>
> Otherwise, on a cache miss, fuse_do_readfolio() is called to read the data,
> i.e. the page didn't exist in the page cache before or was previously
> invalidated. Since no page can be invalidated until the writeback IO
> completes, we can be sure there's no in-flight writeback IO when
> fuse_do_readfolio() is called.
>
> Other folks can correct me if I've got this wrong, since I have also only
> recently been trying to work out all these details of page cache management.
>
You're right, this writeback wait is unneeded. I'll remove it for v5.
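
Roughly what I have in mind for v5 (a sketch from memory -- error handling
and the exact locals in fuse_write_begin() are elided, so the final diff
may differ):

	folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
				    mapping_gfp_mask(mapping));
	if (IS_ERR(folio))
		goto error;

	/* no folio_wait_writeback() here anymore */

	if (folio_test_uptodate(folio) || len >= folio_size(folio))
		goto success;

	/*
	 * Cache miss: the folio was never read in or was invalidated, and
	 * invalidation waits for writeback, so no writeback can be in
	 * flight when fuse_do_readfolio() runs.
	 */
	err = fuse_do_readfolio(file, folio);
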
Thanks,
Joanne
>
> --
> Thanks,
> Jingbo