linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Joanne Koong <joannelkoong@gmail.com>
To: Jingbo Xu <jefflexu@linux.alibaba.com>
Cc: miklos@szeredi.hu, linux-fsdevel@vger.kernel.org,
	shakeel.butt@linux.dev,  josef@toxicpanda.com,
	linux-mm@kvack.org, bernd.schubert@fastmail.fm,
	 kernel-team@meta.com
Subject: Re: [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree
Date: Wed, 13 Nov 2024 11:11:03 -0800	[thread overview]
Message-ID: <CAJnrk1YvCj2t123iRgyewfWkMBYBMJsw+s47su-=13u4dF7W3Q@mail.gmail.com> (raw)
In-Reply-To: <0f585a7c-678b-492a-9492-358f21e57291@linux.alibaba.com>

On Mon, Nov 11, 2024 at 6:31 PM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
>
>
> On 11/12/24 5:30 AM, Joanne Koong wrote:
> > On Mon, Nov 11, 2024 at 12:32 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >>
> >> Hi, Joanne and Miklos,
> >>
> >> On 11/8/24 7:56 AM, Joanne Koong wrote:
> >>> Currently, we allocate and copy data to a temporary folio when
> >>> handling writeback in order to mitigate the following deadlock scenario
> >>> that may arise if reclaim waits on writeback to complete:
> >>> * single-threaded FUSE server is in the middle of handling a request
> >>>   that needs a memory allocation
> >>> * memory allocation triggers direct reclaim
> >>> * direct reclaim waits on a folio under writeback
> >>> * the FUSE server can't write back the folio since it's stuck in
> >>>   direct reclaim
> >>>
> >>> To work around this, we allocate a temporary folio and copy over the
> >>> original folio to the temporary folio so that writeback can be
> >>> immediately cleared on the original folio. This additionally requires us
> >>> to maintain an internal rb tree to keep track of writeback state on the
> >>> temporary folios.
> >>>
> >>> A recent change prevents reclaim logic from waiting on writeback for
> >>> folios whose mappings have the AS_WRITEBACK_MAY_BLOCK flag set in it.
> >>> This commit sets AS_WRITEBACK_MAY_BLOCK on FUSE inode mappings (which
> >>> will prevent FUSE folios from running into the reclaim deadlock described
> >>> above) and removes the temporary folio + extra copying and the internal
> >>> rb tree.
> >>>
> >>> fio benchmarks --
> >>> (using averages observed from 10 runs, throwing away outliers)
> >>>
> >>> Setup:
> >>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
> >>>  ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
> >>>
> >>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
> >>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
> >>>
> >>>         bs =  1k          4k            1M
> >>> Before  351 MiB/s     1818 MiB/s     1851 MiB/s
> >>> After   341 MiB/s     2246 MiB/s     2685 MiB/s
> >>> % diff        -3%          23%         45%
> >>>
> >>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> >>
> >>
> >> IIUC this patch seems to break commit
> >> 8b284dc47291daf72fe300e1138a2e7ed56f38ab ("fuse: writepages: handle same
> >> page rewrites").
> >>
> >
> > Interesting!  My understanding was that we only needed that commit
> > because we were clearing writeback on the original folio before
> > writeback had actually finished.
> >
> > Now that folio writeback state is accounted for normally (eg through
> > writeback being set/cleared on the original folio), does the
> > folio_wait_writeback() call we do in fuse_page_mkwrite() not mitigate
> > this?
>
> Yes, after inspecting the writeback logic more, it seems that the second
> writeback won't be initiated if the first one has not completed yet, see
>
> ```
> a_ops->writepages
>   write_cache_pages
>     writeback_iter
>       writeback_get_folio
>         folio_prepare_writeback
>           if folio_test_writeback(folio):
>             folio_wait_writeback(folio)
> ```
>
> and thus it won't be an issue to remove the auxiliary list ;)
>

Awesome, thanks for double-checking!

> >
> >>> -     /*
> >>> -      * Being under writeback is unlikely but possible.  For example direct
> >>> -      * read to an mmaped fuse file will set the page dirty twice; once when
> >>> -      * the pages are faulted with get_user_pages(), and then after the read
> >>> -      * completed.
> >>> -      */
> >>
> >> In short, the target scenario is like:
> >>
> >> ```
> >> # open a fuse file and mmap
> >> fd1 = open("fuse-file-path", ...)
> >> uaddr = mmap(fd1, ...)
> >>
> >> # DIRECT read to the mmaped fuse file
> >> fd2 = open("ext4-file-path", O_DIRECT, ...)
> >> read(fd2, uaddr, ...)
> >>     # get_user_pages() of uaddr, and triggers faultin
> >>     # a_ops->dirty_folio() <--- mark PG_dirty
> >>
> >>     # when DIRECT IO completed:
> >>     # a_ops->dirty_folio() <--- mark PG_dirty
> >
> > If you have the direct io function call stack at hand, could you point
> > me to the function where the direct io completion marks this folio as
> > dirty?
>
>
> FYI The full call stack is like:
>
> ```
> # DIRECT read(2) to the mmaped fuse file
> read(fd2, uaddr1, ...)
>   f_ops->read_iter()
>     (iomap-based ) iomap_dio_rw
>       # for READ && user_backed_iter(iter):
>         dio->flags |= IOMAP_DIO_DIRTY
>       iomap_dio_iter
>         iomap_dio_bio_iter
>           # add user or kernel pages to a bio
>           bio_iov_iter_get_pages
>             ...
>             pin_user_pages_fast(..., FOLL_WRITE, ...)
>               # find corresponding vma of dest buffer (fuse page cache)
>               # search page table (pet) to find corresponding page
>               # if not fault yet, trigger explicit faultin:
>                 faultin_page(..., FOLL_WRITE, ...)
>                   handle_mm_fault(..., FAULT_FLAG_WRITE)
>                     handle_pte_fault
>                       do_wp_page
>                         (vma->vm_flags & VM_SHARED) wp_page_shared
>                           ...
>                           fault_dirty_shared_page
>                             folio_mark_dirty
>                               a_ops->dirty_folio(), i.e.,
> filemap_dirty_folio()
>                                 # set PG_dirty
>                                 folio_test_set_dirty(folio)
>                                 # set PAGECACHE_TAG_DIRTY
>                                 __folio_mark_dirty
>
>
>           # if dio->flags & IOMAP_DIO_DIRTY:
>           bio_set_pages_dirty
>             (for each dest page) folio_mark_dirty
>                a_ops->dirty_folio(), i.e., filemap_dirty_folio()
>                  # set PG_dirty
>                  folio_test_set_dirty(folio)
>                  # set PAGECACHE_TAG_DIRTY
>                  __folio_mark_dirty
> ```
>

Thanks for this info, Jingbo.

>
> >
> >> ```
> >>
> >> The auxiliary write request list was introduced to fix this.
> >>
> >> I'm not sure if there's an alternative other than the auxiliary list to
> >> fix it, e.g. calling folio_wait_writeback() in a_ops->dirty_folio() so
> >> that the same folio won't get dirtied when the writeback has not
> >> completed yet?
> >>
> >
> > I'm curious how other filesystems solve for this - this seems like a
> > generic situation other filesystems would run into as well.
> >
>
> As mentioned above, the writeback path will prevent the duplicate
> writeback request on the same page when the first writeback IO has not
> completed yet.
>
> Sorry for the noise...
>
> --
> Thanks,
> Jingbo


  reply	other threads:[~2024-11-13 19:11 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-11-07 23:56 [PATCH v4 0/6] fuse: remove temp page copies in writeback Joanne Koong
2024-11-07 23:56 ` [PATCH v4 1/6] mm: add AS_WRITEBACK_MAY_BLOCK mapping flag Joanne Koong
2024-11-09  0:10   ` Shakeel Butt
2024-11-11 21:11     ` Joanne Koong
2024-11-15 19:33       ` Joanne Koong
2024-11-15 20:17         ` Joanne Koong
2024-11-07 23:56 ` [PATCH v4 2/6] mm: skip reclaiming folios in legacy memcg writeback contexts that may block Joanne Koong
2024-11-09  0:16   ` Shakeel Butt
2024-11-07 23:56 ` [PATCH v4 3/6] fs/writeback: in wait_sb_inodes(), skip wait for AS_WRITEBACK_MAY_BLOCK mappings Joanne Koong
2024-11-07 23:56 ` [PATCH v4 4/6] mm/memory-hotplug: add finite retries in offline_pages() if migration fails Joanne Koong
2024-11-08 17:33   ` SeongJae Park
2024-11-08 18:56     ` David Hildenbrand
2024-11-08 19:00       ` David Hildenbrand
2024-11-08 21:27         ` Shakeel Butt
2024-11-08 21:42           ` Joanne Koong
2024-11-08 22:16             ` Shakeel Butt
2024-11-08 22:20               ` Joanne Koong
2024-11-08 21:59     ` Joanne Koong
2024-11-07 23:56 ` [PATCH v4 5/6] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_MAY_BLOCK mappings Joanne Koong
2024-11-07 23:56 ` [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong
2024-11-08  8:48   ` Jingbo Xu
2024-11-08 22:33     ` Joanne Koong
2024-11-11  8:32   ` Jingbo Xu
2024-11-11 21:30     ` Joanne Koong
2024-11-12  2:31       ` Jingbo Xu
2024-11-13 19:11         ` Joanne Koong [this message]
2024-11-12  9:25   ` Jingbo Xu
2024-11-14  0:39     ` Joanne Koong
2024-11-14  1:46       ` Jingbo Xu
2024-11-14 18:19         ` Joanne Koong
2024-11-15  2:18           ` Jingbo Xu
2024-11-15 18:29             ` Joanne Koong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAJnrk1YvCj2t123iRgyewfWkMBYBMJsw+s47su-=13u4dF7W3Q@mail.gmail.com' \
    --to=joannelkoong@gmail.com \
    --cc=bernd.schubert@fastmail.fm \
    --cc=jefflexu@linux.alibaba.com \
    --cc=josef@toxicpanda.com \
    --cc=kernel-team@meta.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=miklos@szeredi.hu \
    --cc=shakeel.butt@linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox