From: Jingbo Xu <jefflexu@linux.alibaba.com>
To: Joanne Koong <joannelkoong@gmail.com>
Cc: miklos@szeredi.hu, linux-fsdevel@vger.kernel.org,
shakeel.butt@linux.dev, josef@toxicpanda.com, linux-mm@kvack.org,
bernd.schubert@fastmail.fm, kernel-team@meta.com
Subject: Re: [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree
Date: Tue, 12 Nov 2024 10:31:43 +0800 [thread overview]
Message-ID: <0f585a7c-678b-492a-9492-358f21e57291@linux.alibaba.com> (raw)
In-Reply-To: <CAJnrk1YUPZhCUhGqu+bBngzrG-yCCRLZc7fiOfXQZ0dxCHJV8Q@mail.gmail.com>
On 11/12/24 5:30 AM, Joanne Koong wrote:
> On Mon, Nov 11, 2024 at 12:32 AM Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>
>> Hi, Joanne and Miklos,
>>
>> On 11/8/24 7:56 AM, Joanne Koong wrote:
>>> Currently, we allocate and copy data to a temporary folio when
>>> handling writeback in order to mitigate the following deadlock scenario
>>> that may arise if reclaim waits on writeback to complete:
>>> * single-threaded FUSE server is in the middle of handling a request
>>> that needs a memory allocation
>>> * memory allocation triggers direct reclaim
>>> * direct reclaim waits on a folio under writeback
>>> * the FUSE server can't write back the folio since it's stuck in
>>> direct reclaim
>>>
>>> To work around this, we allocate a temporary folio and copy over the
>>> original folio to the temporary folio so that writeback can be
>>> immediately cleared on the original folio. This additionally requires us
>>> to maintain an internal rb tree to keep track of writeback state on the
>>> temporary folios.
>>>
>>> A recent change prevents reclaim logic from waiting on writeback for
>>> folios whose mappings have the AS_WRITEBACK_MAY_BLOCK flag set in it.
>>> This commit sets AS_WRITEBACK_MAY_BLOCK on FUSE inode mappings (which
>>> will prevent FUSE folios from running into the reclaim deadlock described
>>> above) and removes the temporary folio + extra copying and the internal
>>> rb tree.
>>>
>>> fio benchmarks --
>>> (using averages observed from 10 runs, throwing away outliers)
>>>
>>> Setup:
>>> sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
>>> ./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
>>>
>>> fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
>>> --numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
>>>
>>> bs = 1k 4k 1M
>>> Before 351 MiB/s 1818 MiB/s 1851 MiB/s
>>> After 341 MiB/s 2246 MiB/s 2685 MiB/s
>>> % diff -3% 23% 45%
>>>
>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>
>>
>> IIUC this patch seems to break commit
>> 8b284dc47291daf72fe300e1138a2e7ed56f38ab ("fuse: writepages: handle same
>> page rewrites").
>>
>
> Interesting! My understanding was that we only needed that commit
> because we were clearing writeback on the original folio before
> writeback had actually finished.
>
> Now that folio writeback state is accounted for normally (eg through
> writeback being set/cleared on the original folio), does the
> folio_wait_writeback() call we do in fuse_page_mkwrite() not mitigate
> this?
Yes, after inspecting the writeback logic more, it seems that the second
writeback won't be initiated if the first one has not completed yet, see
```
a_ops->writepages
write_cache_pages
writeback_iter
writeback_get_folio
folio_prepare_writeback
if folio_test_writeback(folio):
folio_wait_writeback(folio)
```
and thus it won't be an issue to remove the auxiliary list ;)
>
>>> - /*
>>> - * Being under writeback is unlikely but possible. For example direct
>>> - * read to an mmaped fuse file will set the page dirty twice; once when
>>> - * the pages are faulted with get_user_pages(), and then after the read
>>> - * completed.
>>> - */
>>
>> In short, the target scenario is like:
>>
>> ```
>> # open a fuse file and mmap
>> fd1 = open("fuse-file-path", ...)
>> uaddr = mmap(fd1, ...)
>>
>> # DIRECT read to the mmaped fuse file
>> fd2 = open("ext4-file-path", O_DIRECT, ...)
>> read(fd2, uaddr, ...)
>> # get_user_pages() of uaddr, and triggers faultin
>> # a_ops->dirty_folio() <--- mark PG_dirty
>>
>> # when DIRECT IO completed:
>> # a_ops->dirty_folio() <--- mark PG_dirty
>
> If you have the direct io function call stack at hand, could you point
> me to the function where the direct io completion marks this folio as
> dirty?
FYI The full call stack is like:
```
# DIRECT read(2) to the mmaped fuse file
read(fd2, uaddr1, ...)
f_ops->read_iter()
(iomap-based ) iomap_dio_rw
# for READ && user_backed_iter(iter):
dio->flags |= IOMAP_DIO_DIRTY
iomap_dio_iter
iomap_dio_bio_iter
# add user or kernel pages to a bio
bio_iov_iter_get_pages
...
pin_user_pages_fast(..., FOLL_WRITE, ...)
# find corresponding vma of dest buffer (fuse page cache)
# search page table (pet) to find corresponding page
# if not fault yet, trigger explicit faultin:
faultin_page(..., FOLL_WRITE, ...)
handle_mm_fault(..., FAULT_FLAG_WRITE)
handle_pte_fault
do_wp_page
(vma->vm_flags & VM_SHARED) wp_page_shared
...
fault_dirty_shared_page
folio_mark_dirty
a_ops->dirty_folio(), i.e.,
filemap_dirty_folio()
# set PG_dirty
folio_test_set_dirty(folio)
# set PAGECACHE_TAG_DIRTY
__folio_mark_dirty
# if dio->flags & IOMAP_DIO_DIRTY:
bio_set_pages_dirty
(for each dest page) folio_mark_dirty
a_ops->dirty_folio(), i.e., filemap_dirty_folio()
# set PG_dirty
folio_test_set_dirty(folio)
# set PAGECACHE_TAG_DIRTY
__folio_mark_dirty
```
>
>> ```
>>
>> The auxiliary write request list was introduced to fix this.
>>
>> I'm not sure if there's an alternative other than the auxiliary list to
>> fix it, e.g. calling folio_wait_writeback() in a_ops->dirty_folio() so
>> that the same folio won't get dirtied when the writeback has not
>> completed yet?
>>
>
> I'm curious how other filesystems solve for this - this seems like a
> generic situation other filesystems would run into as well.
>
As mentioned above, the writeback path will prevent the duplicate
writeback request on the same page when the first writeback IO has not
completed yet.
Sorry for the noise...
--
Thanks,
Jingbo
next prev parent reply other threads:[~2024-11-12 2:31 UTC|newest]
Thread overview: 32+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-11-07 23:56 [PATCH v4 0/6] fuse: remove temp page copies in writeback Joanne Koong
2024-11-07 23:56 ` [PATCH v4 1/6] mm: add AS_WRITEBACK_MAY_BLOCK mapping flag Joanne Koong
2024-11-09 0:10 ` Shakeel Butt
2024-11-11 21:11 ` Joanne Koong
2024-11-15 19:33 ` Joanne Koong
2024-11-15 20:17 ` Joanne Koong
2024-11-07 23:56 ` [PATCH v4 2/6] mm: skip reclaiming folios in legacy memcg writeback contexts that may block Joanne Koong
2024-11-09 0:16 ` Shakeel Butt
2024-11-07 23:56 ` [PATCH v4 3/6] fs/writeback: in wait_sb_inodes(), skip wait for AS_WRITEBACK_MAY_BLOCK mappings Joanne Koong
2024-11-07 23:56 ` [PATCH v4 4/6] mm/memory-hotplug: add finite retries in offline_pages() if migration fails Joanne Koong
2024-11-08 17:33 ` SeongJae Park
2024-11-08 18:56 ` David Hildenbrand
2024-11-08 19:00 ` David Hildenbrand
2024-11-08 21:27 ` Shakeel Butt
2024-11-08 21:42 ` Joanne Koong
2024-11-08 22:16 ` Shakeel Butt
2024-11-08 22:20 ` Joanne Koong
2024-11-08 21:59 ` Joanne Koong
2024-11-07 23:56 ` [PATCH v4 5/6] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_MAY_BLOCK mappings Joanne Koong
2024-11-07 23:56 ` [PATCH v4 6/6] fuse: remove tmp folio for writebacks and internal rb tree Joanne Koong
2024-11-08 8:48 ` Jingbo Xu
2024-11-08 22:33 ` Joanne Koong
2024-11-11 8:32 ` Jingbo Xu
2024-11-11 21:30 ` Joanne Koong
2024-11-12 2:31 ` Jingbo Xu [this message]
2024-11-13 19:11 ` Joanne Koong
2024-11-12 9:25 ` Jingbo Xu
2024-11-14 0:39 ` Joanne Koong
2024-11-14 1:46 ` Jingbo Xu
2024-11-14 18:19 ` Joanne Koong
2024-11-15 2:18 ` Jingbo Xu
2024-11-15 18:29 ` Joanne Koong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=0f585a7c-678b-492a-9492-358f21e57291@linux.alibaba.com \
--to=jefflexu@linux.alibaba.com \
--cc=bernd.schubert@fastmail.fm \
--cc=joannelkoong@gmail.com \
--cc=josef@toxicpanda.com \
--cc=kernel-team@meta.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=miklos@szeredi.hu \
--cc=shakeel.butt@linux.dev \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox