From: Jingbo Xu <jefflexu@linux.alibaba.com>
To: David Hildenbrand <david@redhat.com>,
Shakeel Butt <shakeel.butt@linux.dev>
Cc: Zi Yan <ziy@nvidia.com>, Joanne Koong <joannelkoong@gmail.com>,
miklos@szeredi.hu, linux-fsdevel@vger.kernel.org,
josef@toxicpanda.com, bernd.schubert@fastmail.fm,
linux-mm@kvack.org, kernel-team@meta.com,
Matthew Wilcox <willy@infradead.org>,
Oscar Salvador <osalvador@suse.de>,
Michal Hocko <mhocko@kernel.org>
Subject: Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
Date: Fri, 20 Dec 2024 15:55:14 +0800
Message-ID: <d48ae58e-500f-4ef1-bc6f-a41a8f5b94bf@linux.alibaba.com>
In-Reply-To: <968d3543-d8ac-4b5a-af8e-e6921311d5cf@redhat.com>
Hi,
On 12/20/24 12:41 AM, David Hildenbrand wrote:
> On 19.12.24 17:40, Shakeel Butt wrote:
>> On Thu, Dec 19, 2024 at 05:29:08PM +0100, David Hildenbrand wrote:
>> [...]
>>>>
>>>> If you check the code just above this patch, this
>>>> mapping_writeback_indeterminate() check only happen for pages under
>>>> writeback which is a temp state. Anyways, fuse folios should not be
>>>> unmovable for their lifetime but only while under writeback which is
>>>> same for all fs.
>>>
>>> But there, writeback is expected to be a temporary thing, not possibly:
>>> "AS_WRITEBACK_INDETERMINATE", that is a BIG difference.
>>>
>>> I'll have to NACK anything that violates ZONE_MOVABLE / ALLOC_CMA
>>> guarantees, and unfortunately, it sounds like this is the case here,
>>> unless I am missing something important.
>>>
>>
>> It might just be the name "AS_WRITEBACK_INDETERMINATE" is causing
>> the confusion. The writeback state is not indefinite. A proper fuse fs,
>> like anyother fs, should handle writeback pages appropriately. These
>> additional checks and skips are for (I think) untrusted fuse servers.
>
> Can unprivileged user space provoke this case?
>
There are some details on the initial problem that the FUSE community
wants to fix in [1].
In summary, a non-malicious fuse daemon may need to allocate memory
when processing a FUSE_WRITE request (initiated from the writeback
routine). That allocation can trigger memory reclaim and compaction,
which in turn wait on the writeback of **FUSE** dirty pages, which
itself waits for the fuse daemon to handle the request: a deadlock.
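
The cycle described above can be sketched as a small wait-for graph.
This is a toy user-space model in Python, not kernel code; the node
names are labels made up for illustration, not kernel symbols.

```python
# Toy wait-for graph for the deadlock described above.
# Each key waits for its value; all names are illustrative labels.
WAITS_FOR = {
    "writeback of FUSE dirty page": "fuse daemon handles FUSE_WRITE",
    "fuse daemon handles FUSE_WRITE": "memory allocation in daemon",
    "memory allocation in daemon": "reclaim/compaction",
    "reclaim/compaction": "writeback of FUSE dirty page",
}


def find_cycle(graph, start):
    """Follow 'waits for' edges from start; return the cycle, or None."""
    path = []
    node = start
    while node is not None and node not in path:
        path.append(node)
        node = graph.get(node)
    if node is None:
        return None  # the chain ended, so nothing waits forever
    # node was seen before: the cycle runs from its first occurrence
    return path[path.index(node):] + [node]


cycle = find_cycle(WAITS_FOR, "writeback of FUSE dirty page")
```

Removing any single edge breaks the cycle, which is what both the
temp page design and this patch aim at, each cutting a different edge.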
The current FUSE implementation avoids this by using a "temp page" in
the FUSE writeback routine. In short, a temporary page (allocated
from ZONE_UNMOVABLE) is allocated for each dirty page cache page that
needs to be written back, and the content is copied from the original
page to the temporary page. The original page (the one to be written
back, allocated from ZONE_MOVABLE) then has its PG_writeback bit
cleared immediately, so that the fuse daemon cannot get stuck in a
deadlock waiting for the writeback of FUSE page cache. The actual
writeback is done on the temporary copy instead.
Thus there are actually two pages for each FUSE page cache page under
writeback: the original page (in ZONE_MOVABLE) and the temporary page
(in ZONE_UNMOVABLE).
- The original page has its PG_writeback bit cleared very quickly in
  the writeback routine, so it won't block direct reclaim or
  compaction at all.
- As for the temporary page, in the normal case the fuse server
  completes the FUSE_WRITE request as expected, and thus the temporary
  page is freed soon.
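
The two-page scheme above can be modelled roughly as follows. This is
only a toy sketch in Python, not the fs/fuse code: the Page class, the
function names and the zone labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Toy stand-in for a page; 'zone' is only a label in this model."""
    data: bytes
    zone: str
    writeback: bool = False


def start_writeback_with_temp_page(orig: Page) -> Page:
    """Model of the temp-page scheme: copy, then end writeback on orig."""
    orig.writeback = True
    # The temporary copy carries the data until the daemon replies.
    tmp = Page(data=bytes(orig.data), zone="unmovable", writeback=True)
    # The original page leaves writeback right away, so reclaim and
    # compaction never have to wait on it.
    orig.writeback = False
    return tmp


def complete_fuse_write(tmp: Page) -> None:
    """Models the daemon answering FUSE_WRITE; the temp page is done."""
    tmp.writeback = False


orig = Page(data=b"dirty data", zone="movable")
tmp = start_writeback_with_temp_page(orig)
```

In this model only tmp depends on the daemon: if complete_fuse_write()
is never called, tmp stays under writeback while orig is unaffected.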
However, FUSE supports unprivileged mounts, in which case the fuse
daemon is run, and the filesystem mounted, by an unprivileged user.
Thus the backend fuse daemon may be malicious and refuse to process
any FUSE requests. In the worst case, these temporary pages will
never complete writeback and stay pinned in ZONE_UNMOVABLE forever.
(One thing worth noting is that once the fuse daemon gets killed, the
whole FUSE filesystem is aborted, all inflight FUSE requests are
flushed, and all the temporary pages are freed.)
What this patchset does is drop the temporary page design from the
FUSE writeback routine. This patch is introduced to avoid the
above-mentioned deadlock in memory compaction for a *sane* FUSE daemon
once the temp page design is gone.
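
Roughly, the check this patch adds can be modelled as below. This is a
toy Python sketch of the decision only, going by the patch subject; the
function name and the returned strings are hypothetical, and the real
logic lives in mm/migrate.c.

```python
def migrate_decision(under_writeback: bool, mapping_indeterminate: bool) -> str:
    """Toy model: what migration does with one folio."""
    if not under_writeback:
        # The new check is only reached for folios under writeback.
        return "migrate"
    if mapping_indeterminate:
        # AS_WRITEBACK_INDETERMINATE mapping: skip instead of waiting,
        # since the wait may depend on the fuse daemon.
        return "skip"
    # Other filesystems keep whatever migration already did here.
    return "existing writeback handling"
```

For example, migrate_decision(True, True) models a FUSE folio under
writeback, which is skipped rather than waited on.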
Currently the FUSE writeback pages (i.e. FUSE page cache) are
allocated with GFP_HIGHUSER_MOVABLE, which is consistent with other
filesystems. In the normal case (the FUSE filesystem is backed by a
well-behaved FUSE daemon), writeback of the page cache completes in a
reasonable time and won't affect the usability of ZONE_MOVABLE.
In the worst case (a malicious FUSE daemon run by an unprivileged
user), these page cache pages in ZONE_MOVABLE can be pinned there
indefinitely.
We can argue that in the current implementation (without this patch
series), ZONE_UNMOVABLE usage can also grow larger and larger and pin
quite a lot of memory in the worst case (correct me if I'm wrong). To
that extent, this patch doesn't make things worse. Besides, FUSE
enables the strictlimit feature by default, so each FUSE filesystem
can consume at most 1% of the global vm.dirty_background_thresh before
write throttling is triggered.
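
To give a feel for the 1% bound, here is a back-of-the-envelope sketch
in Python. The threshold value and the 4 KiB page size are made-up
example figures, and the helper name is hypothetical; only the 1% ratio
comes from the text above.

```python
PAGE_SIZE = 4096  # bytes; assumed page size for this example only


def per_fs_dirty_cap(global_thresh_pages: int, ratio_percent: int = 1) -> int:
    """Pages one FUSE filesystem may dirty before throttling (toy model)."""
    return global_thresh_pages * ratio_percent // 100


# Hypothetical global threshold of 400,000 pages.
cap_pages = per_fs_dirty_cap(400_000)
cap_bytes = cap_pages * PAGE_SIZE
```

With these example numbers, a single FUSE filesystem is throttled once
it has 4,000 pages (about 16 MB) dirty or under writeback.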
[1] https://lore.kernel.org/all/8eec0912-7a6c-4387-b9be-6718f438a111@linux.alibaba.com/
--
Thanks,
Jingbo