From: Vlastimil Babka <vbabka@suse.cz>
To: Michal Hocko <mhocko@suse.com>, Matthew Wilcox <willy@infradead.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>,
libaokun@huaweicloud.com, linux-mm@kvack.org,
akpm@linux-foundation.org, surenb@google.com,
jackmanb@google.com, hannes@cmpxchg.org, ziy@nvidia.com,
jack@suse.cz, yi.zhang@huawei.com, yangerkun@huawei.com,
libaokun1@huawei.com
Subject: Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS
Date: Mon, 3 Nov 2025 10:01:54 +0100
Message-ID: <9d5790f0-4a07-4cca-9f94-de101084a7e6@suse.cz>
In-Reply-To: <aQhf5LJJMlvT-rrE@tiehlicka>
On 11/3/25 08:55, Michal Hocko wrote:
> On Fri 31-10-25 16:55:44, Matthew Wilcox wrote:
>> On Fri, Oct 31, 2025 at 09:46:17AM -0700, Shakeel Butt wrote:
>> > Now for the interface to allow NOFS+NOFAIL+higher_order, I think a new
>> > (FS specific) gfp is fine but will require some maintenance to avoid
>> > abuse.
>>
>> I don't think a new GFP flag is the answer. GFP_TRUST_ME_BRO just
>> doesn't feel right.
>
> Yeah, as usual a new gfp flag seems convenient, except history has taught
> us this rarely works.
>
>> > I am more interested in how to codify "you can reclaim one I've already
>> > allocated". I have a different scenario where the network stack keeps
>> > stealing memory from direct reclaimers and keeping them in reclaim for a
>> > long time. If we have some mechanism to allow reclaimers to get back the
>> > memory they have reclaimed (at least for some cases), I think that can
>> > be used in both cases.
>>
>> The only thing that comes to mind is putting pages freed by reclaim on
>> a list in task_struct instead of sending them back to the allocator.
>> Then the task can allocate from there and free up anything else it's
>> reclaimed at some later point. I don't think this is a good idea,
>> but it's the only idea that comes to mind.
>
> I played with that idea years ago, mostly to deal with direct
> reclaim unfairness when some reclaimers were doing a lot of work on
> behalf of everybody else. IIRC I ran into different problems, like
> reclaim throttling and over-reclaim.

Btw, meanwhile we got this implemented in compaction, see
compaction_capture(). As the hook is in __free_one_page(), it should now be
straightforward to arm it also for direct reclaim of e.g. __GFP_NOFAIL
costly-order allocations. It probably wouldn't make sense for non-costly
orders, because those are freed to the pcplists and we wouldn't want to make
them more expensive by adding the hook there too.

The hook in compaction likely already helps such allocations. But if you
expect reclaim of order-4 pages to become common thanks to the large
blocks, it could help to do the capture in reclaim too.
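
For reference, the hook boils down to roughly this (heavily simplified;
compaction_capture() and __free_one_page() are the real names, the helpers
around them are only approximate):

  /* in __free_one_page() */
  struct capture_control *capc = task_capc(zone); /* current->capture_control */

  if (compaction_capture(capc, page, order, migratetype)) {
          /*
           * The page is handed straight to the task that armed the
           * capture_control and never hits the freelists, so nobody
           * else can steal it in between.
           */
          return;
  }
  /* otherwise merge into the buddy freelists as usual */

Arming it for reclaim would then mostly mean setting up
current->capture_control around the direct reclaim of a costly-order
__GFP_NOFAIL request, the same way the direct compaction path does now.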
> Anyway, the page allocator does respect GFP_NOFAIL even for high-order
> requests. The oom killer will be disabled for order-4, but as these will
> likely be GFP_NOFS anyway, the order doesn't make much of a
> difference. So these requests could really take a long time to succeed, but
> I guess this will be generally understood. As the vmalloc fallback
> doesn't seem to be a feasible option short (maybe even mid) term, this
> is the only choice we have other than failing allocations and
> seeing a lot of fs failures.
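
(For context, the two checks that keep the OOM killer away here are roughly
these, paraphrased from __alloc_pages_may_oom() and out_of_memory(), not
verbatim:)

  /* __alloc_pages_may_oom(): the OOM killer will not help higher order allocs */
  if (order > PAGE_ALLOC_COSTLY_ORDER)
          goto out;

  /* out_of_memory(): IO-less (!__GFP_FS) reclaim doesn't get to kill tasks either */
  if (!(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
          return true;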
>
> That being said, I would much rather go and drop the order warning than
> try to invent some fine tuning based on use case. We might need to

Agreed. Note it would also solve the warnings we saw syzbot etc. trigger via
slab by allocating a <8k object with __GFP_NOFAIL. Normally this would pass
the __GFP_NOFAIL only to the fallback minimum-size (order-1) slab allocation
and thus be fine, but it can result in an order>1 allocation if you enable
KASAN or another debugging option that bumps the <8k object to needing >8k
of space with the debug metadata.
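
To illustrate (the size and flags are made up, but representative of those
reports):

  /* a 4k < size <= 8k object, i.e. served from the kmalloc-8k cache */
  buf = kmalloc(7000, GFP_NOFS | __GFP_NOFAIL);

  /*
   * Without debugging, an 8k object fits in an order-1 slab page, so the
   * __GFP_NOFAIL fallback allocation in slab is at most order-1 and nothing
   * warns. With KASAN etc. the object plus redzone/metadata exceeds 8k, the
   * minimum slab page becomes order-2 and the page allocator's order check
   * for __GFP_NOFAIL fires.
   */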
Maybe we could keep the warning for >= PMD_ORDER, as that would still mean
someone made an error?
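
I.e. instead of dropping the check completely, keep something along these
lines (sketch only; this would go wherever today's order>1 check for
__GFP_NOFAIL lives, rmqueue() IIRC):

  /* only complain about clearly bogus nofail orders */
  WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && order >= PMD_ORDER);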
> invent some OOM protection for order-3 nofail requests, as the OOM killer
> could just do too much harm killing tasks without much of a chance to
> defragment memory. Let's deal with that once we see that happening.