From: Shakeel Butt <shakeel.butt@linux.dev>
To: Matthew Wilcox <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>, Michal Hocko <mhocko@suse.com>,
libaokun@huaweicloud.com, linux-mm@kvack.org,
akpm@linux-foundation.org, surenb@google.com,
jackmanb@google.com, hannes@cmpxchg.org, ziy@nvidia.com,
jack@suse.cz, yi.zhang@huawei.com, yangerkun@huawei.com,
libaokun1@huawei.com
Subject: Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS
Date: Fri, 31 Oct 2025 08:35:50 -0700
Message-ID: <fwm7tynepbnpq3db7u2fqrzmeq55gs5472r7wbggxmndap3k26@ngwgdcy26vxx>
In-Reply-To: <aQTHMI3t5mNXp0M1@casper.infradead.org>
On Fri, Oct 31, 2025 at 02:26:56PM +0000, Matthew Wilcox wrote:
> On Fri, Oct 31, 2025 at 11:12:16AM +0100, Vlastimil Babka wrote:
> > On 10/31/25 08:25, Michal Hocko wrote:
> > > On Fri 31-10-25 14:13:50, libaokun@huaweicloud.com wrote:
> > >> From: Baokun Li <libaokun1@huawei.com>
> > >>
> > >> Filesystems use __GFP_NOFAIL to allocate block-sized folios for metadata
> > >> reads at critical points, since they cannot afford to go read-only,
> > >> shut down, or enter an inconsistent state due to memory pressure.
> > >>
> > >> Currently, attempting to allocate page units greater than order-1 with
> > >> the __GFP_NOFAIL flag triggers a WARN_ON() in __alloc_pages_slowpath().
> > >> However, filesystems supporting large block sizes (blocksize > PAGE_SIZE)
> > >> can easily require allocations larger than order-1.
> > >>
> > >> As Matthew noted, if we have a filesystem with 64KiB sectors, there will
> > >> be many clean folios in the page cache that are 64KiB or larger.
> > >>
> > >> Therefore, to avoid the warning when LBS is enabled, we relax this
> > >> restriction to allow allocations up to BLK_MAX_BLOCK_SIZE. The current
> > >> maximum supported logical block size is 64KiB, meaning the maximum order
> > >> handled here is 4.
> > >
> > > Would using kvmalloc be an option instead of this?
> >
> > The thread under Link: suggests xfs has its own vmalloc callback. But it's
> > not one of the 5 options listed, so it's a good question how difficult it
> > would be to implement that for ext4 or in general.
>
> It's implicit in options 1-4. Today, the buffer cache is an alias into
> the page cache. The page cache can only store folios. So to use
> vmalloc, we either have to make folios discontiguous, stop the buffer
> cache being an alias into the page cache, or stop ext4 from using the
> buffer cache.
>
> > > This change doesn't really make much sense to me TBH. While the order=1
> > > is rather arbitrary, it is an internal allocator constraint - i.e. the order which
> > > the allocator can sustain for NOFAIL requests is directly related to
> > > memory reclaim and internal allocator operation rather than something as
> > > external as block size. If the allocator needs to support 64kB NOFAIL
> > > requests because there is a strong demand for that then fine and we can
> > > see whether this is feasible.
>
> Maybe Baokun's explanation for why this is unlikely to be a problem in
> practice didn't make sense to you. Let me try again, perhaps being more
> explicit about things which an fs developer would know but an MM person
> might not realise.
>
> Hard drive manufacturers are absolutely gagging to ship drives with a
> 64KiB sector size. Once they do, the minimum transfer size to/from a
> device becomes 64KiB. That means the page cache will cache all files
> (and fs metadata) from that drive in contiguous 64KiB chunks. That means
> that when reclaim shakes the page cache, it's going to find a lot of
> order-4 folios to free ... which means that the occasional GFP_NOFAIL
> order-4 allocation is going to have no trouble finding order-4 pages to
> satisfy the allocation.
>
> Now, the problem is the non-filesystems which may now take advantage of
> this to write lazy code. It'd be nice if we had some token that said
> "hey, I'm the page cache, I know what I'm doing, trust me if I'm doing a
> NOFAIL high-order allocation, you can reclaim one I've already allocated
> and everything will be fine". But I can't see a way to put that kind
> of token into our interfaces.
A new gfp flag should be easy enough. However, "you can reclaim one I've
already allocated" is not something the current allocation and reclaim
code can act on. Maybe that is something we can add. In addition, the
behavior change for costly orders needs more thought.
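
For reference, the order arithmetic discussed in this thread can be
sketched in userspace C. This is only an illustration under stated
assumptions, not code from the patch: it assumes 4KiB pages, takes the
64KiB maximum block size from the patch description, and uses the value 3
to mirror the kernel's PAGE_ALLOC_COSTLY_ORDER. All sketch_* names are
hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumptions for illustration: 4KiB pages and the 64KiB maximum
 * logical block size quoted in the patch description. */
#define SKETCH_PAGE_SHIFT         12
#define SKETCH_PAGE_SIZE          ((size_t)1 << SKETCH_PAGE_SHIFT)
#define SKETCH_BLK_MAX_BLOCK_SIZE ((size_t)64 * 1024)
/* Mirrors the kernel's PAGE_ALLOC_COSTLY_ORDER. */
#define SKETCH_COSTLY_ORDER       3

/* Smallest order such that (page size << order) >= size. */
static unsigned int sketch_get_order(size_t size)
{
	unsigned int order = 0;

	assert(size > 0);
	while ((SKETCH_PAGE_SIZE << order) < size)
		order++;
	return order;
}

/* Rule as described in the patch text: NOFAIL warns above order-1. */
static bool sketch_nofail_warns_today(unsigned int order)
{
	return order > 1;
}

/* Relaxed rule proposed by the RFC: warn only above the order that
 * corresponds to the maximum block size. */
static bool sketch_nofail_warns_rfc(unsigned int order)
{
	return order > sketch_get_order(SKETCH_BLK_MAX_BLOCK_SIZE);
}

/* Orders above the costly threshold are treated as "costly". */
static bool sketch_is_costly(unsigned int order)
{
	return order > SKETCH_COSTLY_ORDER;
}
```

Under these assumptions a 64KiB block maps to order-4. That order warns
under the current rule, would not warn under the relaxed rule, and sits
above the costly threshold, so a NOFAIL allocation of this size crosses
into costly-order territory.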