From: Shakeel Butt <shakeel.butt@linux.dev>
To: Matthew Wilcox <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>, Michal Hocko <mhocko@suse.com>,
libaokun@huaweicloud.com, linux-mm@kvack.org,
akpm@linux-foundation.org, surenb@google.com,
jackmanb@google.com, hannes@cmpxchg.org, ziy@nvidia.com,
jack@suse.cz, yi.zhang@huawei.com, yangerkun@huawei.com,
libaokun1@huawei.com
Subject: Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS
Date: Fri, 31 Oct 2025 08:52:49 -0700 [thread overview]
Message-ID: <qfo7raavlfupjqwbgl2mnmfvmn3z5oslnxcehorib3xjdgf4yo@bp76xmxplre7> (raw)
In-Reply-To: <fwm7tynepbnpq3db7u2fqrzmeq55gs5472r7wbggxmndap3k26@ngwgdcy26vxx>
On Fri, Oct 31, 2025 at 08:35:50AM -0700, Shakeel Butt wrote:
> On Fri, Oct 31, 2025 at 02:26:56PM +0000, Matthew Wilcox wrote:
> > On Fri, Oct 31, 2025 at 11:12:16AM +0100, Vlastimil Babka wrote:
> > > On 10/31/25 08:25, Michal Hocko wrote:
> > > > On Fri 31-10-25 14:13:50, libaokun@huaweicloud.com wrote:
> > > >> From: Baokun Li <libaokun1@huawei.com>
> > > >>
> > > >> Filesystems use __GFP_NOFAIL to allocate block-sized folios for metadata
> > > >> reads at critical points, since they cannot afford to go read-only,
> > > >> shut down, or enter an inconsistent state due to memory pressure.
> > > >>
> > > >> Currently, attempting an allocation larger than order-1 with the
> > > >> __GFP_NOFAIL flag triggers a WARN_ON() in __alloc_pages_slowpath().
> > > >> However, filesystems supporting large block sizes (blocksize > PAGE_SIZE)
> > > >> can easily require allocations larger than order-1.
> > > >>
> > > >> As Matthew noted, if we have a filesystem with 64KiB sectors, there will
> > > >> be many clean folios in the page cache that are 64KiB or larger.
> > > >>
> > > >> Therefore, to avoid the warning when LBS is enabled, we relax this
> > > >> restriction to allow allocations up to BLK_MAX_BLOCK_SIZE. The current
> > > >> maximum supported logical block size is 64KiB, meaning the maximum order
> > > >> handled here is 4.
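> > > >>
> > > >> A minimal sketch of the idea (illustrative only; details may differ
> > > >> from the actual patch):
> > > >>
> > > >> 	if (gfp_mask & __GFP_NOFAIL) {
> > > >> 		...
> > > >> -		WARN_ON_ONCE_GFP(order > 1, gfp_mask);
> > > >> +		/* LBS metadata folios can be up to BLK_MAX_BLOCK_SIZE
> > > >> +		 * (64KiB today), i.e. order-4 with 4KiB base pages. */
> > > >> +		WARN_ON_ONCE_GFP(order > get_order(BLK_MAX_BLOCK_SIZE),
> > > >> +				 gfp_mask);
> > > >> 	}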
> > > >
> > > > Would using kvmalloc be an option instead of this?
> > >
> > > The thread under Link: suggests xfs has its own vmalloc callback. But it's
> > > not one of the 5 options listed, so it's a good question how difficult it
> > > would be to implement that for ext4 or in general.
> >
> > It's implicit in options 1-4. Today, the buffer cache is an alias into
> > the page cache. The page cache can only store folios. So to use
> > vmalloc, we either have to make folios discontiguous, stop the buffer
> > cache being an alias into the page cache, or stop ext4 from using the
> > buffer cache.
> >
> > > > This change doesn't really make much sense to me TBH. While the order=1
> > > > limit is rather arbitrary, it is an internal allocator constraint - i.e.
> > > > the order which the allocator can sustain for NOFAIL requests is directly
> > > > related to memory reclaim and internal allocator operation rather than
> > > > something as external as block size. If the allocator needs to support
> > > > 64kB NOFAIL requests because there is a strong demand for that, then fine,
> > > > and we can see whether this is feasible.
> >
> > Maybe Baokun's explanation for why this is unlikely to be a problem in
> > practice didn't make sense to you. Let me try again, perhaps being more
> > explicit about things which an fs developer would know but an MM person
> > might not realise.
> >
> > Hard drive manufacturers are absolutely gagging to ship drives with a
> > 64KiB sector size. Once they do, the minimum transfer size to/from a
> > device becomes 64KiB. That means the page cache will cache all files
> > (and fs metadata) from that drive in contiguous 64KiB chunks. That means
> > that when reclaim shakes the page cache, it's going to find a lot of
> > order-4 folios to free ... which means that the occasional GFP_NOFAIL
> > order-4 allocation is going to have no trouble finding order-4 pages to
> > satisfy the allocation.
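> > 
> > (Worked out: 64KiB / 4KiB base pages = 16 pages = 2^4, i.e.
> > get_order(SZ_64K) == 4 -- hence the order-4 folios.)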
> >
> > Now, the problem is the non-filesystems which may now take advantage of
> > this to write lazy code. It'd be nice if we had some token that said
> > "hey, I'm the page cache, I know what I'm doing, trust me if I'm doing a
> > NOFAIL high-order allocation, you can reclaim one I've already allocated
> > and everything will be fine". But I can't see a way to put that kind
> > of token into our interfaces.
>
> A new gfp flag should be easy enough. However "you can reclaim one I've
> already allocated" is not something current allocation & reclaim can
> take any action on. Maybe that is something we can add. In addition, the
> behavior change for costly orders needs more thought.
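> 
> As a purely hypothetical sketch (the flag name and bit value are
> invented here for illustration; no such flag exists today):
> 
> 	/* Hypothetical: caller is the page cache and guarantees clean
> 	 * folios of this order already exist, so a NOFAIL high-order
> 	 * request can be satisfied by reclaiming one of them. */
> 	#define __GFP_PAGECACHE_NOFAIL	((__force gfp_t)0x10000000u)
> 
> 	folio = folio_alloc(GFP_NOFS | __GFP_NOFAIL | __GFP_PAGECACHE_NOFAIL,
> 			    get_order(BLK_MAX_BLOCK_SIZE));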
>
After reading the background link, it seems like the actual allocation
will be NOFS + NOFAIL + higher_order. With NOFS, current reclaim cannot
really reclaim any file memory (page cache). However, with writeback now
gone from the reclaim path, I wonder whether we should allow reclaiming
clean file pages even in NOFS context (needs some digging).
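
The gate in question, heavily simplified from shrink_folio_list() (the
real code handles many more cases):

	if (folio_test_dirty(folio) || folio_test_writeback(folio)) {
		/* Writing back needs the FS, so NOFS has to skip these. */
		if (!may_enter_fs(folio, sc->gfp_mask))
			goto keep;
	}
	/* A clean file folio needs no FS involvement to be freed, so in
	 * principle it could be reclaimed even in NOFS context. */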
Thread overview: 20+ messages
2025-10-31 6:13 libaokun
2025-10-31 7:25 ` Michal Hocko
2025-10-31 10:12 ` Vlastimil Babka
2025-10-31 14:26 ` Matthew Wilcox
2025-10-31 15:35 ` Shakeel Butt
2025-10-31 15:52 ` Shakeel Butt [this message]
2025-10-31 15:54 ` Matthew Wilcox
2025-10-31 16:46 ` Shakeel Butt
2025-10-31 16:55 ` Matthew Wilcox
2025-11-03 2:45 ` Baokun Li
2025-11-03 7:55 ` Michal Hocko
2025-11-03 9:01 ` Vlastimil Babka
2025-11-03 9:25 ` Michal Hocko
2025-11-04 10:31 ` Michal Hocko
2025-11-04 12:32 ` Vlastimil Babka
2025-11-04 12:50 ` Michal Hocko
2025-11-04 12:57 ` Vlastimil Babka
2025-11-04 16:43 ` Michal Hocko
2025-11-05 6:23 ` Baokun Li
2025-11-03 18:53 ` Shakeel Butt