Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS

From: Shakeel Butt <shakeel.butt@linux.dev>
To: Matthew Wilcox <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>, Michal Hocko <mhocko@suse.com>,
	 libaokun@huaweicloud.com, linux-mm@kvack.org,
	akpm@linux-foundation.org, surenb@google.com,
	 jackmanb@google.com, hannes@cmpxchg.org, ziy@nvidia.com,
	jack@suse.cz,  yi.zhang@huawei.com, yangerkun@huawei.com,
	libaokun1@huawei.com
Subject: Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS
Date: Fri, 31 Oct 2025 08:35:50 -0700
Message-ID: <fwm7tynepbnpq3db7u2fqrzmeq55gs5472r7wbggxmndap3k26@ngwgdcy26vxx>
In-Reply-To: <aQTHMI3t5mNXp0M1@casper.infradead.org>

On Fri, Oct 31, 2025 at 02:26:56PM +0000, Matthew Wilcox wrote:
> On Fri, Oct 31, 2025 at 11:12:16AM +0100, Vlastimil Babka wrote:
> > On 10/31/25 08:25, Michal Hocko wrote:
> > > On Fri 31-10-25 14:13:50, libaokun@huaweicloud.com wrote:
> > >> From: Baokun Li <libaokun1@huawei.com>
> > >> 
> > >> Filesystems use __GFP_NOFAIL to allocate block-sized folios for metadata
> > >> reads at critical points, since they cannot afford to go read-only,
> > >> shut down, or enter an inconsistent state due to memory pressure.
> > >> 
> > >> Currently, attempting to allocate page units greater than order-1 with
> > >> the __GFP_NOFAIL flag triggers a WARN_ON() in __alloc_pages_slowpath().
> > >> However, filesystems supporting large block sizes (blocksize > PAGE_SIZE)
> > >> can easily require allocations larger than order-1.
> > >> 
> > >> As Matthew noted, if we have a filesystem with 64KiB sectors, there will
> > >> be many clean folios in the page cache that are 64KiB or larger.
> > >> 
> > >> Therefore, to avoid the warning when LBS is enabled, we relax this
> > >> restriction to allow allocations up to BLK_MAX_BLOCK_SIZE. The current
> > >> maximum supported logical block size is 64KiB, meaning the maximum order
> > >> handled here is 4.
> > > 
> > > Would using kvmalloc be an option instead of this?
> > 
> > The thread under Link: suggests xfs has its own vmalloc callback. But it's
> > not one of the 5 options listed, so it's a good question how difficult it
> > would be to implement that for ext4 or in general.
> 
> It's implicit in options 1-4.  Today, the buffer cache is an alias into
> the page cache.  The page cache can only store folios.  So to use
> vmalloc, we either have to make folios discontiguous, stop the buffer
> cache being an alias into the page cache, or stop ext4 from using the
> buffer cache.
> 
> > > This change doesn't really make much sense to me TBH. While the order=1
> > > is rather arbitrary it is an internal allocator constraint - i.e. order which
> > > the allocator can sustain for NOFAIL requests is directly related to
> > > memory reclaim and internal allocator operation rather than something as
> > > external as block size. If the allocator needs to support 64kB NOFAIL
> > > requests because there is a strong demand for that then fine and we can
> > > see whether this is feasible.
> 
> Maybe Baokun's explanation for why this is unlikely to be a problem in
> practice didn't make sense to you.  Let me try again, perhaps being more
> explicit about things which an fs developer would know but an MM person
> might not realise.
> 
> Hard drive manufacturers are absolutely gagging to ship drives with a
> 64KiB sector size.  Once they do, the minimum transfer size to/from a
> device becomes 64KiB.  That means the page cache will cache all files
> (and fs metadata) from that drive in contiguous 64KiB chunks.  That means
> that when reclaim shakes the page cache, it's going to find a lot of
> order-4 folios to free ... which means that the occasional GFP_NOFAIL
> order-4 allocation is going to have no trouble finding order-4 pages to
> satisfy the allocation.
> 
> Now, the problem is the non-filesystems which may now take advantage of
> this to write lazy code.  It'd be nice if we had some token that said
> "hey, I'm the page cache, I know what I'm doing, trust me if I'm doing a
> NOFAIL high-order allocation, you can reclaim one I've already allocated
> and everything will be fine".  But I can't see a way to put that kind
> of token into our interfaces.

A new gfp flag should be easy enough. However "you can reclaim one I've
already allocated" is not something current allocation & reclaim can
take any action on. Maybe that is something we can add. In addition the
behavior change of costly order needs more thought.
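
For reference, here is a minimal userspace sketch of the order arithmetic
discussed in this thread. It assumes 4KiB pages, which is what makes a 64KiB
block an order-4 request. The helper names (size_to_order,
nofail_order_warns, is_costly_order) are hypothetical and are not kernel
APIs. COSTLY_ORDER mirrors the kernel's PAGE_ALLOC_COSTLY_ORDER value of 3,
and the NOFAIL limit is passed in as a parameter rather than taken from the
allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the value of the kernel's PAGE_ALLOC_COSTLY_ORDER. */
#define COSTLY_ORDER 3u

/*
 * Userspace model of get_order(): the smallest order such that
 * (page_size << order) >= size. page_size must be a power of two.
 */
static unsigned int size_to_order(size_t size, size_t page_size)
{
	unsigned int order = 0;

	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	while ((page_size << order) < size)
		order++;
	return order;
}

/*
 * Hypothetical predicate modelling the check described in the patch:
 * a __GFP_NOFAIL request warns when its order exceeds the tolerated
 * maximum (1 today, 4 with the proposed change on 4KiB pages).
 */
static bool nofail_order_warns(unsigned int order, unsigned int max_nofail_order)
{
	return order > max_nofail_order;
}

/* Orders above COSTLY_ORDER are what the allocator treats as costly. */
static bool is_costly_order(unsigned int order)
{
	return order > COSTLY_ORDER;
}
```

With these assumptions, size_to_order(65536, 4096) is 4. That order warns
under the current limit of 1, would not warn under a limit of 4, and is
above the costly-order threshold, which is the behavior change mentioned
above.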

