From: Shakeel Butt <shakeel.butt@linux.dev>
To: Matthew Wilcox <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>, Michal Hocko <mhocko@suse.com>,
libaokun@huaweicloud.com, linux-mm@kvack.org,
akpm@linux-foundation.org, surenb@google.com,
jackmanb@google.com, hannes@cmpxchg.org, ziy@nvidia.com,
jack@suse.cz, yi.zhang@huawei.com, yangerkun@huawei.com,
libaokun1@huawei.com
Subject: Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS
Date: Fri, 31 Oct 2025 08:52:49 -0700 [thread overview]
Message-ID: <qfo7raavlfupjqwbgl2mnmfvmn3z5oslnxcehorib3xjdgf4yo@bp76xmxplre7> (raw)
In-Reply-To: <fwm7tynepbnpq3db7u2fqrzmeq55gs5472r7wbggxmndap3k26@ngwgdcy26vxx>
On Fri, Oct 31, 2025 at 08:35:50AM -0700, Shakeel Butt wrote:
> On Fri, Oct 31, 2025 at 02:26:56PM +0000, Matthew Wilcox wrote:
> > On Fri, Oct 31, 2025 at 11:12:16AM +0100, Vlastimil Babka wrote:
> > > On 10/31/25 08:25, Michal Hocko wrote:
> > > > On Fri 31-10-25 14:13:50, libaokun@huaweicloud.com wrote:
> > > >> From: Baokun Li <libaokun1@huawei.com>
> > > >>
> > > >> Filesystems use __GFP_NOFAIL to allocate block-sized folios for metadata
> > > >> reads at critical points, since they cannot afford to go read-only,
> > > >> shut down, or enter an inconsistent state due to memory pressure.
> > > >>
> > > >> Currently, attempting an allocation larger than order-1 with the
> > > >> __GFP_NOFAIL flag triggers a WARN_ON() in __alloc_pages_slowpath().
> > > >> However, filesystems supporting large block sizes (blocksize > PAGE_SIZE)
> > > >> can easily require allocations larger than order-1.
> > > >>
> > > >> As Matthew noted, if we have a filesystem with 64KiB sectors, there will
> > > >> be many clean folios in the page cache that are 64KiB or larger.
> > > >>
> > > >> Therefore, to avoid the warning when LBS is enabled, we relax this
> > > >> restriction to allow allocations up to BLK_MAX_BLOCK_SIZE. The current
> > > >> maximum supported logical block size is 64KiB, meaning the maximum order
> > > >> handled here is 4.
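> > > >>
> > > >> A minimal sketch of the idea (illustrative only; details may differ
> > > >> from the actual patch):
> > > >>
> > > >> 	if (gfp_mask & __GFP_NOFAIL) {
> > > >> 		...
> > > >> -		WARN_ON_ONCE_GFP(order > 1, gfp_mask);
> > > >> +		/* LBS metadata folios can be up to BLK_MAX_BLOCK_SIZE
> > > >> +		 * (64KiB today), i.e. order-4 with 4KiB base pages. */
> > > >> +		WARN_ON_ONCE_GFP(order > get_order(BLK_MAX_BLOCK_SIZE),
> > > >> +				 gfp_mask);
> > > >> 	}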
> > > >
> > > > Would using kvmalloc be an option instead of this?
> > >
> > > The thread under Link: suggests xfs has its own vmalloc callback. But it's
> > > not one of the 5 options listed, so it's a good question how difficult it
> > > would be to implement that for ext4 or in general.
> >
> > It's implicit in options 1-4. Today, the buffer cache is an alias into
> > the page cache. The page cache can only store folios. So to use
> > vmalloc, we either have to make folios discontiguous, stop the buffer
> > cache being an alias into the page cache, or stop ext4 from using the
> > buffer cache.
> >
> > > > This change doesn't really make much sense to me TBH. While the order=1
> > > > limit is rather arbitrary, it is an internal allocator constraint - i.e.
> > > > the order which the allocator can sustain for NOFAIL requests is directly
> > > > related to memory reclaim and internal allocator operation rather than
> > > > something as external as block size. If the allocator needs to support
> > > > 64kB NOFAIL requests because there is a strong demand for that, then fine,
> > > > and we can see whether this is feasible.
> >
> > Maybe Baokun's explanation for why this is unlikely to be a problem in
> > practice didn't make sense to you. Let me try again, perhaps being more
> > explicit about things which an fs developer would know but an MM person
> > might not realise.
> >
> > Hard drive manufacturers are absolutely gagging to ship drives with a
> > 64KiB sector size. Once they do, the minimum transfer size to/from a
> > device becomes 64KiB. That means the page cache will cache all files
> > (and fs metadata) from that drive in contiguous 64KiB chunks. That means
> > that when reclaim shakes the page cache, it's going to find a lot of
> > order-4 folios to free ... which means that the occasional GFP_NOFAIL
> > order-4 allocation is going to have no trouble finding order-4 pages to
> > satisfy the allocation.
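> > 
> > (Worked out: 64KiB / 4KiB base pages = 16 pages = 2^4, i.e.
> > get_order(SZ_64K) == 4 -- hence the order-4 folios.)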
> >
> > Now, the problem is the non-filesystems which may now take advantage of
> > this to write lazy code. It'd be nice if we had some token that said
> > "hey, I'm the page cache, I know what I'm doing, trust me if I'm doing a
> > NOFAIL high-order allocation, you can reclaim one I've already allocated
> > and everything will be fine". But I can't see a way to put that kind
> > of token into our interfaces.
>
> A new gfp flag should be easy enough. However "you can reclaim one I've
> already allocated" is not something current allocation & reclaim can
> take any action on. Maybe that is something we can add. In addition, the
> behavior change for costly orders needs more thought.
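> 
> As a purely hypothetical sketch (the flag name and bit value are
> invented here for illustration; no such flag exists today):
> 
> 	/* Hypothetical: caller is the page cache and guarantees clean
> 	 * folios of this order already exist, so a NOFAIL high-order
> 	 * request can be satisfied by reclaiming one of them. */
> 	#define __GFP_PAGECACHE_NOFAIL	((__force gfp_t)0x10000000u)
> 
> 	folio = folio_alloc(GFP_NOFS | __GFP_NOFAIL | __GFP_PAGECACHE_NOFAIL,
> 			    get_order(BLK_MAX_BLOCK_SIZE));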
>
After reading the background link, it seems like the actual allocation
will be NOFS + NOFAIL + higher_order. With NOFS, current reclaim cannot
really reclaim any file memory (page cache). However, with writeback now
gone from the reclaim path, I wonder whether we should allow reclaiming
clean file pages even in NOFS context (needs some digging).
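
The gate in question, heavily simplified from shrink_folio_list() (the
real code handles many more cases):

	if (folio_test_dirty(folio) || folio_test_writeback(folio)) {
		/* Writing back needs the FS, so NOFS has to skip these. */
		if (!may_enter_fs(folio, sc->gfp_mask))
			goto keep;
	}
	/* A clean file folio needs no FS involvement to be freed, so in
	 * principle it could be reclaimed even in NOFS context. */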
Thread overview: 20+ messages
2025-10-31 6:13 libaokun
2025-10-31 7:25 ` Michal Hocko
2025-10-31 10:12 ` Vlastimil Babka
2025-10-31 14:26 ` Matthew Wilcox
2025-10-31 15:35 ` Shakeel Butt
2025-10-31 15:52 ` Shakeel Butt [this message]
2025-10-31 15:54 ` Matthew Wilcox
2025-10-31 16:46 ` Shakeel Butt
2025-10-31 16:55 ` Matthew Wilcox
2025-11-03 2:45 ` Baokun Li
2025-11-03 7:55 ` Michal Hocko
2025-11-03 9:01 ` Vlastimil Babka
2025-11-03 9:25 ` Michal Hocko
2025-11-04 10:31 ` Michal Hocko
2025-11-04 12:32 ` Vlastimil Babka
2025-11-04 12:50 ` Michal Hocko
2025-11-04 12:57 ` Vlastimil Babka
2025-11-04 16:43 ` Michal Hocko
2025-11-05 6:23 ` Baokun Li
2025-11-03 18:53 ` Shakeel Butt