Re: [PATCH 22/25] fs/buffer: prevent WARN_ON in __alloc_pages_slowpath() when BS > PS

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Baokun Li <libaokun1@huawei.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: "Darrick J. Wong" <djwong@kernel.org>,
	<linux-ext4@vger.kernel.org>, <tytso@mit.edu>,
	<adilger.kernel@dilger.ca>, <jack@suse.cz>,
	<linux-kernel@vger.kernel.org>, <kernel@pankajraghav.com>,
	<mcgrof@kernel.org>, <linux-fsdevel@vger.kernel.org>,
	<linux-mm@kvack.org>, <yi.zhang@huawei.com>,
	<yangerkun@huawei.com>, <chengzhihao1@huawei.com>,
	<catherine.hoang@oracle.com>,
	Baokun Li <libaokun@huaweicloud.com>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH 22/25] fs/buffer: prevent WARN_ON in __alloc_pages_slowpath() when BS > PS
Date: Mon, 27 Oct 2025 10:57:15 +0800	[thread overview]
Message-ID: <2d5ee2b9-e348-4d4e-a514-6c698f19f7e5@huawei.com> (raw)
In-Reply-To: <aP0PachXS8Qxjo9Q@casper.infradead.org>

On 2025-10-26 01:56, Matthew Wilcox wrote:
> On Sat, Oct 25, 2025 at 02:32:45PM +0800, Baokun Li wrote:
>> On 2025-10-25 12:45, Matthew Wilcox wrote:
>>> On Sat, Oct 25, 2025 at 11:22:18AM +0800, libaokun@huaweicloud.com wrote:
>>>> +	while (1) {
>>>> +		folio = __filemap_get_folio(mapping, index, fgp_flags,
>>>> +					    gfp & ~__GFP_NOFAIL);
>>>> +		if (!IS_ERR(folio) || !(gfp & __GFP_NOFAIL))
>>>> +			return folio;
>>>> +
>>>> +		if (PTR_ERR(folio) != -ENOMEM && PTR_ERR(folio) != -EAGAIN)
>>>> +			return folio;
>>>> +
>>>> +		memalloc_retry_wait(gfp);
>>>> +	}
>>> No, absolutely not.  We're not having open-coded GFP_NOFAIL semantics.
>>> The right way forward is for ext4 to use iomap, not for buffer heads
>>> to support large block sizes.
>> ext4 only calls getblk_unmovable or __getblk when reading critical
>> metadata. Both of these functions set __GFP_NOFAIL to ensure that
>> metadata reads do not fail due to memory pressure.
> If filesystems actually require __GFP_NOFAIL for high-order allocations,
> then this is a new requirement that needs to be communicated to the MM
> developers, not hacked around in filesystems (or the VFS).  And that
> communication needs to be a separate thread with a clear subject line
> to attract the right attention, not buried in patch 26/28.

EXT4 is not the first filesystem to support LBS. I believe other
filesystems that already support LBS, even if they manage their own
metadata, have similar requirements. A filesystem cannot afford to become
read-only, shut down, or enter an inconsistent state due to memory
allocation failures in critical paths. Large folios have been around for
some time, and the fact that this warning still exists shows that the
problem is not trivial to solve.

Therefore, following the approach of filesystems that already support LBS,
such as XFS and the soon-to-be-removed bcachefs, I avoid adding
__GFP_NOFAIL for large allocations and instead retry internally to prevent
failures.

I do not intend to hide this issue in Patch 22/25. I cc’d linux-mm@kvack.org
precisely to invite memory management experts to share their thoughts on
the current situation.

Here is my limited understanding of the history of __GFP_NOFAIL:

Originally, in commit 4923abf9f1a4 ("Don't warn about order-1 allocations
with __GFP_NOFAIL"), Linus Torvalds raised the warning order from 0 to 1,
and commented,
    "Maybe we should remove this warning entirely."

We had considered removing this warning, but then saw the discussion below.

Previously we used WARN_ON_ONCE_GFP, which meant the warning could be
suppressed with __GFP_NOWARN. But with the introduction of large folios,
memory allocation and reclaim have become much more challenging.
__GFP_NOFAIL can still fail, and many callers do not check the return
value, leading to potential NULL pointer dereferences.

Linus also noted that __GFP_NOFAIL is heavily abused, and even said in [1]:
“Honestly, I'm perfectly fine with just removing that stupid useless flag
 entirely.”
"Because the blame should go *there*, and it should not even remotely look
 like "oh, the MM code failed". No. The caller was garbage."

[1]:
https://lore.kernel.org/linux-mm/CAHk-=wgv2-=Bm16Gtn5XHWj9J6xiqriV56yamU+iG07YrN28SQ@mail.gmail.com/

From this, my understanding is that handling or retrying large allocation
failures in the caller is the direction going forward.

As for why retries are done in the VFS, there are two reasons: first, both
ext4 and jbd2 read metadata through blkdev, so a unified change is simpler.
Second, retrying here allows other buffer-head-based filesystems to support
LBS more easily.

For now, until large memory allocation and reclaim are properly handled,
this approach serves as a practical workaround.

> For what it's worth, I think you have a good case.  This really is
> a new requirement (bs>PS) and in this scenario, we should be able to
> reclaim page cache memory of the appropriate order to satisfy the NOFAIL
> requirement.  There will be concerns that other users will now be able to
> use it without warning, but I think eventually this use case will prevail.
Yeah, it would be best if the memory subsystem could add a flag like
__GFP_LBS to suppress these warnings and guide allocation and reclaim to
perform optimizations suited for this scenario.
>> Both functions eventually call grow_dev_folio(), which is why we
>> handle the __GFP_NOFAIL logic there. xfs_buf_alloc_backing_mem()
>> has similar logic, but XFS manages its own metadata, allowing it
>> to use vmalloc for memory allocation.
> The other possibility is that we switch ext4 away from the buffer cache
> entirely.  This is a big job!  I know Catherine has been working on
> a generic replacement for the buffer cache, but I'm not sure if it's
> ready yet.
>
The key issue is not whether ext4 uses buffer heads; even using vmalloc
with __GFP_NOFAIL for large allocations faces the same problem. 

As Linus also mentioned in the link[1] above:  
"It has then expanded and is now a problem. The cases using GFP_NOFAIL
 for things like vmalloc() - which is by definition not a small
 allocation - should be just removed as outright bugs."

Thanks,
Baokun

next prev parent reply	other threads:[~2025-10-27  2:57 UTC|newest]

Thread overview: 68+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-25  3:21 [PATCH 00/25] ext4: enable block size larger than page size libaokun
2025-10-25  3:21 ` [PATCH 01/25] ext4: remove page offset calculation in ext4_block_zero_page_range() libaokun
2025-11-03  7:41   ` Jan Kara
2025-10-25  3:21 ` [PATCH 02/25] ext4: remove page offset calculation in ext4_block_truncate_page() libaokun
2025-11-03  7:42   ` Jan Kara
2025-10-25  3:21 ` [PATCH 03/25] ext4: remove PAGE_SIZE checks for rec_len conversion libaokun
2025-11-03  7:43   ` Jan Kara
2025-10-25  3:22 ` [PATCH 04/25] ext4: make ext4_punch_hole() support large block size libaokun
2025-11-03  8:05   ` Jan Kara
2025-11-04  6:55     ` Baokun Li
2025-10-25  3:22 ` [PATCH 05/25] ext4: enable DIOREAD_NOLOCK by default for BS > PS as well libaokun
2025-11-03  8:06   ` Jan Kara
2025-10-25  3:22 ` [PATCH 06/25] ext4: introduce s_min_folio_order for future BS > PS support libaokun
2025-11-03  8:19   ` Jan Kara
2025-10-25  3:22 ` [PATCH 07/25] ext4: support large block size in ext4_calculate_overhead() libaokun
2025-11-03  8:14   ` Jan Kara
2025-11-03 14:37     ` Baokun Li
2025-10-25  3:22 ` [PATCH 08/25] ext4: support large block size in ext4_readdir() libaokun
2025-11-03  8:27   ` Jan Kara
2025-10-25  3:22 ` [PATCH 09/25] ext4: add EXT4_LBLK_TO_B macro for logical block to bytes conversion libaokun
2025-11-03  8:21   ` Jan Kara
2025-10-25  3:22 ` [PATCH 10/25] ext4: add EXT4_LBLK_TO_P and EXT4_P_TO_LBLK for block/page conversion libaokun
2025-11-03  8:26   ` Jan Kara
2025-11-03 14:45     ` Baokun Li
2025-11-05  8:27       ` Jan Kara
2025-10-25  3:22 ` [PATCH 11/25] ext4: support large block size in ext4_mb_load_buddy_gfp() libaokun
2025-11-05  8:46   ` Jan Kara
2025-10-25  3:22 ` [PATCH 12/25] ext4: support large block size in ext4_mb_get_buddy_page_lock() libaokun
2025-11-05  9:13   ` Jan Kara
2025-11-05  9:44     ` Baokun Li
2025-10-25  3:22 ` [PATCH 13/25] ext4: support large block size in ext4_mb_init_cache() libaokun
2025-11-05  9:18   ` Jan Kara
2025-10-25  3:22 ` [PATCH 14/25] ext4: prepare buddy cache inode for BS > PS with large folios libaokun
2025-11-05  9:19   ` Jan Kara
2025-10-25  3:22 ` [PATCH 15/25] ext4: rename 'page' references to 'folio' in multi-block allocator libaokun
2025-11-05  9:21   ` Jan Kara
2025-10-25  3:22 ` [PATCH 16/25] ext4: support large block size in ext4_mpage_readpages() libaokun
2025-11-05  9:26   ` Jan Kara
2025-10-25  3:22 ` [PATCH 17/25] ext4: support large block size in ext4_block_write_begin() libaokun
2025-11-05  9:28   ` Jan Kara
2025-10-25  3:22 ` [PATCH 18/25] ext4: support large block size in mpage_map_and_submit_buffers() libaokun
2025-11-05  9:30   ` Jan Kara
2025-10-25  3:22 ` [PATCH 19/25] ext4: support large block size in mpage_prepare_extent_to_map() libaokun
2025-11-05  9:31   ` Jan Kara
2025-10-25  3:22 ` [PATCH 20/25] ext4: support large block size in __ext4_block_zero_page_range() libaokun
2025-11-05  9:33   ` Jan Kara
2025-10-25  3:22 ` [PATCH 21/25] ext4: make online defragmentation support large block size libaokun
2025-11-05  9:50   ` Jan Kara
2025-11-05 10:48     ` Zhang Yi
2025-11-05 11:28     ` Baokun Li
2025-10-25  3:22 ` [PATCH 22/25] fs/buffer: prevent WARN_ON in __alloc_pages_slowpath() when BS > PS libaokun
2025-10-25  4:45   ` Matthew Wilcox
2025-10-25  5:13     ` Darrick J. Wong
2025-10-25  6:32     ` Baokun Li
2025-10-25  7:01       ` Zhang Yi
2025-10-25 17:56       ` Matthew Wilcox
2025-10-27  2:57         ` Baokun Li [this message]
2025-10-27  7:40         ` Christoph Hellwig
2025-10-30 21:25       ` Matthew Wilcox
2025-10-31  1:47         ` Zhang Yi
2025-10-31  1:55         ` Baokun Li
2025-10-25  6:34     ` Baokun Li
2025-10-25  3:22 ` [PATCH 23/25] jbd2: " libaokun
2025-10-25  3:22 ` [PATCH 24/25] ext4: add checks for large folio incompatibilities " libaokun
2025-11-05  9:59   ` Jan Kara
2025-10-25  3:22 ` [PATCH 25/25] ext4: enable block size larger than page size libaokun
2025-11-05 10:14   ` Jan Kara
2025-11-06  2:44     ` Baokun Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2d5ee2b9-e348-4d4e-a514-6c698f19f7e5@huawei.com \
    --to=libaokun1@huawei.com \
    --cc=adilger.kernel@dilger.ca \
    --cc=catherine.hoang@oracle.com \
    --cc=chengzhihao1@huawei.com \
    --cc=djwong@kernel.org \
    --cc=jack@suse.cz \
    --cc=kernel@pankajraghav.com \
    --cc=libaokun@huaweicloud.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mcgrof@kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=tytso@mit.edu \
    --cc=willy@infradead.org \
    --cc=yangerkun@huawei.com \
    --cc=yi.zhang@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox