From: Joshua Hahn <joshua.hahnjy@gmail.com>
To: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Daniel Palmer <daniel@0x0f.com>,
Andrew Morton <akpm@linux-foundation.org>,
Linus Torvalds <torvalds@linux-foundation.org>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
mm-commits@vger.kernel.org
Subject: Re: [GIT PULL] MM updates for 6.19-rc1
Date: Thu, 11 Dec 2025 17:53:30 -0800 [thread overview]
Message-ID: <20251212015330.1874521-1-joshua.hahnjy@gmail.com> (raw)
In-Reply-To: <20251211225947.822866-1-joshua.hahnjy@gmail.com>
On Thu, 11 Dec 2025 14:59:46 -0800 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> On Thu, 11 Dec 2025 20:12:18 +0900 Daniel Palmer <daniel@0x0f.com> wrote:
>
> > Hi Andrew,
> >
> > On Thu, 4 Dec 2025 at 14:29, Andrew Morton <akpm@linux-foundation.org> wrote:
> > > mm/page_alloc: prevent reporting pcp->batch = 0
> >
> > I think, maybe, the following part of this patch broke nommu.
> >
> > - new_batch = max(1, zone_batchsize(zone));
> > + new_batch = zone_batchsize(zone);
> >
> > Before this change on nommu zone_batchsize() returns 0 but the max()
> > changes it to 1. Now it'll stay as 0 and anywhere that depends on it
> > not being 0 won't work?
>
> Hi Daniel,
>
> Thank you for taking a look at this and finding that this was the source of
> the deadlock. I took a look, it's definitely an issue. The problem is that
> the patch gets rid of the max(1, zone_batchsize()) and handles the MMU case
> by ensuring zone_batchsize never returns a value less than 1, but the
> NOMMU case always returns 0.
>
> I think your solution below works. I've also come up with a simler workaround
> which doesn't change drain_pages_zone:
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d0f026ec10b6..9d638697cec8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5919,7 +5919,7 @@ static int zone_batchsize(struct zone *zone)
> * recycled, this leads to the once large chunks of space being
> * fragmented and becoming unavailable for high-order allocations.
> */
> - return 0;
> + return 1;
> #endif
> }
>
> Would this be enough? Then we don't have to worry about handling zero values
> from the callsites for NOMMU machines as well. But this has the opposite problem
> that I was initially trying to fix, which is that NOMMU machines will now
> report a batchsize of 1 in zone_pcp_init and print it out to dmesg, which
> may be confusing for NOMMU users who expect there to be no batchsize. So
> it totally makes sense for me to drop my original patch completely as well.
> I'm not a NOMMU user so I am hoping to receive some feedback from folks who do
> who can chime in on which approach is better.
>
> > I'm seeing a deadlock on nommu:
> >
> > https://lore.kernel.org/lkml/20251211102607.2538595-1-daniel@thingy.jp/
>
> I would also like to take this opportunity to ask any NOMMU experts out there
> about the apparent disagreement between the comment in zone_batchsize under the
> NOMMU case, which suggests that NOMMU is harmed by batched freeing:
>
> /* The deferral and batching of frees should be suppressed under NOMMU
> * conditions.
>
> And returns 0 here which makes sense, only to artificially set it to 1 via
> the max() later on and still do batching anyways by
> << CONFIG_PCP_BATCH_SCALE_MAX.
>
> Thank you Daniel again for helping root cause this. Hopefully this fix works
> to fix the deadlock you mentioned! Have a great day : -)
> Joshua
On further reflection there's actually an even simpler solution, which is to
just keep the max(1, zone_batchsize(zone)).
For zone_pcp_init, calling zone_batchsize() should:
- For MMU: never return 0
- For NOMMU: always return 0
For zone_set_pageset_high_and_batch:
- For MMU: never return 0
- For NOMMU: always return 1
And I think the solution should ideally not introduce more #ifdefs...
So what if we drop my patch (and the fixlet above) and replace it with the
following patch? Happy to send it as a separate patch if there are any
follow-ups : -)
---
zone_batchsize returns the appropriate value that should be used for
pcp->batch. If it finds a zone with less than 4096 pages or PAGE_SIZE >
1M, however, it leads to some incorrect math.
In the above case, we will get an intermediary value of 1, which is then
rounded down to the nearest power of two, and 1 is subtracted from it.
Since 1 is already a power of two, we will get batch = 1-1 = 0:
batch = rounddown_pow_of_two(batch + batch/2) - 1;
A pcp->batch value of 0 is nonsensical, for MMU systems. If this were
actually set, then functions like drain_zone_pages would become no-ops,
since they would free 0 pages at a time.
Of the two callers of zone_batchsize, the one that is actually used to
set pcp->batch works around this by setting pcp->batch to the maximum
of 1 and zone_batchsize. However, the other caller, zone_pcp_init,
incorrectly prints out the batch size of the zone to be 0.
This is probably rare in a typical zone, but the DMA zone can often have
less than 4096 pages, which means it will print out "LIFO batch:0".
Before: [ 0.001216] DMA zone: 3998 pages, LIFO batch:0
After: [ 0.001210] DMA zone: 3998 pages, LIFO batch:1
With all of this said, NOMMU differs in two ways. Semantically, it
should report that pcp->batch is 0. At the same time, it can never
really have a pcp->batch size of 0 since it will reach a deadlock in
pcp freeing functions. For this reason, zone_batchsize should still
report 0 for NOMMU, but zone_set_pageset_high_and_batch should still
interpret it as 1, meaning we cannot get rid of max(1, zone_batchsize())
in zone_set_pageset_high_and_batch.
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
mm/page_alloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f928d37eeb6a..95172f4610ff 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5888,7 +5888,7 @@ static int zone_batchsize(struct zone *zone)
* and zone lock contention.
*/
batch = min(zone_managed_pages(zone) >> 12, SZ_256K / PAGE_SIZE);
- if (batch < 1)
+ if (batch <= 1)
batch = 1;
/*
base-commit: e0669bdbba170f6c1a7bf5763f72df3f58a4945c
prev parent reply other threads:[~2025-12-12 1:53 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-12-04 5:29 Andrew Morton
2025-12-05 21:56 ` Linus Torvalds
2025-12-06 2:54 ` Jason Gunthorpe
2025-12-06 3:43 ` Linus Torvalds
2025-12-05 22:14 ` pr-tracker-bot
2025-12-11 11:12 ` Daniel Palmer
2025-12-11 22:59 ` Joshua Hahn
2025-12-12 1:53 ` Joshua Hahn [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20251212015330.1874521-1-joshua.hahnjy@gmail.com \
--to=joshua.hahnjy@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=daniel@0x0f.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mm-commits@vger.kernel.org \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox