Re: [GIT PULL] MM updates for 6.19-rc1

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Joshua Hahn <joshua.hahnjy@gmail.com>
To: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Daniel Palmer <daniel@0x0f.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	mm-commits@vger.kernel.org
Subject: Re: [GIT PULL] MM updates for 6.19-rc1
Date: Thu, 11 Dec 2025 17:53:30 -0800	[thread overview]
Message-ID: <20251212015330.1874521-1-joshua.hahnjy@gmail.com> (raw)
In-Reply-To: <20251211225947.822866-1-joshua.hahnjy@gmail.com>

On Thu, 11 Dec 2025 14:59:46 -0800 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> On Thu, 11 Dec 2025 20:12:18 +0900 Daniel Palmer <daniel@0x0f.com> wrote:
> 
> > Hi Andrew,
> > 
> > On Thu, 4 Dec 2025 at 14:29, Andrew Morton <akpm@linux-foundation.org> wrote:
> > >       mm/page_alloc: prevent reporting pcp->batch = 0
> > 
> > I think, maybe, the following part of this patch broke nommu.
> > 
> > -       new_batch = max(1, zone_batchsize(zone));
> > +       new_batch = zone_batchsize(zone);
> > 
> > Before this change on nommu zone_batchsize() returns 0 but the max()
> > changes it to 1. Now it'll stay as 0 and anywhere that depends on it
> > not being 0 won't work?
> 
> Hi Daniel,
> 
> Thank you for taking a look at this and finding that this was the source of
> the deadlock. I took a look, it's definitely an issue. The problem is that
> the patch gets rid of the max(1, zone_batchsize()) and handles the MMU case
> by ensuring zone_batchsize never returns a value less than 1, but the
> NOMMU case always returns 0.
> 
> I think your solution below works. I've also come up with a simler workaround
> which doesn't change drain_pages_zone:
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d0f026ec10b6..9d638697cec8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5919,7 +5919,7 @@ static int zone_batchsize(struct zone *zone)
>          * recycled, this leads to the once large chunks of space being
>          * fragmented and becoming unavailable for high-order allocations.
>          */
> -       return 0;
> +       return 1;
>  #endif
>  }
> 
> Would this be enough? Then we don't have to worry about handling zero values
> from the callsites for NOMMU machines as well. But this has the opposite problem
> that I was initially trying to fix, which is that NOMMU machines will now
> report a batchsize of 1 in zone_pcp_init and print it out to dmesg, which
> may be confusing for NOMMU users who expect there to be no batchsize. So
> it totally makes sense for me to drop my original patch completely as well.
> I'm not a NOMMU user so I am hoping to receive some feedback from folks who do
> who can chime in on which approach is better.
> 
> > I'm seeing a deadlock on nommu:
> > 
> > https://lore.kernel.org/lkml/20251211102607.2538595-1-daniel@thingy.jp/
> 
> I would also like to take this opportunity to ask any NOMMU experts out there
> about the apparent disagreement between the comment in zone_batchsize under the
> NOMMU case, which suggests that NOMMU is harmed by batched freeing:
> 
> 	/* The deferral and batching of frees should be suppressed under NOMMU
> 	 * conditions.
> 
> And returns 0 here which makes sense, only to artificially set it to 1 via
> the max() later on and still do batching anyways by
> << CONFIG_PCP_BATCH_SCALE_MAX.
> 
> Thank you Daniel again for helping root cause this. Hopefully this fix works
> to fix the deadlock you mentioned! Have a great day : -)
> Joshua

On further reflection there's actually an even simpler solution, which is to
just keep the max(1, zone_batchsize(zone)).

For zone_pcp_init, calling zone_batchsize() should:
- For MMU: never return 0
- For NOMMU: always return 0

For zone_set_pageset_high_and_batch:
- For MMU: never return 0
- For NOMMU: always return 1

And I think the solution should ideally not introduce more #ifdefs...
So what if we drop my patch (and the fixlet above) and replace it with the
following patch? Happy to send it as a separate patch if there are any
follow-ups : -)

---

zone_batchsize returns the appropriate value that should be used for
pcp->batch. If it finds a zone with less than 4096 pages or PAGE_SIZE >
1M, however, it leads to some incorrect math.

In the above case, we will get an intermediary value of 1, which is then
rounded down to the nearest power of two, and 1 is subtracted from it.
Since 1 is already a power of two, we will get batch = 1-1 = 0:

        batch = rounddown_pow_of_two(batch + batch/2) - 1;

A pcp->batch value of 0 is nonsensical, for MMU systems. If this were
actually set, then functions like drain_zone_pages would become no-ops,
since they would free 0 pages at a time.

Of the two callers of zone_batchsize, the one that is actually used to
set pcp->batch works around this by setting pcp->batch to the maximum
of 1 and zone_batchsize. However, the other caller, zone_pcp_init,
incorrectly prints out the batch size of the zone to be 0.

This is probably rare in a typical zone, but the DMA zone can often have
less than 4096 pages, which means it will print out "LIFO batch:0".

Before: [    0.001216]   DMA zone: 3998 pages, LIFO batch:0
After:  [    0.001210]   DMA zone: 3998 pages, LIFO batch:1

With all of this said, NOMMU differs in two ways. Semantically, it
should report that pcp->batch is 0. At the same time, it can never
really have a pcp->batch size of 0 since it will reach a deadlock in
pcp freeing functions. For this reason, zone_batchsize should still
report 0 for NOMMU, but zone_set_pageset_high_and_batch should still
interpret it as 1, meaning we cannot get rid of max(1, zone_batchsize())
in zone_set_pageset_high_and_batch.

Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f928d37eeb6a..95172f4610ff 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5888,7 +5888,7 @@ static int zone_batchsize(struct zone *zone)
         * and zone lock contention.
         */
        batch = min(zone_managed_pages(zone) >> 12, SZ_256K / PAGE_SIZE);
-       if (batch < 1)
+       if (batch <= 1)
                batch = 1;

        /*

base-commit: e0669bdbba170f6c1a7bf5763f72df3f58a4945c

     prev parent reply	other threads:[~2025-12-12  1:53 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-04  5:29 Andrew Morton
2025-12-05 21:56 ` Linus Torvalds
2025-12-06  2:54   ` Jason Gunthorpe
2025-12-06  3:43     ` Linus Torvalds
2025-12-05 22:14 ` pr-tracker-bot
2025-12-11 11:12 ` Daniel Palmer
2025-12-11 22:59   ` Joshua Hahn
2025-12-12  1:53     ` Joshua Hahn [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251212015330.1874521-1-joshua.hahnjy@gmail.com \
    --to=joshua.hahnjy@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=daniel@0x0f.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mm-commits@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox