* [RFC] [PATCH] mm/page_alloc: pcp->batch tuning
From: Joshua Hahn @ 2025-10-06 14:54 UTC (permalink / raw)
To: Dave Hansen
Cc: Andrew Morton, Brendan Jackman, Johannes Weiner, Michal Hocko,
Suren Baghdasaryan, Vlastimil Babka, Zi Yan, linux-kernel,
linux-mm
Recently while working on another patch about batching
free_pcppages_bulk [1], I was curious why pcp->batch was always 63 on my
machine. This led me to zone_batchsize(), where I found this set of
lines to determine what the batch size should be for the host:
batch = min(zone_managed_pages(zone) >> 10, SZ_1M / PAGE_SIZE);
batch /= 4; /* We effectively *= 4 below */
if (batch < 1)
batch = 1;
All of this is good, except for the comment, which says "We effectively
*= 4 below". Nowhere else in zone_batchsize() is there a
corresponding multiplication by 4. Looking into the history of this, it
seems like Dave Hansen had also noticed this back in 2013 [2]. It turns
out there *used* to be a corresponding *= 4, which was later turned into
a *= 6 for use in pageset_setup_from_batch_size(), which no longer
exists.
This leaves us with a /= 4 with no corresponding *= 4 anywhere, which
leaves pcp->batch mistuned from the original intent when it was
introduced. This is made worse by the fact that pcp lists are generally
larger today than they were in 2013, meaning batch sizes should have
increased, not decreased.
While the obvious solution is to remove the /= 4 to restore the
original tuning heuristics, I think this discovery opens up a discussion
on what pcp->batch should be, and whether it is something that should
be dynamically tuned based on the system's usage, like pcp->high.
Naively removing the /= 4 also changes the tuning for the entire system,
so I am hesitant to simply remove it, even though I believe a larger
batch size (the new default batch size would be the number of pages it
takes to fill 1M) can be helpful at the general scale of machines
running today, as opposed to 12 years ago.
I've left this patch as an RFC to see what folks have to say about this
decision.
[1] https://lore.kernel.org/all/20251002204636.4016712-1-joshua.hahnjy@gmail.com/
[2] https://lore.kernel.org/linux-mm/20131015203547.8724C69C@viggo.jf.intel.com/
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
mm/page_alloc.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d1d037f97c5f..b4db0d09d145 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5815,7 +5815,6 @@ static int zone_batchsize(struct zone *zone)
* and zone lock contention.
*/
batch = min(zone_managed_pages(zone) >> 10, SZ_1M / PAGE_SIZE);
- batch /= 4; /* We effectively *= 4 below */
if (batch < 1)
batch = 1;
base-commit: 097a6c336d0080725c626fda118ecfec448acd0f
--
2.47.3
* Re: [RFC] [PATCH] mm/page_alloc: pcp->batch tuning
From: Dave Hansen @ 2025-10-08 15:34 UTC (permalink / raw)
To: Joshua Hahn
Cc: Andrew Morton, Brendan Jackman, Johannes Weiner, Michal Hocko,
Suren Baghdasaryan, Vlastimil Babka, Zi Yan, linux-kernel,
linux-mm, ying.huang
First of all, I do agree that the comment should go away or get fixed up.
But...
On 10/6/25 07:54, Joshua Hahn wrote:
> This leaves us with a /= 4 with no corresponding *= 4 anywhere, which
> leaves pcp->batch mistuned from the original intent when it was
> introduced. This is made worse by the fact that pcp lists are generally
> larger today than they were in 2013, meaning batch sizes should have
> increased, not decreased.
pcp->batch and pcp->high do very different things. pcp->high is a limit
on the amount of memory that can be tied up. pcp->batch balances
throughput with latency. I'm not sure I buy the idea that a higher
pcp->high means we should necessarily do larger batches.
So I dunno... If someone wanted to alter the initial batch size, they'd
ideally repeat some of Ying's experiments from: 52166607ecc9 ("mm:
restrict the pcp batch scale factor to avoid too long latency").
Better yet, just absorb the /=4 into the two existing batch assignments.
It will probably compile to exactly the same code and have no functional
changes and get rid of the comment.
Wouldn't this compile to the same thing?
batch = zone->managed_pages / 4096;
if (batch * PAGE_SIZE > 128 * 1024)
batch = (128 * 1024) / PAGE_SIZE;
* Re: [RFC] [PATCH] mm/page_alloc: pcp->batch tuning
From: Joshua Hahn @ 2025-10-08 19:36 UTC (permalink / raw)
To: Dave Hansen
Cc: Andrew Morton, Brendan Jackman, Johannes Weiner, Michal Hocko,
Suren Baghdasaryan, Vlastimil Babka, Zi Yan, linux-kernel,
linux-mm, ying.huang
On Wed, 8 Oct 2025 08:34:21 -0700 Dave Hansen <dave.hansen@intel.com> wrote:
Hello Dave, thank you for your feedback!
> First of all, I do agree that the comment should go away or get fixed up.
>
> But...
>
> On 10/6/25 07:54, Joshua Hahn wrote:
> > This leaves us with a /= 4 with no corresponding *= 4 anywhere, which
> > leaves pcp->batch mistuned from the original intent when it was
> > introduced. This is made worse by the fact that pcp lists are generally
> > larger today than they were in 2013, meaning batch sizes should have
> > increased, not decreased.
>
> pcp->batch and pcp->high do very different things. pcp->high is a limit
> on the amount of memory that can be tied up. pcp->batch balances
> throughput with latency. I'm not sure I buy the idea that a higher
> pcp->high means we should necessarily do larger batches.
I agree with your observation that a higher pcp->high doesn't mean we should
do larger batches. I think what I was trying to get at here was that if
pcp lists are bigger, some other values might want to scale.
For instance, in nr_pcp_free, pcp->batch is used to determine how many
pages should be left in the pcplist (and the rest be freed). Should this
value scale with a bigger pcp? (This is not a rhetorical question, I really
do want to understand what the implications are here).
Another thing that I would like to note is that pcp->high is actually at
least in part a function of pcp->batch. In decay_pcp_high, we set
pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX), ...)
So here, it seems like a higher batch value would actually lead to a much
lower pcp->high instead. This actually seems actively harmful to the system.
So I'll do a take two of this patch and take your advice below: instead
of getting rid of the /= 4, I'll just fold it in (or add a better
explanation as to why we do this). Another candidate place to do this
seems to be where we do the rounddown_pow_of_two.
> So I dunno... If someone wanted to alter the initial batch size, they'd
> ideally repeat some of Ying's experiments from: 52166607ecc9 ("mm:
> restrict the pcp batch scale factor to avoid too long latency").
I ran a few very naive and quick tests on kernel builds, and it seems like
for larger machines (1TB memory, 316 processors), this leads to a very
significant speedup in system time during a kernel compilation (~10%).
But for smaller machines (250G memory, 176 processors) and (62G memory and 36
processors), this leads to quite a regression (~5%).
So maybe the answer is that this should actually be defined by the machine's
size. In zone_batchsize, we set the value of the batch to:
min(zone_managed_pages(zone) >> 10, SZ_1M / PAGE_SIZE)
But maybe it makes sense to let this value grow bigger for larger machines? If
anything, I think that the experiment results above do show that batch size does
have an impact on the performance, and the effect can either be positive or
negative based on the machine's size. I can run some more experiments to
see if there's an opportunity to better tune pcp->batch.
> Better yet, just absorb the /=4 into the two existing batch assignments.
> It will probably compile to exactly the same code and have no functional
> changes and get rid of the comment.
>
> Wouldn't this compile to the same thing?
>
> batch = zone->managed_pages / 4096;
> if (batch * PAGE_SIZE > 128 * 1024)
> batch = (128 * 1024) / PAGE_SIZE;
But for now, this seems good to me. I'll get rid of the confusing
comment, try to fold in the batch value, and leave a new comment
explaining this.
Thank you for your thoughtful review, Dave. I hope you have a great day!
Joshua
* Re: [RFC] [PATCH] mm/page_alloc: pcp->batch tuning
From: Huang, Ying @ 2025-10-09 2:57 UTC (permalink / raw)
To: Joshua Hahn
Cc: Dave Hansen, Andrew Morton, Brendan Jackman, Johannes Weiner,
Michal Hocko, Suren Baghdasaryan, Vlastimil Babka, Zi Yan,
linux-kernel, linux-mm
Hi, Joshua,
Joshua Hahn <joshua.hahnjy@gmail.com> writes:
> On Wed, 8 Oct 2025 08:34:21 -0700 Dave Hansen <dave.hansen@intel.com> wrote:
>
> Hello Dave, thank you for your feedback!
>
>> First of all, I do agree that the comment should go away or get fixed up.
>>
>> But...
>>
>> On 10/6/25 07:54, Joshua Hahn wrote:
>> > This leaves us with a /= 4 with no corresponding *= 4 anywhere, which
>> > leaves pcp->batch mistuned from the original intent when it was
>> > introduced. This is made worse by the fact that pcp lists are generally
>> > larger today than they were in 2013, meaning batch sizes should have
>> > increased, not decreased.
>>
>> pcp->batch and pcp->high do very different things. pcp->high is a limit
>> on the amount of memory that can be tied up. pcp->batch balances
>> throughput with latency. I'm not sure I buy the idea that a higher
>> pcp->high means we should necessarily do larger batches.
>
> I agree with your observation that a higher pcp->high doesn't mean we should
> do larger batches. I think what I was trying to get at here was that if
> pcp lists are bigger, some other values might want to scale.
>
> For instance, in nr_pcp_free, pcp->batch is used to determine how many
> pages should be left in the pcplist (and the rest be freed). Should this
> value scale with a bigger pcp? (This is not a rhetorical question, I really
> do want to understand what the implications are here).
>
> Another thing that I would like to note is that pcp->high is actually at
> least in part a function of pcp->batch. In decay_pcp_high, we set
>
> pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX), ...)
>
> So here, it seems like a higher batch value would actually lead to a much
> lower pcp->high instead. This actually seems actively harmful to the system.
Batch here is used to control the latency of freeing pages from the PCP
to the buddy allocator. A larger batch leads to larger latency, but it
helps reduce the size of the PCP more quickly when it becomes idle. So,
we need to strike a balance here.
> So I'll do a take two of this patch and take your advice below: instead
> of getting rid of the /= 4, I'll just fold it in (or add a better
> explanation as to why we do this). Another candidate place to do this
> seems to be where we do the rounddown_pow_of_two.
>
>> So I dunno... If someone wanted to alter the initial batch size, they'd
>> ideally repeat some of Ying's experiments from: 52166607ecc9 ("mm:
>> restrict the pcp batch scale factor to avoid too long latency").
>
> I ran a few very naive and quick tests on kernel builds, and it seems like
> for larger machines (1TB memory, 316 processors), this leads to a very
> significant speedup in system time during a kernel compilation (~10%).
>
> But for smaller machines (250G memory, 176 processors) and (62G memory and 36
> processors), this leads to quite a regression (~5%).
>
> So maybe the answer is that this should actually be defined by the machine's
> size. In zone_batchsize, we set the value of the batch to:
>
> min(zone_managed_pages(zone) >> 10, SZ_1M / PAGE_SIZE)
>
> But maybe it makes sense to let this value grow bigger for larger machines? If
> anything, I think that the experiment results above do show that batch size does
> have an impact on the performance, and the effect can either be positive or
> negative based on the machine's size. I can run some more experiments to
> see if there's an opportunity to better tune pcp->batch.
In fact, we do have some mechanism to scale batch size dynamically
already, via pcp->alloc_factor and pcp->free_count.
You could further tune them. Per my understanding, it should be a
balance between throughput and latency.
>> Better yet, just absorb the /=4 into the two existing batch assignments.
>> It will probably compile to exactly the same code and have no functional
>> changes and get rid of the comment.
>>
>> Wouldn't this compile to the same thing?
>>
>> batch = zone->managed_pages / 4096;
>> if (batch * PAGE_SIZE > 128 * 1024)
>> batch = (128 * 1024) / PAGE_SIZE;
>
> But for now, this seems good to me. I'll get rid of the confusing
> comment, try to fold in the batch value, and leave a new comment
> explaining this.
>
> Thank you for your thoughtful review, Dave. I hope you have a great day!
> Joshua
---
Best Regards,
Huang, Ying
* Re: [RFC] [PATCH] mm/page_alloc: pcp->batch tuning
From: Joshua Hahn @ 2025-10-09 14:41 UTC (permalink / raw)
To: Huang, Ying
Cc: Dave Hansen, Andrew Morton, Brendan Jackman, Johannes Weiner,
Michal Hocko, Suren Baghdasaryan, Vlastimil Babka, Zi Yan,
linux-kernel, linux-mm
On Thu, 09 Oct 2025 10:57:05 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:
> Hi, Joshua,
>
> Joshua Hahn <joshua.hahnjy@gmail.com> writes:
>
> > On Wed, 8 Oct 2025 08:34:21 -0700 Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > Hello Dave, thank you for your feedback!
> >
> >> First of all, I do agree that the comment should go away or get fixed up.
> >>
> >> But...
> >>
> >> On 10/6/25 07:54, Joshua Hahn wrote:
> >> > This leaves us with a /= 4 with no corresponding *= 4 anywhere, which
> >> > leaves pcp->batch mistuned from the original intent when it was
> >> > introduced. This is made worse by the fact that pcp lists are generally
> >> > larger today than they were in 2013, meaning batch sizes should have
> >> > increased, not decreased.
> >>
> >> pcp->batch and pcp->high do very different things. pcp->high is a limit
> >> on the amount of memory that can be tied up. pcp->batch balances
> >> throughput with latency. I'm not sure I buy the idea that a higher
> >> pcp->high means we should necessarily do larger batches.
> >
> > I agree with your observation that a higher pcp->high doesn't mean we should
> > do larger batches. I think what I was trying to get at here was that if
> > pcp lists are bigger, some other values might want to scale.
> >
> > For instance, in nr_pcp_free, pcp->batch is used to determine how many
> > pages should be left in the pcplist (and the rest be freed). Should this
> > value scale with a bigger pcp? (This is not a rhetorical question, I really
> > do want to understand what the implications are here).
> >
> > Another thing that I would like to note is that pcp->high is actually at
> > least in part a function of pcp->batch. In decay_pcp_high, we set
> >
> > pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX), ...)
> >
> > So here, it seems like a higher batch value would actually lead to a much
> > lower pcp->high instead. This actually seems actively harmful to the system.
Hi Ying, thank you for your feedback, as always!
> Batch here is used to control the latency of freeing pages from the PCP
> to the buddy allocator. A larger batch leads to larger latency, but it
> helps reduce the size of the PCP more quickly when it becomes idle. So,
> we need to strike a balance here.
Yes, this makes sense to me. I think one thing that I overlooked when I
initially submitted this patch was that even though the pcp size may have
grown in recent times, the tolerance for the latency associated with
freeing it may not have.
> > So I'll do a take two of this patch and take your advice below: instead
> > of getting rid of the /= 4, I'll just fold it in (or add a better
> > explanation as to why we do this). Another candidate place to do this
> > seems to be where we do the rounddown_pow_of_two.
> >
> >> So I dunno... If someone wanted to alter the initial batch size, they'd
> >> ideally repeat some of Ying's experiments from: 52166607ecc9 ("mm:
> >> restrict the pcp batch scale factor to avoid too long latency").
> >
> > I ran a few very naive and quick tests on kernel builds, and it seems like
> > for larger machines (1TB memory, 316 processors), this leads to a very
> > significant speedup in system time during a kernel compilation (~10%).
> >
> > But for smaller machines (250G memory, 176 processors) and (62G memory and 36
> > processors), this leads to quite a regression (~5%).
> >
> > So maybe the answer is that this should actually be defined by the machine's
> > size. In zone_batchsize, we set the value of the batch to:
> >
> > min(zone_managed_pages(zone) >> 10, SZ_1M / PAGE_SIZE)
> >
> > But maybe it makes sense to let this value grow bigger for larger machines? If
> > anything, I think that the experiment results above do show that batch size does
> > have an impact on the performance, and the effect can either be positive or
> > negative based on the machine's size. I can run some more experiments to
> > see if there's an opportunity to better tune pcp->batch.
>
> In fact, we do have some mechanism to scale batch size dynamically
> already, via pcp->alloc_factor and pcp->free_count.
>
> You could further tune them. Per my understanding, it should be a
> balance between throughput and latency.
Sounds good to me! I can try to do some tuning to change alloc_factor
and free_count, or see how they currently behave in the system, to see
whether they already provide a good balance of throughput and latency.
> >> Better yet, just absorb the /=4 into the two existing batch assignments.
> >> It will probably compile to exactly the same code and have no functional
> >> changes and get rid of the comment.
> >>
> >> Wouldn't this compile to the same thing?
> >>
> >> batch = zone->managed_pages / 4096;
> >> if (batch * PAGE_SIZE > 128 * 1024)
> >> batch = (128 * 1024) / PAGE_SIZE;
> >
> > But for now, this seems good to me. I'll get rid of the confusing
> > comment, try to fold in the batch value, and leave a new comment
> > explaining this.
> >
> > Thank you for your thoughtful review, Dave. I hope you have a great day!
> > Joshua
>
> ---
> Best Regards,
> Huang, Ying
Thank you, Ying. For now, I'll just submit a new version of this patch
that doesn't drop the /= 4, but folds it into the lines below so that
there is no more confusion about the comment.
I hope you have a great day!
Joshua