linux-mm.kvack.org archive mirror
From: Joshua Hahn <joshua.hahnjy@gmail.com>
To: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Dave Hansen <dave.hansen@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Brendan Jackman <jackmanb@google.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@suse.com>,
	Suren Baghdasaryan <surenb@google.com>,
	Vlastimil Babka <vbabka@suse.cz>, Zi Yan <ziy@nvidia.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [RFC] [PATCH] mm/page_alloc: pcp->batch tuning
Date: Thu,  9 Oct 2025 07:41:51 -0700	[thread overview]
Message-ID: <20251009144152.909709-1-joshua.hahnjy@gmail.com> (raw)
In-Reply-To: <87ms60wzni.fsf@DESKTOP-5N7EMDA>

On Thu, 09 Oct 2025 10:57:05 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:

> Hi, Joshua,
> 
> Joshua Hahn <joshua.hahnjy@gmail.com> writes:
> 
> > On Wed, 8 Oct 2025 08:34:21 -0700 Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > Hello Dave, thank you for your feedback!
> >
> >> First of all, I do agree that the comment should go away or get fixed up.
> >> 
> >> But...
> >> 
> >> On 10/6/25 07:54, Joshua Hahn wrote:
> >> > This leaves us with a /= 4 with no corresponding *= 4 anywhere, which
> >> > leaves pcp->batch mistuned from the original intent when it was
> >> > introduced. This is made worse by the fact that pcp lists are generally
> >> > larger today than they were in 2013, meaning batch sizes should have
> >> > increased, not decreased.
> >> 
> >> pcp->batch and pcp->high do very different things. pcp->high is a limit
> >> on the amount of memory that can be tied up. pcp->batch balances
> >> throughput with latency. I'm not sure I buy the idea that a higher
> >> pcp->high means we should necessarily do larger batches.
> >
> > I agree with your observation that a higher pcp->high doesn't mean we should
> > do larger batches. I think what I was trying to get at here was that if
> > pcp lists are bigger, some other values might want to scale.
> >
> > For instance, in nr_pcp_free, pcp->batch is used to determine how many
> > pages should be left in the pcplist (and the rest be freed). Should this
> > value scale with a bigger pcp? (This is not a rhetorical question, I really
> > do want to understand what the implications are here).
> >
> > Another thing that I would like to note is that pcp->high is actually at
> > least in part a function of pcp->batch. In decay_pcp_high, we set
> >
> > pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX), ...)
> >
> > So here, it seems like a higher batch value would actually lead to a much
> > lower pcp->high instead. This actually seems actively harmful to the system.

Hi Ying, thank you for your feedback, as always!

> Batch here is used to control the latency to free the pages from PCP to
> buddy.  Larger batch will lead to larger latency, however it helps to
> reduce the size of PCP more quickly when it becomes idle.  So, we need
> to balance here.

Yes, this makes sense to me. One thing I overlooked when I initially
submitted this patch is that even though pcp lists may have grown in recent
times, the tolerance for the latency of freeing them has not grown with them.

> > So I'll do a take two of this patch and take your advice below and instead
> > of getting rid of the /= 4, just fold it in (or add a better explanation)
> > as to why we do this. Another candidate place to do this seems to be
> > where we do the rounddown_pow_of_two.
> >
> >> So I dunno... If someone wanted to alter the initial batch size, they'd
> >> ideally repeat some of Ying's experiments from: 52166607ecc9 ("mm:
> >> restrict the pcp batch scale factor to avoid too long latency").
> >
> > I ran a few very naive and quick tests on kernel builds, and it seems like
> > for larger machines (1TB memory, 316 processors), this leads to a very
> > significant speedup in system time during a kernel compilation (~10%).
> >
> > But for smaller machines (250G memory, 176 processors) and (62G memory and 36
> > processors), this leads to quite a regression (~5%).
> >
> > So maybe the answer is that this should actually be defined by the machine's
> > size. In zone_batchsize, we set the value of the batch to: 
> >
> > min(zone_managed_pages(zone) >> 10, SZ_1M / PAGE_SIZE)
> >
> > But maybe it makes sense to let this value grow bigger for larger machines? If
> > anything, I think that the experiment results above do show that batch size does
> > have an impact on the performance, and the effect can either be positive or
> > negative based on the machine's size. I can run some more experiments to 
> > see if there's an opportunity to better tune pcp->batch.
> 
> In fact, we do have some mechanism to scale batch size dynamically
> already, via pcp->alloc_factor and pcp->free_count.
> 
> You could further tune them.  Per my understanding, it should be a
> balance between throughput and latency.

Sounds good to me! I can try some tuning of alloc_factor and free_count, or
first observe how they currently behave in the system to see whether they
already provide a good balance of throughput and latency.

> >> Better yet, just absorb the /=4 into the two existing batch assignments.
> >> It will probably compile to exactly the same code and have no functional
> >> changes and get rid of the comment.
> >> 
> >> Wouldn't this compile to the same thing?
> >> 
> >>         batch = zone->managed_pages / 4096;
> >>         if (batch * PAGE_SIZE > 128 * 1024)
> >>                 batch = (128 * 1024) / PAGE_SIZE;
> >
> > But for now, this seems good to me. I'll get rid of the confusing comment,
> > and try to fold in the batch value and leave a new comment leaving this
> > as an explanation. 
> >
> > Thank you for your thoughtful review, Dave. I hope you have a great day!
> > Joshua
> 
> ---
> Best Regards,
> Huang, Ying

Thank you, Ying. For now, I'll just submit a new version of this patch that
doesn't drop the /= 4, but instead folds it into the lines below so that
there is no more confusion about the comment.

I hope you have a great day!
Joshua


Thread overview: 5+ messages
2025-10-06 14:54 Joshua Hahn
2025-10-08 15:34 ` Dave Hansen
2025-10-08 19:36   ` Joshua Hahn
2025-10-09  2:57     ` Huang, Ying
2025-10-09 14:41       ` Joshua Hahn [this message]
