* [RFC] [PATCH] mm/page_alloc: pcp->batch tuning
@ 2025-10-06 14:54 Joshua Hahn
  2025-10-08 15:34 ` Dave Hansen
  0 siblings, 1 reply; 5+ messages in thread

From: Joshua Hahn @ 2025-10-06 14:54 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, Brendan Jackman, Johannes Weiner, Michal Hocko,
	Suren Baghdasaryan, Vlastimil Babka, Zi Yan, linux-kernel, linux-mm

Recently while working on another patch about batching free_pcppages_bulk [1],
I was curious why pcp->batch was always 63 on my machine. This led me to
zone_batchsize(), where I found this set of lines to determine what the batch
size should be for the host:

	batch = min(zone_managed_pages(zone) >> 10, SZ_1M / PAGE_SIZE);
	batch /= 4;		/* We effectively *= 4 below */
	if (batch < 1)
		batch = 1;

All of this is good, except the comment above which says "We effectively
*= 4 below". Nowhere else in the function zone_batchsize() is there a
corresponding multiplication by 4.

Looking into the history of this, it seems like Dave Hansen had also noticed
this back in 2013 [2]. Turns out there *used* to be a corresponding *= 4,
which was turned into a *= 6 later on to be used in
pageset_setup_from_batch_size(), which no longer exists. This leaves us with
a /= 4 with no corresponding *= 4 anywhere, which leaves pcp->batch mistuned
from the original intent when it was introduced. This is made worse by the
fact that pcp lists are generally larger today than they were in 2013,
meaning batch sizes should have increased, not decreased.

While the obvious solution is to remove this /= 4 to restore the original
tuning heuristics, I think this discovery opens up a discussion on what
pcp->batch should be, and whether it is something that should be dynamically
tuned based on the system's usage, like pcp->high.
Naively removing the /= 4 also changes the tuning for the entire system, so
I am a bit hesitant to simply remove it, even though I believe having a
larger batch size (this means the new default batch size will be the number
of pages it takes to make up 1M) can be helpful for the general scale of
machines running today, as opposed to 12 years ago. I've left this patch as
an RFC to see what folks have to say about this decision.

[1] https://lore.kernel.org/all/20251002204636.4016712-1-joshua.hahnjy@gmail.com/
[2] https://lore.kernel.org/linux-mm/20131015203547.8724C69C@viggo.jf.intel.com/

Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
 mm/page_alloc.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d1d037f97c5f..b4db0d09d145 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5815,7 +5815,6 @@ static int zone_batchsize(struct zone *zone)
	 * and zone lock contention.
	 */
	batch = min(zone_managed_pages(zone) >> 10, SZ_1M / PAGE_SIZE);
-	batch /= 4;		/* We effectively *= 4 below */
	if (batch < 1)
		batch = 1;

base-commit: 097a6c336d0080725c626fda118ecfec448acd0f
--
2.47.3

^ permalink raw reply	[flat|nested] 5+ messages in thread
* Re: [RFC] [PATCH] mm/page_alloc: pcp->batch tuning
  2025-10-06 14:54 [RFC] [PATCH] mm/page_alloc: pcp->batch tuning Joshua Hahn
@ 2025-10-08 15:34 ` Dave Hansen
  2025-10-08 19:36   ` Joshua Hahn
  0 siblings, 1 reply; 5+ messages in thread

From: Dave Hansen @ 2025-10-08 15:34 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Andrew Morton, Brendan Jackman, Johannes Weiner, Michal Hocko,
	Suren Baghdasaryan, Vlastimil Babka, Zi Yan, linux-kernel,
	linux-mm, ying.huang

First of all, I do agree that the comment should go away or get fixed up.

But...

On 10/6/25 07:54, Joshua Hahn wrote:
> This leaves us with a /= 4 with no corresponding *= 4 anywhere, which
> leaves pcp->batch mistuned from the original intent when it was
> introduced. This is made worse by the fact that pcp lists are generally
> larger today than they were in 2013, meaning batch sizes should have
> increased, not decreased.

pcp->batch and pcp->high do very different things. pcp->high is a limit
on the amount of memory that can be tied up. pcp->batch balances
throughput with latency. I'm not sure I buy the idea that a higher
pcp->high means we should necessarily do larger batches.

So I dunno... If someone wanted to alter the initial batch size, they'd
ideally repeat some of Ying's experiments from: 52166607ecc9 ("mm:
restrict the pcp batch scale factor to avoid too long latency").

Better yet, just absorb the /= 4 into the two existing batch assignments.
It will probably compile to exactly the same code, have no functional
changes, and get rid of the comment.

Wouldn't this compile to the same thing?

	batch = zone->managed_pages / 4096;
	if (batch * PAGE_SIZE > 128 * 1024)
		batch = (128 * 1024) / PAGE_SIZE;

^ permalink raw reply	[flat|nested] 5+ messages in thread
* Re: [RFC] [PATCH] mm/page_alloc: pcp->batch tuning
  2025-10-08 15:34 ` Dave Hansen
@ 2025-10-08 19:36   ` Joshua Hahn
  2025-10-09  2:57     ` Huang, Ying
  0 siblings, 1 reply; 5+ messages in thread

From: Joshua Hahn @ 2025-10-08 19:36 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, Brendan Jackman, Johannes Weiner, Michal Hocko,
	Suren Baghdasaryan, Vlastimil Babka, Zi Yan, linux-kernel,
	linux-mm, ying.huang

On Wed, 8 Oct 2025 08:34:21 -0700 Dave Hansen <dave.hansen@intel.com> wrote:

Hello Dave, thank you for your feedback!

> First of all, I do agree that the comment should go away or get fixed up.
>
> But...
>
> On 10/6/25 07:54, Joshua Hahn wrote:
> > This leaves us with a /= 4 with no corresponding *= 4 anywhere, which
> > leaves pcp->batch mistuned from the original intent when it was
> > introduced. This is made worse by the fact that pcp lists are generally
> > larger today than they were in 2013, meaning batch sizes should have
> > increased, not decreased.
>
> pcp->batch and pcp->high do very different things. pcp->high is a limit
> on the amount of memory that can be tied up. pcp->batch balances
> throughput with latency. I'm not sure I buy the idea that a higher
> pcp->high means we should necessarily do larger batches.

I agree with your observation that a higher pcp->high doesn't mean we should
do larger batches. I think what I was trying to get at here was that if
pcp lists are bigger, some other values might want to scale.

For instance, in nr_pcp_free, pcp->batch is used to determine how many
pages should be left in the pcplist (and the rest be freed). Should this
value scale with a bigger pcp? (This is not a rhetorical question, I really
do want to understand what the implications are here).

Another thing that I would like to note is that pcp->high is actually at
least in part a function of pcp->batch. In decay_pcp_high, we set

	pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX), ...)

So here, it seems like a higher batch value would actually lead to a much
lower pcp->high instead. This actually seems actively harmful to the system.
So I'll do a take two of this patch and take your advice below: instead of
getting rid of the /= 4, just fold it in (or add a better explanation as to
why we do this). Another candidate place to do this seems to be where we do
the rounddown_pow_of_two.

> So I dunno... If someone wanted to alter the initial batch size, they'd
> ideally repeat some of Ying's experiments from: 52166607ecc9 ("mm:
> restrict the pcp batch scale factor to avoid too long latency").

I ran a few very naive and quick tests on kernel builds, and it seems like
for larger machines (1TB memory, 316 processors), this leads to a very
significant speedup in system time during a kernel compilation (~10%).

But for smaller machines (250G memory, 176 processors) and (62G memory and 36
processors), this leads to quite a regression (~5%).

So maybe the answer is that this should actually be defined by the machine's
size. In zone_batchsize, we set the value of the batch to:

	min(zone_managed_pages(zone) >> 10, SZ_1M / PAGE_SIZE)

But maybe it makes sense to let this value grow bigger for larger machines?
If anything, I think that the experiment results above do show that batch
size does have an impact on the performance, and the effect can be either
positive or negative based on the machine's size. I can run some more
experiments to see if there's an opportunity to better tune pcp->batch.

> Better yet, just absorb the /= 4 into the two existing batch assignments.
> It will probably compile to exactly the same code, have no functional
> changes, and get rid of the comment.
>
> Wouldn't this compile to the same thing?
>
> 	batch = zone->managed_pages / 4096;
> 	if (batch * PAGE_SIZE > 128 * 1024)
> 		batch = (128 * 1024) / PAGE_SIZE;

But for now, this seems good to me. I'll get rid of the confusing comment,
try to fold in the batch value, and leave a new comment explaining this.

Thank you for your thoughtful review, Dave. I hope you have a great day!
Joshua

^ permalink raw reply	[flat|nested] 5+ messages in thread
* Re: [RFC] [PATCH] mm/page_alloc: pcp->batch tuning
  2025-10-08 19:36 ` Joshua Hahn
@ 2025-10-09  2:57   ` Huang, Ying
  2025-10-09 14:41     ` Joshua Hahn
  0 siblings, 1 reply; 5+ messages in thread

From: Huang, Ying @ 2025-10-09 2:57 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Dave Hansen, Andrew Morton, Brendan Jackman, Johannes Weiner,
	Michal Hocko, Suren Baghdasaryan, Vlastimil Babka, Zi Yan,
	linux-kernel, linux-mm

Hi, Joshua,

Joshua Hahn <joshua.hahnjy@gmail.com> writes:

> On Wed, 8 Oct 2025 08:34:21 -0700 Dave Hansen <dave.hansen@intel.com> wrote:
>
> Hello Dave, thank you for your feedback!
>
>> First of all, I do agree that the comment should go away or get fixed up.
>>
>> But...
>>
>> On 10/6/25 07:54, Joshua Hahn wrote:
>> > This leaves us with a /= 4 with no corresponding *= 4 anywhere, which
>> > leaves pcp->batch mistuned from the original intent when it was
>> > introduced. This is made worse by the fact that pcp lists are generally
>> > larger today than they were in 2013, meaning batch sizes should have
>> > increased, not decreased.
>>
>> pcp->batch and pcp->high do very different things. pcp->high is a limit
>> on the amount of memory that can be tied up. pcp->batch balances
>> throughput with latency. I'm not sure I buy the idea that a higher
>> pcp->high means we should necessarily do larger batches.
>
> I agree with your observation that a higher pcp->high doesn't mean we should
> do larger batches. I think what I was trying to get at here was that if
> pcp lists are bigger, some other values might want to scale.
>
> For instance, in nr_pcp_free, pcp->batch is used to determine how many
> pages should be left in the pcplist (and the rest be freed). Should this
> value scale with a bigger pcp? (This is not a rhetorical question, I really
> do want to understand what the implications are here).
>
> Another thing that I would like to note is that pcp->high is actually at
> least in part a function of pcp->batch. In decay_pcp_high, we set
>
> 	pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX), ...)
>
> So here, it seems like a higher batch value would actually lead to a much
> lower pcp->high instead. This actually seems actively harmful to the system.

Batch here is used to control the latency to free the pages from PCP to
buddy. A larger batch will lead to larger latency; however, it helps to
reduce the size of the PCP more quickly when it becomes idle. So, we need
to balance here.

> So I'll do a take two of this patch and take your advice below: instead of
> getting rid of the /= 4, just fold it in (or add a better explanation as to
> why we do this). Another candidate place to do this seems to be where we do
> the rounddown_pow_of_two.
>
>> So I dunno... If someone wanted to alter the initial batch size, they'd
>> ideally repeat some of Ying's experiments from: 52166607ecc9 ("mm:
>> restrict the pcp batch scale factor to avoid too long latency").
>
> I ran a few very naive and quick tests on kernel builds, and it seems like
> for larger machines (1TB memory, 316 processors), this leads to a very
> significant speedup in system time during a kernel compilation (~10%).
>
> But for smaller machines (250G memory, 176 processors) and (62G memory and 36
> processors), this leads to quite a regression (~5%).
>
> So maybe the answer is that this should actually be defined by the machine's
> size. In zone_batchsize, we set the value of the batch to:
>
> 	min(zone_managed_pages(zone) >> 10, SZ_1M / PAGE_SIZE)
>
> But maybe it makes sense to let this value grow bigger for larger machines?
> If anything, I think that the experiment results above do show that batch
> size does have an impact on the performance, and the effect can be either
> positive or negative based on the machine's size. I can run some more
> experiments to see if there's an opportunity to better tune pcp->batch.

In fact, we do have some mechanism to scale batch size dynamically
already, via pcp->alloc_factor and pcp->free_count.

You could further tune them. Per my understanding, it should be a
balance between throughput and latency.

>> Better yet, just absorb the /= 4 into the two existing batch assignments.
>> It will probably compile to exactly the same code, have no functional
>> changes, and get rid of the comment.
>>
>> Wouldn't this compile to the same thing?
>>
>> 	batch = zone->managed_pages / 4096;
>> 	if (batch * PAGE_SIZE > 128 * 1024)
>> 		batch = (128 * 1024) / PAGE_SIZE;
>
> But for now, this seems good to me. I'll get rid of the confusing comment,
> try to fold in the batch value, and leave a new comment explaining this.
>
> Thank you for your thoughtful review, Dave. I hope you have a great day!
> Joshua

---
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 5+ messages in thread
* Re: [RFC] [PATCH] mm/page_alloc: pcp->batch tuning
  2025-10-09  2:57 ` Huang, Ying
@ 2025-10-09 14:41   ` Joshua Hahn
  0 siblings, 0 replies; 5+ messages in thread

From: Joshua Hahn @ 2025-10-09 14:41 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Dave Hansen, Andrew Morton, Brendan Jackman, Johannes Weiner,
	Michal Hocko, Suren Baghdasaryan, Vlastimil Babka, Zi Yan,
	linux-kernel, linux-mm

On Thu, 09 Oct 2025 10:57:05 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:

> Hi, Joshua,
>
> Joshua Hahn <joshua.hahnjy@gmail.com> writes:
>
> > On Wed, 8 Oct 2025 08:34:21 -0700 Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > Hello Dave, thank you for your feedback!
> >
> >> First of all, I do agree that the comment should go away or get fixed up.
> >>
> >> But...
> >>
> >> On 10/6/25 07:54, Joshua Hahn wrote:
> >> > This leaves us with a /= 4 with no corresponding *= 4 anywhere, which
> >> > leaves pcp->batch mistuned from the original intent when it was
> >> > introduced. This is made worse by the fact that pcp lists are generally
> >> > larger today than they were in 2013, meaning batch sizes should have
> >> > increased, not decreased.
> >>
> >> pcp->batch and pcp->high do very different things. pcp->high is a limit
> >> on the amount of memory that can be tied up. pcp->batch balances
> >> throughput with latency. I'm not sure I buy the idea that a higher
> >> pcp->high means we should necessarily do larger batches.
> >
> > I agree with your observation that a higher pcp->high doesn't mean we should
> > do larger batches. I think what I was trying to get at here was that if
> > pcp lists are bigger, some other values might want to scale.
> >
> > For instance, in nr_pcp_free, pcp->batch is used to determine how many
> > pages should be left in the pcplist (and the rest be freed). Should this
> > value scale with a bigger pcp? (This is not a rhetorical question, I really
> > do want to understand what the implications are here).
> >
> > Another thing that I would like to note is that pcp->high is actually at
> > least in part a function of pcp->batch. In decay_pcp_high, we set
> >
> > 	pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX), ...)
> >
> > So here, it seems like a higher batch value would actually lead to a much
> > lower pcp->high instead. This actually seems actively harmful to the system.

Hi Ying, thank you for your feedback, as always!

> Batch here is used to control the latency to free the pages from PCP to
> buddy. A larger batch will lead to larger latency; however, it helps to
> reduce the size of the PCP more quickly when it becomes idle. So, we need
> to balance here.

Yes, this makes sense to me. I think one thing that I overlooked when I
initially submitted this patch was that even though the pcp size may have
grown in recent times, the tolerance for the latency associated with
freeing it may have not.

> > So I'll do a take two of this patch and take your advice below: instead of
> > getting rid of the /= 4, just fold it in (or add a better explanation as to
> > why we do this). Another candidate place to do this seems to be where we do
> > the rounddown_pow_of_two.
> >
> >> So I dunno... If someone wanted to alter the initial batch size, they'd
> >> ideally repeat some of Ying's experiments from: 52166607ecc9 ("mm:
> >> restrict the pcp batch scale factor to avoid too long latency").
> >
> > I ran a few very naive and quick tests on kernel builds, and it seems like
> > for larger machines (1TB memory, 316 processors), this leads to a very
> > significant speedup in system time during a kernel compilation (~10%).
> >
> > But for smaller machines (250G memory, 176 processors) and (62G memory and 36
> > processors), this leads to quite a regression (~5%).
> >
> > So maybe the answer is that this should actually be defined by the machine's
> > size. In zone_batchsize, we set the value of the batch to:
> >
> > 	min(zone_managed_pages(zone) >> 10, SZ_1M / PAGE_SIZE)
> >
> > But maybe it makes sense to let this value grow bigger for larger machines?
> > If anything, I think that the experiment results above do show that batch
> > size does have an impact on the performance, and the effect can be either
> > positive or negative based on the machine's size. I can run some more
> > experiments to see if there's an opportunity to better tune pcp->batch.
>
> In fact, we do have some mechanism to scale batch size dynamically
> already, via pcp->alloc_factor and pcp->free_count.
>
> You could further tune them. Per my understanding, it should be a
> balance between throughput and latency.

Sounds good to me! I can try to do some tuning to change alloc_factor and
free_count, or see how they currently behave in the system to see if it is
already providing a good balance of throughput and latency.

> >> Better yet, just absorb the /= 4 into the two existing batch assignments.
> >> It will probably compile to exactly the same code, have no functional
> >> changes, and get rid of the comment.
> >>
> >> Wouldn't this compile to the same thing?
> >>
> >> 	batch = zone->managed_pages / 4096;
> >> 	if (batch * PAGE_SIZE > 128 * 1024)
> >> 		batch = (128 * 1024) / PAGE_SIZE;
> >
> > But for now, this seems good to me. I'll get rid of the confusing comment,
> > try to fold in the batch value, and leave a new comment explaining this.
> >
> > Thank you for your thoughtful review, Dave. I hope you have a great day!
> > Joshua
>
> ---
> Best Regards,
> Huang, Ying

Thank you, Ying. For now, I'll just submit a new version of this patch that
doesn't drop the /= 4, but instead folds it into the lines below so that
there is no more confusion about the comment. I hope you have a great day!
Joshua

^ permalink raw reply	[flat|nested] 5+ messages in thread
end of thread, other threads:[~2025-10-09 14:41 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-06 14:54 [RFC] [PATCH] mm/page_alloc: pcp->batch tuning Joshua Hahn
2025-10-08 15:34 ` Dave Hansen
2025-10-08 19:36   ` Joshua Hahn
2025-10-09  2:57     ` Huang, Ying
2025-10-09 14:41       ` Joshua Hahn