From: Zi Yan <ziy@nvidia.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: linux-mm@kvack.org, Vlastimil Babka <vbabka@suse.cz>,
David Hildenbrand <david@kernel.org>,
Lorenzo Stoakes <ljs@kernel.org>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
Rik van Riel <riel@surriel.com>,
linux-kernel@vger.kernel.org
Subject: Re: [RFC 0/2] mm: page_alloc: pcp buddy allocator
Date: Mon, 06 Apr 2026 22:42:50 -0400 [thread overview]
Message-ID: <E057A972-42B4-4EA5-B46E-3663FB676A9C@nvidia.com> (raw)
In-Reply-To: <adPQJfmbpYr3-uzX@cmpxchg.org>
On 6 Apr 2026, at 11:24, Johannes Weiner wrote:
> On Fri, Apr 03, 2026 at 10:27:36PM -0400, Zi Yan wrote:
>> On 3 Apr 2026, at 15:40, Johannes Weiner wrote:
>>> this is an RFC for making the page allocator scale better with higher
>>> thread counts and larger memory quantities.
>>>
>>> In Meta production, we're seeing increasing zone->lock contention that
>>> was traced back to a few different paths. A prominent one is the
>>> userspace allocator, jemalloc. Allocations happen from page faults on
>>> all CPUs running the workload. Frees are cached for reuse, but the
>>> caches are periodically purged back to the kernel from a handful of
>>> purger threads. This breaks affinity between allocations and frees:
>>> Both sides use their own PCPs - one side depletes them, the other one
>>> overfills them. Both sides routinely hit the zone->lock slowpath.
>>>
>>> My understanding is that tcmalloc has a similar architecture.
>>>
>>> Another contributor to contention is process exits, where large
>>> numbers of pages are freed at once. The current PCP can only reduce
>>> lock time when pages are reused. Reuse is unlikely because it's an
>>> avalanche of free pages on a CPU busy walking page tables. Every time
>>> the PCP overflows, the drain acquires the zone->lock and frees pages
>>> one by one, trying to merge buddies together.
>>
>> IIUC, zone->lock held time is mostly spent on free page merging.
>> Have you tried to let PCP do the free page merging before holding
>> zone->lock and returning free pages to buddy? That is a much smaller
>> change than what you proposed. This method might not work if
>> physically contiguous free pages are allocated by separate CPUs,
>> so that PCP merging cannot be done. But this might be rare?
>
> On my 32G system, pcp->high_min for zone Normal is 988. That's one
> block and a half. The rmqueue_smallest policy means the next CPU will
> prefer the remainder of that partial block. So if there is
> concurrency, every other block is shared. Not exactly uncommon. The
> effect lessens the larger the machine is, of course.
>
> But let's assume it's not an issue. How do you know you can safely
> merge with a buddy pfn? You need to establish that it's on that same
> PCP's list. Short of *scanning* the list, it seems something like
> PagePCPBuddy() and page->pcp_cpu is inevitably needed. But of course a
> per-page cpu field is tough to come by.
>
> So the block ownership is more natural, and then you might as well use
> that for affinity routing to increase the odds of merges.
>
> IOW, I'm having a hard time seeing what could be taken away and still
> have it work.
You are right. I was assuming that pages that can be merged are freed
via the same CPU. That rarely happens.
>
>>> The idea proposed here is this: instead of single pages, make the PCP
>>> grab entire pageblocks, split them outside the zone->lock. That CPU
>>> then takes ownership of the block, and all frees route back to that
>>> PCP instead of the freeing CPU's local one.
>>
>> This is basically distributed buddy allocators, right? Instead of
>> relying on a single zone->lock, PCP locks are used. The worst case
>> it can face is that physically contiguous free pages are allocated
>> across all CPUs, so that all CPUs are competing for a single PCP lock.
>
> The worst case is one CPU allocating for everybody else in the system,
> so that all freers route to that PCP.
>
> I've played with microbenchmarks to provoke this, but it looks mostly
> neutral over baseline, at least at the scale of this machine.
>
> In this scenario, baseline will have the affinity mismatch problem:
> the allocating CPU routinely hits zone->lock to refill, and the
> freeing CPUs routinely hit zone->lock to drain and merge.
>
> In the new scheme, they would hit the pcp->lock instead of the
> zone->lock. So not necessarily an improvement in lock breaking. BUT
> because freers refill the allocator's cache, merging is deferred;
> that's a net reduction of work performed under the contended lock.
This makes sense to me.
>
>> It seems that you have not hit this. So I wonder if what I proposed
>> above might work as a simpler approach. Let me know if I am missing anything.
>>
>> I wonder how this distributed buddy allocator would work if anyone
>> wants to allocate >pageblock free pages, like alloc_contig_range().
>> Multiple PCP locks need to be taken one by one. Maybe it is better
>> than taking and dropping zone->lock repeatedly. Have you benchmarked
>> alloc_contig_range(), like hugetlb allocation?
>
> I didn't change that aspect.
>
> The PCPs are still the same size, and PCP pages are still skipped by
> the isolation code.
>
> IOW it's not a purely distributed buddy allocator. It's still just a
> per-cpu cache of limited size. The only thing I'm doing is providing a
> mechanism for splitting and pre-merging at the cache level, and
> setting up affinity/routing rules to increase the chances of
> success. But the impact on alloc_contig should be the same.
Got it. Thanks for the explanation.
>
>>> This has several benefits:
>>>
>>> 1. It right away means coarser/fewer allocation transactions under the
>>> zone->lock.
>>>
>>> 1a. Even if no full free blocks are available (memory pressure or
>>> small zone), splitting at the PCP level means the PCP can still
>>> grab chunks larger than the requested order from the zone
>>> freelists, and dole them out on its own time.
>>>
>>> 2. The pages free back to where the allocations happen, increasing the
>>> odds of reuse and reducing the chances of zone->lock slowpaths.
>>>
>>> 3. The page buddies come back into one place, allowing upfront merging
>>> under the local pcp->lock. This makes for coarser/fewer freeing
>>> transactions under the zone->lock.
>>
>> I wonder if we could go more radical by moving buddy allocator out of
>> zone->lock completely to PCP lock. If one PCP runs out of free pages,
>> it can steal another PCP's whole pageblock. I probably should do some
>> literature investigation on this. Some research must have been done
>> on this.
>
> This is an interesting idea. Make the zone buddy a pure block economy
> and remove all buddy code from it. Slowpath allocs and frees would
> always be in whole blocks.
>
> You'd have to come up with a natural stealing order. If one CPU needs
> something it doesn't have, which CPUs, and which order, do you look at
> for stealing?
One naive idea is to make the zone buddy keep track of PCP free lists
for stealing.
>
> I think you'd still have to route back frees to the nominal owner of
> the block, or stealing could scatter pages all over the place and we'd
> never be able to merge them back up.
Basically, we want to keep free pages mergeable as much as possible,
something like free page compaction across all PCPs.
>
> I think you'd also need to pull accounting (NR_FREE_PAGES) to the
> per-cpu level, and inform compaction/isolation to deal with these
> pages, since the majority default is now distributed.
>
> But the scenario where one CPU needs what another one has is an
> interesting one. I didn't invent anything new for this for now, but
> rather rely on how we have been handling this through the zone
> freelists. But I do think it's a little silly: right now, if a CPU
> needs something another CPU might have, we ask EVERY CPU in the system
> to drain their cache into the shared pool - simultaneously - running
> the full buddy merge algorithm on everything that comes in. The CPU
> grabs a small handful of these pages, most likely having to split
> again. All other CPUs are now cache cold on the next request.
Yes, a better way might be for a CPU that wants something to ask the
other CPUs to drain only the minimal amount of free pages. But I do
not have a good idea of how to do that yet.
It sounds to me that your current approach is a good first step towards
a distributed buddy allocator. I will check the code and think about it
more and ask questions later.
Thank you for the explanation.
Best Regards,
Yan, Zi
Thread overview: 11+ messages
2026-04-03 19:40 Johannes Weiner
2026-04-03 19:40 ` [RFC 1/2] mm: page_alloc: replace pageblock_flags bitmap with struct pageblock_data Johannes Weiner
2026-04-04 1:43 ` Rik van Riel
2026-04-03 19:40 ` [RFC 2/2] mm: page_alloc: per-cpu pageblock buddy allocator Johannes Weiner
2026-04-04 1:42 ` Rik van Riel
2026-04-06 16:12 ` Johannes Weiner
2026-04-06 17:31 ` Frank van der Linden
2026-04-06 21:58 ` Johannes Weiner
2026-04-04 2:27 ` [RFC 0/2] mm: page_alloc: pcp " Zi Yan
2026-04-06 15:24 ` Johannes Weiner
2026-04-07 2:42 ` Zi Yan [this message]