From: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
To: Johannes Weiner <hannes@cmpxchg.org>, linux-mm@kvack.org
Cc: Vlastimil Babka <vbabka@suse.cz>, Zi Yan <ziy@nvidia.com>,
David Hildenbrand <david@kernel.org>,
Lorenzo Stoakes <ljs@kernel.org>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
Rik van Riel <riel@surriel.com>,
linux-kernel@vger.kernel.org
Subject: Re: [RFC 2/2] mm: page_alloc: per-cpu pageblock buddy allocator
Date: Fri, 10 Apr 2026 11:48:21 +0200
Message-ID: <45f3a5ba-9f61-4ee7-bc9a-af50057c0865@kernel.org>
In-Reply-To: <20260403194526.477775-3-hannes@cmpxchg.org>
On 4/3/26 21:40, Johannes Weiner wrote:
> On large machines, zone->lock is a scaling bottleneck for page
> allocation. Two common patterns drive contention:
>
> 1. Affinity violations: pages are allocated on one CPU but freed on
> another (jemalloc, exit, reclaim). The freeing CPU's PCP drains to
> zone buddy, and the allocating CPU refills from zone buddy -- both
> under zone->lock, defeating PCP batching entirely.
>
> 2. Concurrent exits: processes tearing down large address spaces
> simultaneously overwhelm per-CPU PCP capacity, serializing on
> zone->lock for overflow.
>
> Solution
>
> Extend the PCP to operate on whole pageblocks with ownership tracking.
Hi Johannes,

interesting ideas, as usual from you :) I'll try to point out some things
that immediately came to mind, though this is not a thorough review.
> Each CPU claims pageblocks from the zone buddy and splits them
> locally. Pages are tagged with their owning CPU, so frees route back
> to the owner's PCP regardless of which CPU frees. This eliminates
> affinity violations: the owner CPU's PCP absorbs both allocations and
> frees for its blocks without touching zone->lock.
The details differ a lot, of course (e.g. slab has no buddy merging), but I
can see a parallel between SLUB's cpu slabs and these "cpu-owned
pageblocks". However, SLUB moved in the direction of today's pcplists by
replacing cpu slabs with sheaves, while this moves in the opposite
direction :)
> It also shortens zone->lock hold time during drain and refill
> cycles. Whole blocks are acquired under zone->lock and then split
> outside of it. Affinity routing to the owning PCP on free enables
> buddy merging outside the zone->lock as well; a bottom-up merge pass
> runs under pcp->lock on drain, freeing larger chunks under zone->lock.
>
> PCP refill uses a four-phase approach:
>
> Phase 0: recover owned fragments previously drained to zone buddy.
Note this is done by pfn scanning under zone->lock. Is there a risk of
defeating the short-lock-hold-time goal?
> Phase 1: claim whole pageblocks from zone buddy.
> Phase 2: grab sub-pageblock chunks without migratetype stealing.
> Phase 3: traditional __rmqueue() with migratetype fallback.
>
> Phase 0/1 pages are owned and marked PagePCPBuddy, making them
> eligible for PCP-level merging. Phase 2/3 pages are cached on PCP for
> batching only -- no ownership, no merging.
> However, Phase 2 still
> benefits from chunky zone transactions: it pulls higher-order entries
> from zone free lists under zone->lock and splits them on the PCP
> outside of it, rather than acquiring zone->lock per page.
I think this particular benefit could be achieved even today, without the
other changes. Should we try that first?
> When PCP batch sizes are small (small machines with few CPUs) or the
> zone is fragmented and no whole pageblocks are available, refill falls
> through to Phase 2/3 naturally. The allocator degrades gracefully to
> the original page-at-a-time behavior.
>
> When owned blocks accumulate long-lived allocations (e.g. a mix of
> anonymous and file cache pages), partial block drains send the free
> fragments to zone buddy and remember the block, so Phase 0 can recover
> them on the next refill. This allows the allocator to pack new
> allocations next to existing ones in already-committed blocks rather
> than consuming fresh pageblocks, keeping fragmentation contained.
So this reads like there could be multiple owned blocks (is there any
limit?) with only a handful of free pages each, which increases my concern
about the pfn scanning under zone->lock.
> Data structures:
>
> - per_cpu_pages: +owned_blocks list head, +PCPF_CPU_DEAD flag to gate
> enqueuing on offline CPUs.
> - pageblock_data: +cpu (owner), +block_pfn, +cpu_node (recovery list
> linkage). 32 bytes per pageblock, ~16KB per GB with 2MB pageblocks.
> - PagePCPBuddy page type marks pages eligible for PCP-level merging.
>
> [riel@surriel.com: fix ownership clearing on direct block frees]
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> /*
> @@ -2907,9 +3205,11 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
> {
> unsigned long UP_flags;
> struct per_cpu_pages *pcp;
> + struct pageblock_data *pbd;
> struct zone *zone;
> unsigned long pfn = page_to_pfn(page);
> int migratetype;
> + int owner_cpu, cache_cpu;
>
> if (!pcp_allowed_order(order)) {
> __free_pages_ok(page, order, fpi_flags);
> @@ -2927,7 +3227,8 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
> * excessively into the page allocator
> */
> zone = page_zone(page);
> - migratetype = get_pfnblock_migratetype(page, pfn);
> + pbd = pfn_to_pageblock(page, pfn);
> + migratetype = pbd_migratetype(pbd);
> if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
> if (unlikely(is_migrate_isolate(migratetype))) {
> free_one_page(zone, page, pfn, order, fpi_flags);
> @@ -2941,15 +3242,45 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
> add_page_to_zone_llist(zone, page, order);
> return;
> }
> - pcp = pcp_spin_trylock(zone->per_cpu_pageset, UP_flags);
> - if (pcp) {
> - if (!free_frozen_page_commit(zone, pcp, page, migratetype,
> - order, fpi_flags, &UP_flags))
> +
> + /*
> + * Route page to the owning CPU's PCP for merging, or to
> + * the local PCP for batching (zone-owned pages). Zone-owned
> + * pages are cached without PagePCPBuddy -- the merge pass
> + * skips them, so they're inert on any PCP list and drain
> + * individually to zone buddy.
> + *
> + * Ownership is stable here: it can only change when the
> + * pageblock is complete -- either fully free in zone buddy
> + * (Phase 1 claims) or fully merged on PCP (drain disowns).
> + * Since we hold this page, neither can happen.
> + */
> + owner_cpu = pbd->cpu - 1;
> + cache_cpu = owner_cpu;
> + if (cache_cpu < 0)
> + cache_cpu = raw_smp_processor_id();
> +
> + pcp = per_cpu_ptr(zone->per_cpu_pageset, cache_cpu);
> + if (unlikely(fpi_flags & FPI_TRYLOCK) || !in_task()) {
> + if (!spin_trylock_irqsave(&pcp->lock, UP_flags)) {
> + free_one_page(zone, page, pfn, order, fpi_flags);
> return;
> - pcp_spin_unlock(pcp, UP_flags);
> + }
> } else {
> + spin_lock_irqsave(&pcp->lock, UP_flags);
Hm, was it necessary to replace the pcp trylock scheme with
spin_lock_irqsave() here?
> + }
> +
> + if (unlikely(pcp->flags & PCPF_CPU_DEAD)) {
> + spin_unlock_irqrestore(&pcp->lock, UP_flags);
> free_one_page(zone, page, pfn, order, fpi_flags);
> + return;
> }
> +
> + free_frozen_page_commit(zone, pcp, page,
> + migratetype, order, fpi_flags,
> + cache_cpu == owner_cpu);
> +
> + spin_unlock_irqrestore(&pcp->lock, UP_flags);
> }
>
> void free_frozen_pages(struct page *page, unsigned int order)
Thread overview: 12+ messages
2026-04-03 19:40 [RFC 0/2] mm: page_alloc: pcp " Johannes Weiner
2026-04-03 19:40 ` [RFC 1/2] mm: page_alloc: replace pageblock_flags bitmap with struct pageblock_data Johannes Weiner
2026-04-04 1:43 ` Rik van Riel
2026-04-03 19:40 ` [RFC 2/2] mm: page_alloc: per-cpu pageblock buddy allocator Johannes Weiner
2026-04-04 1:42 ` Rik van Riel
2026-04-06 16:12 ` Johannes Weiner
2026-04-06 17:31 ` Frank van der Linden
2026-04-06 21:58 ` Johannes Weiner
2026-04-10 9:48 ` Vlastimil Babka (SUSE) [this message]
2026-04-04 2:27 ` [RFC 0/2] mm: page_alloc: pcp " Zi Yan
2026-04-06 15:24 ` Johannes Weiner
2026-04-07 2:42 ` Zi Yan