From: Johannes Weiner <hannes@cmpxchg.org>
To: linux-mm@kvack.org
Cc: Vlastimil Babka <vbabka@suse.cz>, Zi Yan <ziy@nvidia.com>,
David Hildenbrand <david@kernel.org>,
Lorenzo Stoakes <ljs@kernel.org>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
Rik van Riel <riel@surriel.com>,
linux-kernel@vger.kernel.org
Subject: [RFC 0/2] mm: page_alloc: pcp buddy allocator
Date: Fri, 3 Apr 2026 15:40:33 -0400
Message-ID: <20260403194526.477775-1-hannes@cmpxchg.org>
Hi,
this is an RFC for making the page allocator scale better with higher
thread counts and larger memory quantities.
In Meta production, we're seeing increasing zone->lock contention that
was traced back to a few different paths. A prominent one is the
userspace allocator, jemalloc. Allocations happen from page faults on
all CPUs running the workload. Frees are cached for reuse, but the
caches are periodically purged back to the kernel from a handful of
purger threads. This breaks affinity between allocations and frees:
both sides use their own PCPs - one side depletes them, the other one
overfills them. Both sides routinely hit the zone->lock slowpath.
My understanding is that tcmalloc has a similar architecture.
Another contributor to contention is process exits, where large
numbers of pages are freed at once. The current PCP can only reduce
lock time when pages are reused. Reuse is unlikely because it's an
avalanche of free pages on a CPU busy walking page tables. Every time
the PCP overflows, the drain acquires the zone->lock and frees pages
one by one, trying to merge buddies together.
The idea proposed here is this: instead of single pages, make the PCP
grab entire pageblocks, split them outside the zone->lock. That CPU
then takes ownership of the block, and all frees route back to that
PCP instead of the freeing CPU's local one.
This has several benefits:
1. Allocation transactions under the zone->lock are immediately
coarser and fewer.
1a. Even if no fully free blocks are available (memory pressure or a
small zone), splitting at the PCP level means the PCP can still
grab chunks larger than the requested order from the zone
freelists, and dole them out on its own time.
2. Pages are freed back to where the allocations happen, increasing
the odds of reuse and reducing trips into the zone->lock slowpath.
3. Page buddies come back together in one place, allowing upfront
merging under the local pcp->lock. This results in coarser and
fewer freeing transactions under the zone->lock.
The big concern is fragmentation. Movable allocations tend to be a mix
of short-lived anon and long-lived file cache pages. By the time the
PCP needs to drain due to thresholds or pressure, the blocks might not
be fully re-assembled yet. To prevent gobbling up and fragmenting ever
more blocks, partial blocks are remembered on drain and their pages
queued last on the zone freelist. When a PCP refills, it first tries
to recover any such fragment blocks.
On small or memory-pressured machines, the PCP degrades to its
previous behavior. If a whole block doesn't fit within the pcp->high
limit, or a whole block isn't available, the refill grabs smaller
chunks that aren't marked for ownership. The free side then uses the
local PCP as before.
I still need to run broader benchmarks, but I've been consistently
seeing a 3-4% reduction in %sys time for simple kernel builds on my
32-way, 32G RAM test machine.
A synthetic test on the same machine that allocates on many CPUs and
frees on just a few sees a consistent 1% increase in throughput.
I would expect those numbers to increase with higher concurrency and
larger memory volumes, but verifying that is TBD.
Sending an RFC to get an early gauge on direction.
Based on 0257f64bdac7fdca30fa3cae0df8b9ecbec7733a.
include/linux/mmzone.h | 38 ++-
include/linux/page-flags.h | 9 +
mm/debug.c | 1 +
mm/internal.h | 17 +
mm/mm_init.c | 25 +-
mm/page_alloc.c | 784 +++++++++++++++++++++++++++++++------------
mm/sparse.c | 3 +-
7 files changed, 622 insertions(+), 255 deletions(-)