From: Harry Yoo <harry.yoo@oracle.com>
To: Ming Lei <ming.lei@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>,
Andrew Morton <akpm@linux-foundation.org>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
linux-block@vger.kernel.org, Hao Li <hao.li@linux.dev>,
surenb@google.com
Subject: Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
Date: Tue, 24 Feb 2026 14:00:15 +0900 [thread overview]
Message-ID: <aZ0wX_QuxNTxXHMj@hyeyoo> (raw)
In-Reply-To: <aZ0SbIqaIkwoW2mB@fedora>
On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> Hello Vlastimil and MM guys,
Hi Ming, thanks for the report!
> The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
> 'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
> performance regression for workloads with persistent cross-CPU
> alloc/free patterns. ublk null target benchmark IOPS drops
> significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
> drop).
>
> Bisecting within the sheaves series is blocked by a kernel panic at
> 17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
> paths"), so the exact first bad commit could not be identified.
Ouch. Why did it crash?
> Reproducer
> ==========
>
> Hardware: NUMA machine with >= 32 CPUs
> Kernel: v7.0-rc (with slab/for-7.0/sheaves merged)
>
> # build kublk selftest
> make -C tools/testing/selftests/ublk/
>
> # create ublk null target device with 16 queues
> tools/testing/selftests/ublk/kublk add -t null -q 16
>
> # run fio/t/io_uring benchmark: 16 jobs, 20 seconds, non-polled
> taskset -c 0-31 fio/t/io_uring -p0 -n 16 -r 20 /dev/ublkb0
>
> # cleanup
> tools/testing/selftests/ublk/kublk del -n 0
>
> Good: v6.19 (and 41f1a08645ab, the mainline parent of the slab merge)
> Bad: 815c8e35511d (Merge branch 'slab/for-7.0/sheaves' into slab/for-next)
Thanks for such detailed steps to reproduce :)
> perf profile (bad kernel)
> =========================
>
> ~47% of CPU time is spent in bio allocation hitting the SLUB slow path,
> with massive spinlock contention on the node partial list lock:
>
> + 47.65% 1.21% io_uring [k] bio_alloc_bioset
> - 44.87% 0.45% io_uring [k] kmem_cache_alloc_noprof
> - 44.41% kmem_cache_alloc_noprof
> - 43.89% ___slab_alloc
> + 41.16% get_from_any_partial
> 0.91% get_from_partial_node
> + 0.87% alloc_from_new_slab
> + 0.65% allocate_slab
> - 44.70% 0.21% io_uring [k] mempool_alloc_noprof
> - 44.49% mempool_alloc_noprof
> - 44.43% kmem_cache_alloc_noprof
> - 43.90% ___slab_alloc
> + 41.18% get_from_any_partial
> 0.90% get_from_partial_node
> + 0.87% alloc_from_new_slab
> + 0.65% allocate_slab
> + 41.23% 0.10% io_uring [k] get_from_any_partial
> + 40.82% 0.48% io_uring [k] __raw_spin_lock_irqsave
> - 40.75% 0.20% io_uring [k] get_from_partial_node
> - 40.56% get_from_partial_node
> - 38.83% __raw_spin_lock_irqsave
> 38.65% native_queued_spin_lock_slowpath
That's pretty severe contention. Interestingly, the profile shows
severe contention on the alloc path, but I don't see the free path here.
I'm wondering why only the alloc path is suffering, hmm...
Anyway, I think there may be two pieces contributing to this contention:
Part 1) We probably made the portion of slowpath bigger,
by caching a smaller number of objects per CPU
after transitioning to sheaves.
Part 2) We probably made the slowpath much slower.
We need to investigate those parts separately.
Regarding Part 1:
# Point 1. The CPU slab was not considered in the sheaf capacity calculation
calculate_sheaf_capacity() does not take into account that the CPU slab
was also cached per CPU. Shouldn't we add oo_objects(s->oo) to the existing
calculation to cache a number of objects similar to the CPU slab + percpu
partial slab list layers that SLUB previously had?
# Point 2. SLUB no longer relies on the "slabs are half-full" assumption,
# and that probably means we're caching fewer objects per CPU.
Because SLUB previously assumed "slabs are half-full" when calculating
the number of slabs to cache per CPU, it could actually cache twice
as many objects as intended when slabs are mostly empty.
Because sheaves track the number of objects precisely, that inaccuracy
is gone. If the workload was previously benefiting from the inaccuracy,
sheaves can make each CPU cache fewer objects than the percpu slab
caching layer did.
Anyway, I guess we need to check how many objects are actually
cached per CPU w/ and w/o sheaves, during the benchmark.
After making sure the number of objects cached per CPU is the same as
before, we could further investigate how much Part 2 plays into it.
Slightly off-topic, but slab currently doesn't let system administrators
set a custom sheaf_capacity. Instead, calculate_sheaf_capacity() sets
the default capacity. I think we need to allow sysadmins to set a custom
sheaf_capacity in the very near future.
> Analysis
> ========
>
> The ublk null target workload exposes a cross-CPU slab allocation
> pattern: bios are allocated on the io_uring submitter CPU during block
> layer submission, but freed on a different CPU — the ublk daemon thread
> that runs the completion via io_uring_cmd_complete_in_task() task work.
> And the completion CPU stays in the same LLC or NUMA node as the submission CPU.
Ok, so a submitter CPU keeps allocating objects, while a completion CPU
keeps freeing objects.
> This cross-CPU alloc/free pattern is not unique to ublk. The block
> layer's default rq_affinity=1 setting completes requests on a CPU
> sharing LLC with the submission CPU, which similarly causes bio freeing
> on a different CPU than allocation. The ublk null target simply makes
> this pattern more pronounced and measurable because all overhead is in
> the bio alloc/free path with no actual I/O.
>
> **The following is from AI, just for reference**
>
> The result is that the allocating CPU's per-CPU slab caches are
> continuously drained without being replenished by local frees. The bio
> layer's own per-CPU cache (bio_alloc_cache) suffers the same mismatch:
> freed bios go to the completion CPU's cache via bio_put_percpu_cache(),
> leaving the submitter CPUs' caches empty and falling through to
> mempool_alloc() -> kmem_cache_alloc() -> SLUB slow path.
Ok.
> In v6.19, SLUB handled this with a 3-tier allocation hierarchy:
>
> Tier 1: CPU slab freelist lock-free (cmpxchg)
> Tier 2: CPU partial slab list lock-free (per-CPU local_lock)
> Tier 3: Node partial list kmem_cache_node->list_lock
>
> The CPU partial slab list (Tier 2) was the critical buffer. It was
> populated during __slab_free() -> put_cpu_partial() and provided a
> lock-free pool of partial slabs per CPU. Even when the CPU slab was
> exhausted, the CPU partial list could supply more slabs without
> touching any shared lock.
Well, the sheaves layer is supposed to provide a similar lock-free pool
of objects per CPU. The percpu slab layer was supposed to cache a certain
number of objects (from multiple slabs), which is translated to the
sheaf capacity now.
> The sheaves architecture replaces this with a 2-tier hierarchy:
>
> Tier 1: Per-CPU sheaf lock-free (local_lock)
> Tier 2: Node partial list kmem_cache_node->list_lock
>
> The intermediate lock-free tier is gone. When the per-CPU sheaf is
> empty and the spare sheaf is also empty, every refill must go through
> the node partial list, requiring kmem_cache_node->list_lock. With 16
> CPUs simultaneously allocating bios and all hitting empty sheaves, this
> creates a thundering herd on the node list_lock.
>
> When the local node's partial list is also depleted (objects freed on
> remote nodes accumulate there instead), get_from_any_partial() kicks in
> to search other NUMA nodes, compounding the contention with cross-NUMA
> list_lock acquisition — explaining the 41% in get_from_any_partial ->
> native_queued_spin_lock_slowpath seen in the profile.
Again, the sheaves layer is supposed to cache a number of objects
similar to what Tier 1 + Tier 2 previously covered... oh, wait.
The sheaf capacity calculation logic does not take "Tier 1 CPU slab
freelist" into account.
> The mitigation in 40fd0acc45d0 ("slub: avoid list_lock contention from
> __refill_objects_any()") uses spin_trylock for cross-NUMA refill, but
> does not address the fundamental architectural issue: the missing
> lock-free intermediate caching tier that the CPU partial list provided.
>
> Thanks,
> Ming
--
Cheers,
Harry / Hyeonggon
Thread overview: 4+ messages
2026-02-24 2:52 Ming Lei
2026-02-24 5:00 ` Harry Yoo [this message]
2026-02-24 6:51 ` Hao Li
2026-02-24 7:10 ` Harry Yoo