linux-mm.kvack.org archive mirror
* [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
@ 2026-02-24  2:52 Ming Lei
  2026-02-24  5:00 ` Harry Yoo
  2026-02-24  6:51 ` Hao Li
  0 siblings, 2 replies; 4+ messages in thread
From: Ming Lei @ 2026-02-24  2:52 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton
  Cc: ming.lei, linux-mm, linux-kernel, linux-block

Hello Vlastimil and MM guys,

The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
performance regression for workloads with persistent cross-CPU
alloc/free patterns. In the ublk null target benchmark, IOPS drops
from ~36M on v6.19 to ~13M (a ~64% drop).

Bisecting within the sheaves series is blocked by a kernel panic at
17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
paths"), so the exact first bad commit could not be identified.

Reproducer
==========

Hardware: NUMA machine with >= 32 CPUs
Kernel:   v7.0-rc (with slab/for-7.0/sheaves merged)

    # build kublk selftest
    make -C tools/testing/selftests/ublk/

    # create ublk null target device with 16 queues
    tools/testing/selftests/ublk/kublk add -t null -q 16

    # run fio/t/io_uring benchmark: 16 jobs, 20 seconds, non-polled
    taskset -c 0-31 fio/t/io_uring -p0 -n 16 -r 20 /dev/ublkb0

    # cleanup
    tools/testing/selftests/ublk/kublk del -n 0

Good: v6.19 (and 41f1a08645ab, the mainline parent of the slab merge)
Bad:  815c8e35511d (Merge branch 'slab/for-7.0/sheaves' into slab/for-next)

perf profile (bad kernel)
=========================

~47% of CPU time is spent in bio allocation hitting the SLUB slow path,
with massive spinlock contention on the node partial list lock:

+   47.65%     1.21%  io_uring  [k] bio_alloc_bioset
-   44.87%     0.45%  io_uring  [k] kmem_cache_alloc_noprof
   - 44.41% kmem_cache_alloc_noprof
      - 43.89% ___slab_alloc
         + 41.16% get_from_any_partial
           0.91% get_from_partial_node
         + 0.87% alloc_from_new_slab
         + 0.65% allocate_slab
-   44.70%     0.21%  io_uring  [k] mempool_alloc_noprof
   - 44.49% mempool_alloc_noprof
      - 44.43% kmem_cache_alloc_noprof
         - 43.90% ___slab_alloc
            + 41.18% get_from_any_partial
              0.90% get_from_partial_node
            + 0.87% alloc_from_new_slab
            + 0.65% allocate_slab
+   41.23%     0.10%  io_uring  [k] get_from_any_partial
+   40.82%     0.48%  io_uring  [k] __raw_spin_lock_irqsave
-   40.75%     0.20%  io_uring  [k] get_from_partial_node
   - 40.56% get_from_partial_node
      - 38.83% __raw_spin_lock_irqsave
           38.65% native_queued_spin_lock_slowpath

Analysis
========

The ublk null target workload exposes a cross-CPU slab allocation
pattern: bios are allocated on the io_uring submitter CPU during block
layer submission, but freed on a different CPU — the ublk daemon thread
that runs the completion via io_uring_cmd_complete_in_task() task work.
The completion CPU stays within the same LLC or NUMA node as the
submission CPU.

This cross-CPU alloc/free pattern is not unique to ublk. The block
layer's default rq_affinity=1 setting completes requests on a CPU
sharing LLC with the submission CPU, which similarly causes bio freeing
on a different CPU than allocation. The ublk null target simply makes
this pattern more pronounced and measurable because all overhead is in
the bio alloc/free path with no actual I/O.

**The following is from AI, just for reference**

The result is that the allocating CPU's per-CPU slab caches are
continuously drained without being replenished by local frees. The bio
layer's own per-CPU cache (bio_alloc_cache) suffers the same mismatch:
freed bios go to the completion CPU's cache via bio_put_percpu_cache(),
leaving the submitter CPUs' caches empty and falling through to
mempool_alloc() -> kmem_cache_alloc() -> SLUB slow path.
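The drain effect can be sketched with a toy model (not the kernel code:
all names and the cache size are made up) where every allocation happens
on the submit CPU's cache and every free lands in the completion CPU's
cache:

```c
/* Toy model of the per-CPU cache mismatch: one object cache per "CPU",
 * allocations on the submitter, frees on the completer. */
#define CACHE_CAP 32

struct percpu_cache {
	int nr;               /* objects currently in this CPU's cache */
	long slow_path_hits;  /* allocations that missed the cache */
};

/* Allocate from this CPU's cache; a miss stands in for the
 * fall-through to mempool_alloc()/SLUB in the real bio path. */
static int cache_alloc(struct percpu_cache *c)
{
	if (c->nr > 0) {
		c->nr--;
		return 1;
	}
	c->slow_path_hits++;
	return 0;
}

/* Free into the *freeing* CPU's cache, as bio_put_percpu_cache()
 * effectively does in this workload. */
static void cache_free(struct percpu_cache *c)
{
	if (c->nr < CACHE_CAP)
		c->nr++;
}
```

Driving this with 1000 alloc/free pairs leaves the submitter's cache
permanently empty after the first CACHE_CAP allocations (968 of 1000
allocations miss), while the completion CPU's cache sits full and
unused.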

In v6.19, SLUB handled this with a 3-tier allocation hierarchy:

  Tier 1: CPU slab freelist         lock-free (cmpxchg)
  Tier 2: CPU partial slab list     lock-free (per-CPU local_lock)
  Tier 3: Node partial list         kmem_cache_node->list_lock

The CPU partial slab list (Tier 2) was the critical buffer. It was
populated during __slab_free() -> put_cpu_partial() and provided a
lock-free pool of partial slabs per CPU. Even when the CPU slab was
exhausted, the CPU partial list could supply more slabs without
touching any shared lock.
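As a rough sketch of that ordering (toy counters, not the real SLUB data
structures; OBJS_PER_SLAB and the struct layout are invented), only the
last tier touches shared state:

```c
/* Toy model of the v6.19 three-tier allocation order. */
#define OBJS_PER_SLAB 8

struct cpu_state {
	int freelist_objs;    /* tier 1: cpu slab freelist (lock-free) */
	int partial_slabs;    /* tier 2: cpu partial list (lock-free) */
};

struct node_state {
	int partial_slabs;    /* tier 3: node partial list */
	long list_lock_taken; /* how often the shared lock was needed */
};

static int slab_alloc(struct cpu_state *c, struct node_state *n)
{
	if (c->freelist_objs > 0) {           /* tier 1 */
		c->freelist_objs--;
		return 1;
	}
	if (c->partial_slabs > 0) {           /* tier 2: no shared lock */
		c->partial_slabs--;
		c->freelist_objs = OBJS_PER_SLAB - 1;
		return 1;
	}
	n->list_lock_taken++;                 /* tier 3: list_lock */
	if (n->partial_slabs > 0) {
		n->partial_slabs--;
		c->freelist_objs = OBJS_PER_SLAB - 1;
		return 1;
	}
	return 0;                             /* would allocate a new slab */
}
```

With an empty freelist but four slabs of eight objects on the CPU
partial list, 32 consecutive allocations complete without the node lock;
only the 33rd has to take it.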

The sheaves architecture replaces this with a 2-tier hierarchy:

  Tier 1: Per-CPU sheaf             lock-free (local_lock)
  Tier 2: Node partial list         kmem_cache_node->list_lock

The intermediate lock-free tier is gone. When the per-CPU sheaf is
empty and the spare sheaf is also empty, every refill must go through
the node partial list, requiring kmem_cache_node->list_lock. With 16
CPUs simultaneously allocating bios and all hitting empty sheaves, this
creates a thundering herd on the node list_lock.
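The same toy-model style makes the difference visible (again invented
names and a made-up SHEAF_CAP, not the actual sheaf code): once the main
and spare sheaves are empty and local frees never refill them, every
batch refill goes through the node lock.

```c
/* Toy model of the two-tier sheaf order under a no-local-frees load. */
#define SHEAF_CAP 32

struct sheaf_cpu {
	int main_objs;   /* tier 1: per-cpu main sheaf */
	int spare_objs;  /* backup sheaf, swapped in when main empties */
};

struct sheaf_node {
	long list_lock_taken;
};

static void sheaf_alloc(struct sheaf_cpu *c, struct sheaf_node *n)
{
	if (c->main_objs > 0) {
		c->main_objs--;
		return;
	}
	if (c->spare_objs > 0) {          /* swap in the spare sheaf */
		c->main_objs = c->spare_objs - 1;
		c->spare_objs = 0;
		return;
	}
	n->list_lock_taken++;             /* tier 2: node list_lock */
	c->main_objs = SHEAF_CAP - 1;     /* batch refill from the node */
}
```

Each simulated CPU burns through its main and spare sheaves (64
allocations) and then takes the node lock once every SHEAF_CAP
allocations; 16 CPUs doing 1000 allocations each would pile several
hundred acquisitions onto one shared lock.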

When the local node's partial list is also depleted (objects freed on
remote nodes accumulate there instead), get_from_any_partial() kicks in
to search other NUMA nodes, compounding the contention with cross-NUMA
list_lock acquisition — explaining the 41% in get_from_any_partial ->
native_queued_spin_lock_slowpath seen in the profile.

The mitigation in 40fd0acc45d0 ("slub: avoid list_lock contention from
__refill_objects_any()") uses spin_trylock for cross-NUMA refill, but
does not address the fundamental architectural issue: the missing
lock-free intermediate caching tier that the CPU partial list provided.
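The shape of that mitigation, as I understand it, is roughly the
following (a hypothetical simplified sketch, not the actual 40fd0acc45d0
code): back off from a contended remote node instead of spinning, at the
cost of having to allocate a fresh slab instead.

```c
/* Sketch of a trylock-style cross-NUMA refill: skip a busy node
 * rather than queue on its list_lock. */
#include <stdbool.h>

struct remote_node {
	bool lock_held;     /* stands in for spin_trylock() failing */
	int partial_slabs;
};

/* Returns 1 if a slab was taken from the remote node, 0 if skipped. */
static int refill_from_remote(struct remote_node *n)
{
	if (n->lock_held)       /* trylock failed: do not wait */
		return 0;
	n->lock_held = true;    /* trylock succeeded */
	int got = 0;
	if (n->partial_slabs > 0) {
		n->partial_slabs--;
		got = 1;
	}
	n->lock_held = false;   /* unlock */
	return got;
}
```

This caps time spent spinning on remote list_locks, but as noted above
it only softens the contention; the lock-free intermediate tier is still
absent.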

Thanks,
Ming



