From: Ming Lei <ming.lei@redhat.com>
To: Vlastimil Babka <vbabka@suse.cz>,
Andrew Morton <akpm@linux-foundation.org>
Cc: ming.lei@redhat.com, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, linux-block@vger.kernel.org
Subject: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
Date: Tue, 24 Feb 2026 10:52:28 +0800
Message-ID: <aZ0SbIqaIkwoW2mB@fedora>
Hello Vlastimil and MM guys,
The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
performance regression for workloads with persistent cross-CPU
alloc/free patterns. The ublk null target benchmark drops from ~36M
IOPS on v6.19 to ~13M IOPS, a ~64% regression.
Bisecting within the sheaves series is blocked by a kernel panic at
17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
paths"), so the exact first bad commit could not be identified.
Reproducer
==========
Hardware: NUMA machine with >= 32 CPUs
Kernel: v7.0-rc (with slab/for-7.0/sheaves merged)
# build kublk selftest
make -C tools/testing/selftests/ublk/
# create ublk null target device with 16 queues
tools/testing/selftests/ublk/kublk add -t null -q 16
# run fio/t/io_uring benchmark: 16 jobs, 20 seconds, non-polled
taskset -c 0-31 fio/t/io_uring -p0 -n 16 -r 20 /dev/ublkb0
# cleanup
tools/testing/selftests/ublk/kublk del -n 0
Good: v6.19 (and 41f1a08645ab, the mainline parent of the slab merge)
Bad: 815c8e35511d (Merge branch 'slab/for-7.0/sheaves' into slab/for-next)
perf profile (bad kernel)
=========================
~47% of CPU time is spent in bio allocation hitting the SLUB slow path,
with massive spinlock contention on the node partial list lock:
+ 47.65% 1.21% io_uring [k] bio_alloc_bioset
- 44.87% 0.45% io_uring [k] kmem_cache_alloc_noprof
- 44.41% kmem_cache_alloc_noprof
- 43.89% ___slab_alloc
+ 41.16% get_from_any_partial
0.91% get_from_partial_node
+ 0.87% alloc_from_new_slab
+ 0.65% allocate_slab
- 44.70% 0.21% io_uring [k] mempool_alloc_noprof
- 44.49% mempool_alloc_noprof
- 44.43% kmem_cache_alloc_noprof
- 43.90% ___slab_alloc
+ 41.18% get_from_any_partial
0.90% get_from_partial_node
+ 0.87% alloc_from_new_slab
+ 0.65% allocate_slab
+ 41.23% 0.10% io_uring [k] get_from_any_partial
+ 40.82% 0.48% io_uring [k] __raw_spin_lock_irqsave
- 40.75% 0.20% io_uring [k] get_from_partial_node
- 40.56% get_from_partial_node
- 38.83% __raw_spin_lock_irqsave
38.65% native_queued_spin_lock_slowpath
Analysis
========
The ublk null target workload exposes a cross-CPU slab allocation
pattern: bios are allocated on the io_uring submitter CPU during block
layer submission, but freed on a different CPU, the ublk daemon thread
that runs the completion via io_uring_cmd_complete_in_task() task work.
The completion CPU stays in the same LLC or NUMA node as the
submission CPU.
This cross-CPU alloc/free pattern is not unique to ublk. The block
layer's default rq_affinity=1 setting completes requests on a CPU
sharing LLC with the submission CPU, which similarly causes bio freeing
on a different CPU than allocation. The ublk null target simply makes
this pattern more pronounced and measurable because all overhead is in
the bio alloc/free path with no actual I/O.
**The following is from AI, just for reference**
The result is that the allocating CPU's per-CPU slab caches are
continuously drained without being replenished by local frees. The bio
layer's own per-CPU cache (bio_alloc_cache) suffers the same mismatch:
freed bios go to the completion CPU's cache via bio_put_percpu_cache(),
leaving the submitter CPUs' caches empty and falling through to
mempool_alloc() -> kmem_cache_alloc() -> SLUB slow path.
In v6.19, SLUB handled this with a 3-tier allocation hierarchy:
Tier 1: CPU slab freelist lock-free (cmpxchg)
Tier 2: CPU partial slab list lock-free (per-CPU local_lock)
Tier 3: Node partial list kmem_cache_node->list_lock
The CPU partial slab list (Tier 2) was the critical buffer. It was
populated during __slab_free() -> put_cpu_partial() and provided a
lock-free pool of partial slabs per CPU. Even when the CPU slab was
exhausted, the CPU partial list could supply more slabs without
touching any shared lock.
The sheaves architecture replaces this with a 2-tier hierarchy:
Tier 1: Per-CPU sheaf lock-free (local_lock)
Tier 2: Node partial list kmem_cache_node->list_lock
The intermediate lock-free tier is gone. When the per-CPU sheaf is
empty and the spare sheaf is also empty, every refill must go through
the node partial list, requiring kmem_cache_node->list_lock. With 16
CPUs simultaneously allocating bios and all hitting empty sheaves, this
creates a thundering herd on the node list_lock.
When the local node's partial list is also depleted (objects freed on
remote nodes accumulate there instead), get_from_any_partial() kicks in
to search other NUMA nodes, compounding the contention with cross-NUMA
list_lock acquisition. This explains the 41% in get_from_any_partial ->
native_queued_spin_lock_slowpath seen in the profile.
The mitigation in 40fd0acc45d0 ("slub: avoid list_lock contention from
__refill_objects_any()") uses spin_trylock for cross-NUMA refill, but
does not address the fundamental architectural issue: the missing
lock-free intermediate caching tier that the CPU partial list provided.
Thanks,
Ming