* [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
From: Ming Lei @ 2026-02-24 2:52 UTC (permalink / raw)
To: Vlastimil Babka, Andrew Morton
Cc: ming.lei, linux-mm, linux-kernel, linux-block
Hello Vlastimil and MM guys,
The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
performance regression for workloads with persistent cross-CPU
alloc/free patterns. ublk null target benchmark IOPS drops
significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
drop).
Bisecting within the sheaves series is blocked by a kernel panic at
17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
paths"), so the exact first bad commit could not be identified.
Reproducer
==========
Hardware: NUMA machine with >= 32 CPUs
Kernel: v7.0-rc (with slab/for-7.0/sheaves merged)
# build kublk selftest
make -C tools/testing/selftests/ublk/
# create ublk null target device with 16 queues
tools/testing/selftests/ublk/kublk add -t null -q 16
# run fio/t/io_uring benchmark: 16 jobs, 20 seconds, non-polled
taskset -c 0-31 fio/t/io_uring -p0 -n 16 -r 20 /dev/ublkb0
# cleanup
tools/testing/selftests/ublk/kublk del -n 0
Good: v6.19 (and 41f1a08645ab, the mainline parent of the slab merge)
Bad: 815c8e35511d (Merge branch 'slab/for-7.0/sheaves' into slab/for-next)
perf profile (bad kernel)
=========================
~47% of CPU time is spent in bio allocation hitting the SLUB slow path,
with massive spinlock contention on the node partial list lock:
+ 47.65% 1.21% io_uring [k] bio_alloc_bioset
- 44.87% 0.45% io_uring [k] kmem_cache_alloc_noprof
- 44.41% kmem_cache_alloc_noprof
- 43.89% ___slab_alloc
+ 41.16% get_from_any_partial
0.91% get_from_partial_node
+ 0.87% alloc_from_new_slab
+ 0.65% allocate_slab
- 44.70% 0.21% io_uring [k] mempool_alloc_noprof
- 44.49% mempool_alloc_noprof
- 44.43% kmem_cache_alloc_noprof
- 43.90% ___slab_alloc
+ 41.18% get_from_any_partial
0.90% get_from_partial_node
+ 0.87% alloc_from_new_slab
+ 0.65% allocate_slab
+ 41.23% 0.10% io_uring [k] get_from_any_partial
+ 40.82% 0.48% io_uring [k] __raw_spin_lock_irqsave
- 40.75% 0.20% io_uring [k] get_from_partial_node
- 40.56% get_from_partial_node
- 38.83% __raw_spin_lock_irqsave
38.65% native_queued_spin_lock_slowpath
Analysis
========
The ublk null target workload exposes a cross-CPU slab allocation
pattern: bios are allocated on the io_uring submitter CPU during block
layer submission, but freed on a different CPU — the ublk daemon thread
that runs the completion via io_uring_cmd_complete_in_task() task work.
The completion CPU stays in the same LLC or NUMA node as the submission CPU.
This cross-CPU alloc/free pattern is not unique to ublk. The block
layer's default rq_affinity=1 setting completes requests on a CPU
sharing LLC with the submission CPU, which similarly causes bio freeing
on a different CPU than allocation. The ublk null target simply makes
this pattern more pronounced and measurable because all overhead is in
the bio alloc/free path with no actual I/O.
**The following is from AI, just for reference**
The result is that the allocating CPU's per-CPU slab caches are
continuously drained without being replenished by local frees. The bio
layer's own per-CPU cache (bio_alloc_cache) suffers the same mismatch:
freed bios go to the completion CPU's cache via bio_put_percpu_cache(),
leaving the submitter CPUs' caches empty and falling through to
mempool_alloc() -> kmem_cache_alloc() -> SLUB slow path.
In v6.19, SLUB handled this with a 3-tier allocation hierarchy:
Tier 1: CPU slab freelist lock-free (cmpxchg)
Tier 2: CPU partial slab list lock-free (per-CPU local_lock)
Tier 3: Node partial list kmem_cache_node->list_lock
The CPU partial slab list (Tier 2) was the critical buffer. It was
populated during __slab_free() -> put_cpu_partial() and provided a
lock-free pool of partial slabs per CPU. Even when the CPU slab was
exhausted, the CPU partial list could supply more slabs without
touching any shared lock.
The sheaves architecture replaces this with a 2-tier hierarchy:
Tier 1: Per-CPU sheaf lock-free (local_lock)
Tier 2: Node partial list kmem_cache_node->list_lock
The intermediate lock-free tier is gone. When the per-CPU sheaf is
empty and the spare sheaf is also empty, every refill must go through
the node partial list, requiring kmem_cache_node->list_lock. With 16
CPUs simultaneously allocating bios and all hitting empty sheaves, this
creates a thundering herd on the node list_lock.
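To make this concrete, here is a crude userspace model of the pattern
(plain pthreads; every name, size and count below is made up for
illustration, and none of it is actual kernel or block layer code). One
"submitter" thread only allocates through a small per-thread cache
standing in for a sheaf, one "completer" thread only frees into its own
cache, and every cache miss takes a single shared lock standing in for
the node list_lock:

/*
 * Crude userspace model -- NOT kernel code; names and sizes are made up.
 * The submitter thread only allocates through its per-thread cache, the
 * completer thread only frees into its own cache, and every cache miss
 * takes one shared lock standing in for kmem_cache_node->list_lock.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define CACHE_SIZE  32          /* stand-in for the sheaf capacity */
#define RING_SIZE   1024
#define OPS         1000000

struct thread_cache {
    void *objs[CACHE_SIZE];
    int nr;
};

static pthread_mutex_t slow_lock = PTHREAD_MUTEX_INITIALIZER; /* "list_lock" */
static long fast_allocs, slow_allocs;

/* objects in flight from the submitter to the completer */
static void *ring[RING_SIZE];
static unsigned int head, tail;
static pthread_mutex_t ring_lock = PTHREAD_MUTEX_INITIALIZER;

static void *obj_alloc(struct thread_cache *c)
{
    if (c->nr) {                        /* fast path: local cache hit */
        fast_allocs++;
        return c->objs[--c->nr];
    }
    pthread_mutex_lock(&slow_lock);     /* slow path: shared, contended lock */
    slow_allocs++;
    pthread_mutex_unlock(&slow_lock);
    return malloc(64);
}

static void obj_free(struct thread_cache *c, void *obj)
{
    if (c->nr < CACHE_SIZE)             /* frees refill *this* thread's cache */
        c->objs[c->nr++] = obj;
    else
        free(obj);
}

static void *submitter(void *arg)
{
    struct thread_cache cache = { .nr = 0 };

    for (long i = 0; i < OPS; i++) {
        void *obj = obj_alloc(&cache);
        for (;;) {                      /* hand the object over for completion */
            pthread_mutex_lock(&ring_lock);
            if (head - tail < RING_SIZE) {
                ring[head++ % RING_SIZE] = obj;
                pthread_mutex_unlock(&ring_lock);
                break;
            }
            pthread_mutex_unlock(&ring_lock);
        }
    }
    return NULL;
}

static void *completer(void *arg)
{
    struct thread_cache cache = { .nr = 0 };
    long done = 0;

    while (done < OPS) {
        void *obj = NULL;

        pthread_mutex_lock(&ring_lock);
        if (tail != head)
            obj = ring[tail++ % RING_SIZE];
        pthread_mutex_unlock(&ring_lock);
        if (obj) {
            obj_free(&cache, obj);      /* never refills the submitter's cache */
            done++;
        }
    }
    return NULL;
}

int main(void)
{
    pthread_t s, c;

    pthread_create(&s, NULL, submitter, NULL);
    pthread_create(&c, NULL, completer, NULL);
    pthread_join(s, NULL);
    pthread_join(c, NULL);
    printf("fast allocs: %ld, slow (locked) allocs: %ld\n",
           fast_allocs, slow_allocs);
    return 0;
}

In this model the submitter's cache never refills, because every free
lands in the completer's cache, so effectively every allocation takes the
shared lock; with 16 submitter threads that lock turns into the
thundering herd described above. Crude as it is, it matches the shape of
the perf profile.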
When the local node's partial list is also depleted (objects freed on
remote nodes accumulate there instead), get_from_any_partial() kicks in
to search other NUMA nodes, compounding the contention with cross-NUMA
list_lock acquisition — explaining the 41% in get_from_any_partial ->
native_queued_spin_lock_slowpath seen in the profile.
The mitigation in 40fd0acc45d0 ("slub: avoid list_lock contention from
__refill_objects_any()") uses spin_trylock for cross-NUMA refill, but
does not address the fundamental architectural issue: the missing
lock-free intermediate caching tier that the CPU partial list provided.
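For reference, the trylock-and-skip idea in that commit has roughly the
following shape, shown here as a userspace analogue (pthread mutexes and
made-up structures, not the actual kernel code):

/*
 * Userspace analogue of trylock-and-skip refill -- illustrative only.
 */
#include <pthread.h>

#define NR_NODES 4

struct node_pool {
    pthread_mutex_t lock;       /* stands in for kmem_cache_node->list_lock */
    int nr_free;                /* objects sitting on this node's partial list */
};

static struct node_pool nodes[NR_NODES];

/* Grab up to 'want' objects from any node, but never spin on a contended
 * lock: a busy node is simply skipped instead of being waited for. */
static int refill_from_any_node(int want)
{
    int got = 0;

    for (int n = 0; n < NR_NODES && got < want; n++) {
        if (pthread_mutex_trylock(&nodes[n].lock) != 0)
            continue;           /* lock busy: skip this node */
        while (nodes[n].nr_free > 0 && got < want) {
            nodes[n].nr_free--;
            got++;
        }
        pthread_mutex_unlock(&nodes[n].lock);
    }
    return got;                 /* 0 means: fall back to allocating a new slab */
}

int main(void)
{
    for (int n = 0; n < NR_NODES; n++)
        pthread_mutex_init(&nodes[n].lock, NULL);
    nodes[2].nr_free = 8;       /* pretend only node 2 has spare objects */
    return refill_from_any_node(4) == 4 ? 0 : 1;
}

Skipping busy nodes avoids the worst spinning, but it does not bring back
a per-CPU buffer, so every refill still has to walk shared, locked lists.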
Thanks,
Ming
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
From: Harry Yoo @ 2026-02-24 5:00 UTC (permalink / raw)
To: Ming Lei
Cc: Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, Hao Li, surenb
On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> Hello Vlastimil and MM guys,
Hi Ming, thanks for the report!
> The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
> 'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
> performance regression for workloads with persistent cross-CPU
> alloc/free patterns. ublk null target benchmark IOPS drops
> significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
> drop).
>
> Bisecting within the sheaves series is blocked by a kernel panic at
> 17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
> paths"), so the exact first bad commit could not be identified.
Ouch. Why did it crash?
> Reproducer
> ==========
>
> Hardware: NUMA machine with >= 32 CPUs
> Kernel: v7.0-rc (with slab/for-7.0/sheaves merged)
>
> # build kublk selftest
> make -C tools/testing/selftests/ublk/
>
> # create ublk null target device with 16 queues
> tools/testing/selftests/ublk/kublk add -t null -q 16
>
> # run fio/t/io_uring benchmark: 16 jobs, 20 seconds, non-polled
> taskset -c 0-31 fio/t/io_uring -p0 -n 16 -r 20 /dev/ublkb0
>
> # cleanup
> tools/testing/selftests/ublk/kublk del -n 0
>
> Good: v6.19 (and 41f1a08645ab, the mainline parent of the slab merge)
> Bad: 815c8e35511d (Merge branch 'slab/for-7.0/sheaves' into slab/for-next)
Thanks for such detailed steps to reproduce :)
> perf profile (bad kernel)
> =========================
>
> ~47% of CPU time is spent in bio allocation hitting the SLUB slow path,
> with massive spinlock contention on the node partial list lock:
>
> + 47.65% 1.21% io_uring [k] bio_alloc_bioset
> - 44.87% 0.45% io_uring [k] kmem_cache_alloc_noprof
> - 44.41% kmem_cache_alloc_noprof
> - 43.89% ___slab_alloc
> + 41.16% get_from_any_partial
> 0.91% get_from_partial_node
> + 0.87% alloc_from_new_slab
> + 0.65% allocate_slab
> - 44.70% 0.21% io_uring [k] mempool_alloc_noprof
> - 44.49% mempool_alloc_noprof
> - 44.43% kmem_cache_alloc_noprof
> - 43.90% ___slab_alloc
> + 41.18% get_from_any_partial
> 0.90% get_from_partial_node
> + 0.87% alloc_from_new_slab
> + 0.65% allocate_slab
> + 41.23% 0.10% io_uring [k] get_from_any_partial
> + 40.82% 0.48% io_uring [k] __raw_spin_lock_irqsave
> - 40.75% 0.20% io_uring [k] get_from_partial_node
> - 40.56% get_from_partial_node
> - 38.83% __raw_spin_lock_irqsave
> 38.65% native_queued_spin_lock_slowpath
That's pretty severe contention. Interestingly, the profile shows
severe contention on the alloc path, but I don't see the free path here.
I wonder why only the alloc path is suffering, hmm...
Anyway, I think there may be two pieces contributing to this contention:
Part 1) We probably made the slow path a bigger portion of allocations,
by caching a smaller number of objects per CPU
after transitioning to sheaves.
Part 2) We probably made the slow path itself much slower.
We need to investigate those parts separately.
Regarding Part 1:
# Point 1. The CPU slab was not considered in the sheaf capacity calculation
calculate_sheaf_capacity() does not take into account that the CPU slab
was also cached per CPU. Shouldn't we add oo_objects(s->oo) to the existing
calculation, so we cache a number of objects similar to what the CPU slab +
percpu partial slab list layers previously provided?
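Roughly this kind of adjustment, I mean (a back-of-the-envelope sketch
with made-up numbers, not the actual calculate_sheaf_capacity() code):

#include <stdio.h>

int main(void)
{
    /* made-up numbers, only to show the shape of the suggestion */
    unsigned int objs_per_slab = 32;    /* stand-in for oo_objects(s->oo)            */
    unsigned int current_cap   = 64;    /* whatever calculate_sheaf_capacity() picks */
    unsigned int suggested_cap = current_cap + objs_per_slab; /* cover the CPU slab too */

    printf("sheaf capacity: %u -> %u objects per CPU\n", current_cap, suggested_cap);
    return 0;
}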
# Point 2. SLUB no longer relies on the "slabs are half-full" assumption,
# and that probably means we're caching fewer objects per CPU.
Because SLUB previously assumed slabs are half full when calculating
the number of slabs to cache per CPU, it could actually cache up to twice
as many objects as intended when slabs are mostly empty.
Because sheaves track the number of objects precisely, that inaccuracy
is gone. If the workload was previously benefiting from it, sheaves can
make each CPU cache fewer objects than the percpu slab caching layer did.
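With made-up numbers (32 objects per slab, an intended per-CPU cache of
64 objects), the effect looks like this:

#include <stdio.h>

int main(void)
{
    /* made-up numbers, only to illustrate the "half-full" doubling */
    unsigned int objs_per_slab = 32;
    unsigned int intended_objs = 64;    /* what the old heuristic aimed to cache per CPU */

    /* the percpu partial list was sized in slabs, assuming each slab is
     * half full: roughly 2 * intended / objs_per_slab = 4 slabs ...       */
    unsigned int slabs_kept = 2 * intended_objs / objs_per_slab;

    /* ... which, when those slabs are mostly free, can really hold
     * 4 * 32 = 128 objects -- twice the intended number. Sheaves count
     * objects exactly, so that accidental 2x headroom is gone.            */
    unsigned int actual_max_objs = slabs_kept * objs_per_slab;

    printf("intended %u, slabs kept %u, up to %u objects actually cached\n",
           intended_objs, slabs_kept, actual_max_objs);
    return 0;
}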
Anyway, I guess we need to check how many objects are actually
cached per CPU w/ and w/o sheaves, during the benchmark.
After making sure the number of objects cached per CPU is the same as
before, we could further investigate how much Part 2 plays into it.
Slightly off-topic, by the way: slab currently doesn't let system admins
set a custom sheaf_capacity. Instead, calculate_sheaf_capacity() sets
the default capacity. I think we need to allow sys admins to set a custom
sheaf_capacity in the very near future.
> Analysis
> ========
>
> The ublk null target workload exposes a cross-CPU slab allocation
> pattern: bios are allocated on the io_uring submitter CPU during block
> layer submission, but freed on a different CPU — the ublk daemon thread
> that runs the completion via io_uring_cmd_complete_in_task() task work.
> The completion CPU stays in the same LLC or NUMA node as the submission CPU.
Ok, so a submitter CPU keeps allocating objects, while a completion CPU
keeps freeing objects.
> This cross-CPU alloc/free pattern is not unique to ublk. The block
> layer's default rq_affinity=1 setting completes requests on a CPU
> sharing LLC with the submission CPU, which similarly causes bio freeing
> on a different CPU than allocation. The ublk null target simply makes
> this pattern more pronounced and measurable because all overhead is in
> the bio alloc/free path with no actual I/O.
>
> **The following is from AI, just for reference**
>
> The result is that the allocating CPU's per-CPU slab caches are
> continuously drained without being replenished by local frees. The bio
> layer's own per-CPU cache (bio_alloc_cache) suffers the same mismatch:
> freed bios go to the completion CPU's cache via bio_put_percpu_cache(),
> leaving the submitter CPUs' caches empty and falling through to
> mempool_alloc() -> kmem_cache_alloc() -> SLUB slow path.
Ok.
> In v6.19, SLUB handled this with a 3-tier allocation hierarchy:
>
> Tier 1: CPU slab freelist lock-free (cmpxchg)
> Tier 2: CPU partial slab list lock-free (per-CPU local_lock)
> Tier 3: Node partial list kmem_cache_node->list_lock
>
> The CPU partial slab list (Tier 2) was the critical buffer. It was
> populated during __slab_free() -> put_cpu_partial() and provided a
> lock-free pool of partial slabs per CPU. Even when the CPU slab was
> exhausted, the CPU partial list could supply more slabs without
> touching any shared lock.
Well, the sheaves layer is supposed to provide a similar lock-free pool
of objects per CPU. The percpu slab layer was supposed to cache a certain
number of objects (from multiple slabs), which is translated to the
sheaf capacity now.
> The sheaves architecture replaces this with a 2-tier hierarchy:
>
> Tier 1: Per-CPU sheaf lock-free (local_lock)
> Tier 2: Node partial list kmem_cache_node->list_lock
>
> The intermediate lock-free tier is gone. When the per-CPU sheaf is
> empty and the spare sheaf is also empty, every refill must go through
> the node partial list, requiring kmem_cache_node->list_lock. With 16
> CPUs simultaneously allocating bios and all hitting empty sheaves, this
> creates a thundering herd on the node list_lock.
>
> When the local node's partial list is also depleted (objects freed on
> remote nodes accumulate there instead), get_from_any_partial() kicks in
> to search other NUMA nodes, compounding the contention with cross-NUMA
> list_lock acquisition — explaining the 41% in get_from_any_partial ->
> native_queued_spin_lock_slowpath seen in the profile.
Again, the sheaves layer is supposed to cache a similar number of
objects previously covered by Tier 1 + Tier 2... oh, wait.
The sheaf capacity calculation logic does not take "Tier 1 CPU slab
freelist" into account.
> The mitigation in 40fd0acc45d0 ("slub: avoid list_lock contention from
> __refill_objects_any()") uses spin_trylock for cross-NUMA refill, but
> does not address the fundamental architectural issue: the missing
> lock-free intermediate caching tier that the CPU partial list provided.
>
> Thanks,
> Ming
--
Cheers,
Harry / Hyeonggon
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
From: Hao Li @ 2026-02-24 6:51 UTC (permalink / raw)
To: Ming Lei
Cc: Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, Harry Yoo
On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> Hello Vlastimil and MM guys,
>
> The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
> 'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
> performance regression for workloads with persistent cross-CPU
> alloc/free patterns. ublk null target benchmark IOPS drops
> significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
> drop).
Thanks for testing.
>
> Bisecting within the sheaves series is blocked by a kernel panic at
> 17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
> paths"),
As Harry said, this is odd. Could you post crash logs?
> so the exact first bad commit could not be identified.
Based on my earlier test results, this performance regression (more precisely, I
suspect it is an expected return to the previous baseline - see below) was likely
introduced by these two patches:
slab: add optimized sheaf refill from partial list
slab: remove SLUB_CPU_PARTIAL
https://lore.kernel.org/linux-mm/imzzlzuzjmlkhxc7hszxh5ba7jksvqcieg5rzyryijkkdhai5q@l2t4ye5quozb/
>
> Reproducer
> ==========
>
[...]
>
> The result is that the allocating CPU's per-CPU slab caches are
> continuously drained without being replenished by local frees. The bio
> layer's own per-CPU cache (bio_alloc_cache) suffers the same mismatch:
> freed bios go to the completion CPU's cache via bio_put_percpu_cache(),
> leaving the submitter CPUs' caches empty and falling through to
> mempool_alloc() -> kmem_cache_alloc() -> SLUB slow path.
>
> In v6.19, SLUB handled this with a 3-tier allocation hierarchy:
>
> Tier 1: CPU slab freelist lock-free (cmpxchg)
> Tier 2: CPU partial slab list lock-free (per-CPU local_lock)
> Tier 3: Node partial list kmem_cache_node->list_lock
>
> The CPU partial slab list (Tier 2) was the critical buffer. It was
> populated during __slab_free() -> put_cpu_partial() and provided a
> lock-free pool of partial slabs per CPU. Even when the CPU slab was
> exhausted, the CPU partial list could supply more slabs without
> touching any shared lock.
>
> The sheaves architecture replaces this with a 2-tier hierarchy:
>
> Tier 1: Per-CPU sheaf lock-free (local_lock)
> Tier 2: Node partial list kmem_cache_node->list_lock
>
> The intermediate lock-free tier is gone. When the per-CPU sheaf is
> empty and the spare sheaf is also empty, every refill must go through
> the node partial list, requiring kmem_cache_node->list_lock. With 16
> CPUs simultaneously allocating bios and all hitting empty sheaves, this
> creates a thundering herd on the node list_lock.
>
> When the local node's partial list is also depleted (objects freed on
> remote nodes accumulate there instead), get_from_any_partial() kicks in
> to search other NUMA nodes, compounding the contention with cross-NUMA
> list_lock acquisition — explaining the 41% in get_from_any_partial ->
> native_queued_spin_lock_slowpath seen in the profile.
The purpose of introducing sheaves was to fully replace the percpu partial slabs
mechanism with sheaves. During this process, we first added the sheaves caching
layer and only later removed the percpu partial slabs layer, so it's expected
that performance could first improve and then return to the previous level.
Would you mind also comparing against a baseline with "no sheaves at all" (e.g.
commit `9d4e6ab865c4`) versus "only the sheaves layer exists" (i.e. commit
`815c8e35511d`)? If those two results are close, then the ~64% performance
regression we're currently discussing might be better interpreted as returning
to the previous baseline (i.e. a reversion), rather than a true regression.
The link below contains my previous test results. According to will-it-scale,
the performance of "no sheaves at all" and "only the sheaves layer exists" is
close:
https://lore.kernel.org/linux-mm/pdmjsvpkl5nsntiwfwguplajq27ak3xpboq3ab77zrbu763pq7@la3hyiqigpir/
--
Thanks,
Hao
>
> The mitigation in 40fd0acc45d0 ("slub: avoid list_lock contention from
> __refill_objects_any()") uses spin_trylock for cross-NUMA refill, but
> does not address the fundamental architectural issue: the missing
> lock-free intermediate caching tier that the CPU partial list provided.
>
> Thanks,
> Ming
>
>
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
From: Harry Yoo @ 2026-02-24 7:10 UTC (permalink / raw)
To: Hao Li
Cc: Ming Lei, Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block
On Tue, Feb 24, 2026 at 02:51:26PM +0800, Hao Li wrote:
> On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> > Reproducer
> > ==========
> >
> [...]
> >
> > The result is that the allocating CPU's per-CPU slab caches are
> > continuously drained without being replenished by local frees. The bio
> > layer's own per-CPU cache (bio_alloc_cache) suffers the same mismatch:
> > freed bios go to the completion CPU's cache via bio_put_percpu_cache(),
> > leaving the submitter CPUs' caches empty and falling through to
> > mempool_alloc() -> kmem_cache_alloc() -> SLUB slow path.
> >
> > In v6.19, SLUB handled this with a 3-tier allocation hierarchy:
> >
> > Tier 1: CPU slab freelist lock-free (cmpxchg)
> > Tier 2: CPU partial slab list lock-free (per-CPU local_lock)
> > Tier 3: Node partial list kmem_cache_node->list_lock
> >
> > The CPU partial slab list (Tier 2) was the critical buffer. It was
> > populated during __slab_free() -> put_cpu_partial() and provided a
> > lock-free pool of partial slabs per CPU. Even when the CPU slab was
> > exhausted, the CPU partial list could supply more slabs without
> > touching any shared lock.
> >
> > The sheaves architecture replaces this with a 2-tier hierarchy:
> >
> > Tier 1: Per-CPU sheaf lock-free (local_lock)
> > Tier 2: Node partial list kmem_cache_node->list_lock
> >
> > The intermediate lock-free tier is gone. When the per-CPU sheaf is
> > empty and the spare sheaf is also empty, every refill must go through
> > the node partial list, requiring kmem_cache_node->list_lock. With 16
> > CPUs simultaneously allocating bios and all hitting empty sheaves, this
> > creates a thundering herd on the node list_lock.
> >
> > When the local node's partial list is also depleted (objects freed on
> > remote nodes accumulate there instead), get_from_any_partial() kicks in
> > to search other NUMA nodes, compounding the contention with cross-NUMA
> > list_lock acquisition — explaining the 41% in get_from_any_partial ->
> > native_queued_spin_lock_slowpath seen in the profile.
>
> The purpose of introducing sheaves was to fully replace the percpu partial slabs
> mechanism with sheaves. During this process, we first added the sheaves caching
> layer and only later removed the percpu partial slabs layer, so it's expected
> that performance could first improve and then return to the previous level.
There's one difference here; you used the will-it-scale mmap2 test case,
which involves the maple tree node and vm_area_struct caches that already
have sheaves enabled in v6.19.
Ming's benchmark, on the other hand, stresses the bio-<size> caches.
Since other caches don't have sheaves in v6.19, they're not supposed to
see a performance gain from having an additional sheaves layer on top of
the cpu slab + percpu partial slab list.
> Would you mind also comparing against a baseline with "no sheaves at all" (e.g.
> commit `9d4e6ab865c4`) versus "only the sheaves layer exists" (i.e. commit
> `815c8e35511d`)? If those two results are close, then the ~64% performance
> regression we're currently discussing might be better interpreted as returning
> to the previous baseline (i.e. a reversion), rather than a true regression.
>
> The link below contains my previous test results. According to will-it-scale,
> the performance of "no sheaves at all" and "only the sheaves layer exists" is
> close:
> https://lore.kernel.org/linux-mm/pdmjsvpkl5nsntiwfwguplajq27ak3xpboq3ab77zrbu763pq7@la3hyiqigpir/
--
Cheers,
Harry / Hyeonggon
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
From: Hao Li @ 2026-02-24 7:41 UTC (permalink / raw)
To: Harry Yoo
Cc: Ming Lei, Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block
On Tue, Feb 24, 2026 at 04:10:43PM +0900, Harry Yoo wrote:
> On Tue, Feb 24, 2026 at 02:51:26PM +0800, Hao Li wrote:
> > On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> > > Reproducer
> > > ==========
> > >
> > [...]
> > >
> > > The result is that the allocating CPU's per-CPU slab caches are
> > > continuously drained without being replenished by local frees. The bio
> > > layer's own per-CPU cache (bio_alloc_cache) suffers the same mismatch:
> > > freed bios go to the completion CPU's cache via bio_put_percpu_cache(),
> > > leaving the submitter CPUs' caches empty and falling through to
> > > mempool_alloc() -> kmem_cache_alloc() -> SLUB slow path.
> > >
> > > In v6.19, SLUB handled this with a 3-tier allocation hierarchy:
> > >
> > > Tier 1: CPU slab freelist lock-free (cmpxchg)
> > > Tier 2: CPU partial slab list lock-free (per-CPU local_lock)
> > > Tier 3: Node partial list kmem_cache_node->list_lock
> > >
> > > The CPU partial slab list (Tier 2) was the critical buffer. It was
> > > populated during __slab_free() -> put_cpu_partial() and provided a
> > > lock-free pool of partial slabs per CPU. Even when the CPU slab was
> > > exhausted, the CPU partial list could supply more slabs without
> > > touching any shared lock.
> > >
> > > The sheaves architecture replaces this with a 2-tier hierarchy:
> > >
> > > Tier 1: Per-CPU sheaf lock-free (local_lock)
> > > Tier 2: Node partial list kmem_cache_node->list_lock
> > >
> > > The intermediate lock-free tier is gone. When the per-CPU sheaf is
> > > empty and the spare sheaf is also empty, every refill must go through
> > > the node partial list, requiring kmem_cache_node->list_lock. With 16
> > > CPUs simultaneously allocating bios and all hitting empty sheaves, this
> > > creates a thundering herd on the node list_lock.
> > >
> > > When the local node's partial list is also depleted (objects freed on
> > > remote nodes accumulate there instead), get_from_any_partial() kicks in
> > > to search other NUMA nodes, compounding the contention with cross-NUMA
> > > list_lock acquisition — explaining the 41% in get_from_any_partial ->
> > > native_queued_spin_lock_slowpath seen in the profile.
> >
> > The purpose of introducing sheaves was to fully replace the percpu partial slabs
> > mechanism with sheaves. During this process, we first added the sheaves caching
> > layer and only later removed the percpu partial slabs layer, so it's expected
> > that performance could first improve and then return to the previous level.
>
> There's one difference here; you used will-it-scale mmap2 test case that
> involves maple tree node and vm_area_struct cache that already has
> sheaves enabled in v6.19.
>
> And Ming's benchmark stresses bio-<size> caches.
>
> Since other caches don't have sheaves in v6.19, they're not supposed to
> have performance gain by having additional sheaves layer on top of cpu
> slab + percpu partial slab list.
Oh, yes, you're right. That distinction is important!
I think I've gotten a bit stuck in a fixed way of thinking...
Thanks for pointing it out!
>
> > Would you mind also comparing against a baseline with "no sheaves at all" (e.g.
> > commit `9d4e6ab865c4`) versus "only the sheaves layer exists" (i.e. commit
> > `815c8e35511d`)? If those two results are close, then the ~64% performance
> > regression we're currently discussing might be better interpreted as returning
> > to the previous baseline (i.e. a reversion), rather than a true regression.
> >
> > The link below contains my previous test results. According to will-it-scale,
> > the performance of "no sheaves at all" and "only the sheaves layer exists" is
> > close:
> > https://lore.kernel.org/linux-mm/pdmjsvpkl5nsntiwfwguplajq27ak3xpboq3ab77zrbu763pq7@la3hyiqigpir/
>
> --
> Cheers,
> Harry / Hyeonggon