linux-mm.kvack.org archive mirror
* [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
@ 2026-02-24  2:52 Ming Lei
  2026-02-24  5:00 ` Harry Yoo
  2026-02-24  6:51 ` Hao Li
  0 siblings, 2 replies; 4+ messages in thread
From: Ming Lei @ 2026-02-24  2:52 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton
  Cc: ming.lei, linux-mm, linux-kernel, linux-block

Hello Vlastimil and MM guys,

The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
performance regression for workloads with persistent cross-CPU
alloc/free patterns. In the ublk null target benchmark, IOPS drops
from ~36M on v6.19 to ~13M (a ~64% drop).

Bisecting within the sheaves series is blocked by a kernel panic at
17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
paths"), so the exact first bad commit could not be identified.

Reproducer
==========

Hardware: NUMA machine with >= 32 CPUs
Kernel:   v7.0-rc (with slab/for-7.0/sheaves merged)

    # build kublk selftest
    make -C tools/testing/selftests/ublk/

    # create ublk null target device with 16 queues
    tools/testing/selftests/ublk/kublk add -t null -q 16

    # run fio/t/io_uring benchmark: 16 jobs, 20 seconds, non-polled
    taskset -c 0-31 fio/t/io_uring -p0 -n 16 -r 20 /dev/ublkb0

    # cleanup
    tools/testing/selftests/ublk/kublk del -n 0

Good: v6.19 (and 41f1a08645ab, the mainline parent of the slab merge)
Bad:  815c8e35511d (Merge branch 'slab/for-7.0/sheaves' into slab/for-next)

perf profile (bad kernel)
=========================

~47% of CPU time is spent in bio allocation hitting the SLUB slow path,
with massive spinlock contention on the node partial list lock:

+   47.65%     1.21%  io_uring  [k] bio_alloc_bioset
-   44.87%     0.45%  io_uring  [k] kmem_cache_alloc_noprof
   - 44.41% kmem_cache_alloc_noprof
      - 43.89% ___slab_alloc
         + 41.16% get_from_any_partial
           0.91% get_from_partial_node
         + 0.87% alloc_from_new_slab
         + 0.65% allocate_slab
-   44.70%     0.21%  io_uring  [k] mempool_alloc_noprof
   - 44.49% mempool_alloc_noprof
      - 44.43% kmem_cache_alloc_noprof
         - 43.90% ___slab_alloc
            + 41.18% get_from_any_partial
              0.90% get_from_partial_node
            + 0.87% alloc_from_new_slab
            + 0.65% allocate_slab
+   41.23%     0.10%  io_uring  [k] get_from_any_partial
+   40.82%     0.48%  io_uring  [k] __raw_spin_lock_irqsave
-   40.75%     0.20%  io_uring  [k] get_from_partial_node
   - 40.56% get_from_partial_node
      - 38.83% __raw_spin_lock_irqsave
           38.65% native_queued_spin_lock_slowpath

Analysis
========

The ublk null target workload exposes a cross-CPU slab allocation
pattern: bios are allocated on the io_uring submitter CPU during block
layer submission, but freed on a different CPU — the ublk daemon thread
that runs the completion via io_uring_cmd_complete_in_task() task work.
The completion CPU stays within the same LLC or NUMA node as the
submission CPU.

This cross-CPU alloc/free pattern is not unique to ublk. The block
layer's default rq_affinity=1 setting completes requests on a CPU
sharing LLC with the submission CPU, which similarly causes bio freeing
on a different CPU than allocation. The ublk null target simply makes
this pattern more pronounced and measurable because all overhead is in
the bio alloc/free path with no actual I/O.

**The following is from AI, just for reference**

The result is that the allocating CPU's per-CPU slab caches are
continuously drained without being replenished by local frees. The bio
layer's own per-CPU cache (bio_alloc_cache) suffers the same mismatch:
freed bios go to the completion CPU's cache via bio_put_percpu_cache(),
leaving the submitter CPUs' caches empty and falling through to
mempool_alloc() -> kmem_cache_alloc() -> SLUB slow path.
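The drain effect can be sketched with a toy model (not the kernel code:
all names and the cache size are made up) where every allocation happens
on the submit CPU's cache and every free lands in the completion CPU's
cache:

```c
/* Toy model of the per-CPU cache mismatch: one object cache per "CPU",
 * allocations on the submitter, frees on the completer. */
#define CACHE_CAP 32

struct percpu_cache {
	int nr;               /* objects currently in this CPU's cache */
	long slow_path_hits;  /* allocations that missed the cache */
};

/* Allocate from this CPU's cache; a miss stands in for the
 * fall-through to mempool_alloc()/SLUB in the real bio path. */
static int cache_alloc(struct percpu_cache *c)
{
	if (c->nr > 0) {
		c->nr--;
		return 1;
	}
	c->slow_path_hits++;
	return 0;
}

/* Free into the *freeing* CPU's cache, as bio_put_percpu_cache()
 * effectively does in this workload. */
static void cache_free(struct percpu_cache *c)
{
	if (c->nr < CACHE_CAP)
		c->nr++;
}
```

Driving this with 1000 alloc/free pairs leaves the submitter's cache
permanently empty after the first CACHE_CAP allocations (968 of 1000
allocations miss), while the completion CPU's cache sits full and
unused.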

In v6.19, SLUB handled this with a 3-tier allocation hierarchy:

  Tier 1: CPU slab freelist         lock-free (cmpxchg)
  Tier 2: CPU partial slab list     lock-free (per-CPU local_lock)
  Tier 3: Node partial list         kmem_cache_node->list_lock

The CPU partial slab list (Tier 2) was the critical buffer. It was
populated during __slab_free() -> put_cpu_partial() and provided a
lock-free pool of partial slabs per CPU. Even when the CPU slab was
exhausted, the CPU partial list could supply more slabs without
touching any shared lock.
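As a rough sketch of that ordering (toy counters, not the real SLUB data
structures; OBJS_PER_SLAB and the struct layout are invented), only the
last tier touches shared state:

```c
/* Toy model of the v6.19 three-tier allocation order. */
#define OBJS_PER_SLAB 8

struct cpu_state {
	int freelist_objs;    /* tier 1: cpu slab freelist (lock-free) */
	int partial_slabs;    /* tier 2: cpu partial list (lock-free) */
};

struct node_state {
	int partial_slabs;    /* tier 3: node partial list */
	long list_lock_taken; /* how often the shared lock was needed */
};

static int slab_alloc(struct cpu_state *c, struct node_state *n)
{
	if (c->freelist_objs > 0) {           /* tier 1 */
		c->freelist_objs--;
		return 1;
	}
	if (c->partial_slabs > 0) {           /* tier 2: no shared lock */
		c->partial_slabs--;
		c->freelist_objs = OBJS_PER_SLAB - 1;
		return 1;
	}
	n->list_lock_taken++;                 /* tier 3: list_lock */
	if (n->partial_slabs > 0) {
		n->partial_slabs--;
		c->freelist_objs = OBJS_PER_SLAB - 1;
		return 1;
	}
	return 0;                             /* would allocate a new slab */
}
```

With an empty freelist but four slabs of eight objects on the CPU
partial list, 32 consecutive allocations complete without the node lock;
only the 33rd has to take it.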

The sheaves architecture replaces this with a 2-tier hierarchy:

  Tier 1: Per-CPU sheaf             lock-free (local_lock)
  Tier 2: Node partial list         kmem_cache_node->list_lock

The intermediate lock-free tier is gone. When the per-CPU sheaf is
empty and the spare sheaf is also empty, every refill must go through
the node partial list, requiring kmem_cache_node->list_lock. With 16
CPUs simultaneously allocating bios and all hitting empty sheaves, this
creates a thundering herd on the node list_lock.
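The same toy-model style makes the difference visible (again invented
names and a made-up SHEAF_CAP, not the actual sheaf code): once the main
and spare sheaves are empty and local frees never refill them, every
batch refill goes through the node lock.

```c
/* Toy model of the two-tier sheaf order under a no-local-frees load. */
#define SHEAF_CAP 32

struct sheaf_cpu {
	int main_objs;   /* tier 1: per-cpu main sheaf */
	int spare_objs;  /* backup sheaf, swapped in when main empties */
};

struct sheaf_node {
	long list_lock_taken;
};

static void sheaf_alloc(struct sheaf_cpu *c, struct sheaf_node *n)
{
	if (c->main_objs > 0) {
		c->main_objs--;
		return;
	}
	if (c->spare_objs > 0) {          /* swap in the spare sheaf */
		c->main_objs = c->spare_objs - 1;
		c->spare_objs = 0;
		return;
	}
	n->list_lock_taken++;             /* tier 2: node list_lock */
	c->main_objs = SHEAF_CAP - 1;     /* batch refill from the node */
}
```

Each simulated CPU burns through its main and spare sheaves (64
allocations) and then takes the node lock once every SHEAF_CAP
allocations; 16 CPUs doing 1000 allocations each would pile several
hundred acquisitions onto one shared lock.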

When the local node's partial list is also depleted (objects freed on
remote nodes accumulate there instead), get_from_any_partial() kicks in
to search other NUMA nodes, compounding the contention with cross-NUMA
list_lock acquisition — explaining the 41% in get_from_any_partial ->
native_queued_spin_lock_slowpath seen in the profile.

The mitigation in 40fd0acc45d0 ("slub: avoid list_lock contention from
__refill_objects_any()") uses spin_trylock for cross-NUMA refill, but
does not address the fundamental architectural issue: the missing
lock-free intermediate caching tier that the CPU partial list provided.
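The shape of that mitigation, as I understand it, is roughly the
following (a hypothetical simplified sketch, not the actual 40fd0acc45d0
code): back off from a contended remote node instead of spinning, at the
cost of having to allocate a fresh slab instead.

```c
/* Sketch of a trylock-style cross-NUMA refill: skip a busy node
 * rather than queue on its list_lock. */
#include <stdbool.h>

struct remote_node {
	bool lock_held;     /* stands in for spin_trylock() failing */
	int partial_slabs;
};

/* Returns 1 if a slab was taken from the remote node, 0 if skipped. */
static int refill_from_remote(struct remote_node *n)
{
	if (n->lock_held)       /* trylock failed: do not wait */
		return 0;
	n->lock_held = true;    /* trylock succeeded */
	int got = 0;
	if (n->partial_slabs > 0) {
		n->partial_slabs--;
		got = 1;
	}
	n->lock_held = false;   /* unlock */
	return got;
}
```

This caps time spent spinning on remote list_locks, but as noted above
it only softens the contention; the lock-free intermediate tier is still
absent.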

Thanks,
Ming



