* [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
@ 2026-02-24 2:52 Ming Lei
2026-02-24 5:00 ` Harry Yoo
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: Ming Lei @ 2026-02-24 2:52 UTC (permalink / raw)
To: Vlastimil Babka, Andrew Morton
Cc: ming.lei, linux-mm, linux-kernel, linux-block
Hello Vlastimil and MM guys,
The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
performance regression for workloads with persistent cross-CPU
alloc/free patterns. ublk null target benchmark IOPS drops
significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
drop).
Bisecting within the sheaves series is blocked by a kernel panic at
17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
paths"), so the exact first bad commit could not be identified.
Reproducer
==========
Hardware: NUMA machine with >= 32 CPUs
Kernel: v7.0-rc (with slab/for-7.0/sheaves merged)
# build kublk selftest
make -C tools/testing/selftests/ublk/
# create ublk null target device with 16 queues
tools/testing/selftests/ublk/kublk add -t null -q 16
# run fio/t/io_uring benchmark: 16 jobs, 20 seconds, non-polled
taskset -c 0-31 fio/t/io_uring -p0 -n 16 -r 20 /dev/ublkb0
# cleanup
tools/testing/selftests/ublk/kublk del -n 0
Good: v6.19 (and 41f1a08645ab, the mainline parent of the slab merge)
Bad: 815c8e35511d (Merge branch 'slab/for-7.0/sheaves' into slab/for-next)
perf profile (bad kernel)
=========================
~47% of CPU time is spent in bio allocation hitting the SLUB slow path,
with massive spinlock contention on the node partial list lock:
+ 47.65% 1.21% io_uring [k] bio_alloc_bioset
- 44.87% 0.45% io_uring [k] kmem_cache_alloc_noprof
- 44.41% kmem_cache_alloc_noprof
- 43.89% ___slab_alloc
+ 41.16% get_from_any_partial
0.91% get_from_partial_node
+ 0.87% alloc_from_new_slab
+ 0.65% allocate_slab
- 44.70% 0.21% io_uring [k] mempool_alloc_noprof
- 44.49% mempool_alloc_noprof
- 44.43% kmem_cache_alloc_noprof
- 43.90% ___slab_alloc
+ 41.18% get_from_any_partial
0.90% get_from_partial_node
+ 0.87% alloc_from_new_slab
+ 0.65% allocate_slab
+ 41.23% 0.10% io_uring [k] get_from_any_partial
+ 40.82% 0.48% io_uring [k] __raw_spin_lock_irqsave
- 40.75% 0.20% io_uring [k] get_from_partial_node
- 40.56% get_from_partial_node
- 38.83% __raw_spin_lock_irqsave
38.65% native_queued_spin_lock_slowpath
Analysis
========
The ublk null target workload exposes a cross-CPU slab allocation
pattern: bios are allocated on the io_uring submitter CPU during block
layer submission, but freed on a different CPU — the ublk daemon thread
that runs the completion via io_uring_cmd_complete_in_task() task work.
The completion CPU stays in the same LLC or NUMA node as the submission CPU.
This cross-CPU alloc/free pattern is not unique to ublk. The block
layer's default rq_affinity=1 setting completes requests on a CPU
sharing LLC with the submission CPU, which similarly causes bio freeing
on a different CPU than allocation. The ublk null target simply makes
this pattern more pronounced and measurable because all overhead is in
the bio alloc/free path with no actual I/O.
**The following is from AI, just for reference**
The result is that the allocating CPU's per-CPU slab caches are
continuously drained without being replenished by local frees. The bio
layer's own per-CPU cache (bio_alloc_cache) suffers the same mismatch:
freed bios go to the completion CPU's cache via bio_put_percpu_cache(),
leaving the submitter CPUs' caches empty and falling through to
mempool_alloc() -> kmem_cache_alloc() -> SLUB slow path.
In v6.19, SLUB handled this with a 3-tier allocation hierarchy:
Tier 1: CPU slab freelist lock-free (cmpxchg)
Tier 2: CPU partial slab list lock-free (per-CPU local_lock)
Tier 3: Node partial list kmem_cache_node->list_lock
The CPU partial slab list (Tier 2) was the critical buffer. It was
populated during __slab_free() -> put_cpu_partial() and provided a
lock-free pool of partial slabs per CPU. Even when the CPU slab was
exhausted, the CPU partial list could supply more slabs without
touching any shared lock.
The sheaves architecture replaces this with a 2-tier hierarchy:
Tier 1: Per-CPU sheaf lock-free (local_lock)
Tier 2: Node partial list kmem_cache_node->list_lock
The intermediate lock-free tier is gone. When the per-CPU sheaf is
empty and the spare sheaf is also empty, every refill must go through
the node partial list, requiring kmem_cache_node->list_lock. With 16
CPUs simultaneously allocating bios and all hitting empty sheaves, this
creates a thundering herd on the node list_lock.
When the local node's partial list is also depleted (objects freed on
remote nodes accumulate there instead), get_from_any_partial() kicks in
to search other NUMA nodes, compounding the contention with cross-NUMA
list_lock acquisition — explaining the 41% in get_from_any_partial ->
native_queued_spin_lock_slowpath seen in the profile.
The mitigation in 40fd0acc45d0 ("slub: avoid list_lock contention from
__refill_objects_any()") uses spin_trylock for cross-NUMA refill, but
does not address the fundamental architectural issue: the missing
lock-free intermediate caching tier that the CPU partial list provided.
Thanks,
Ming
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-24 2:52 [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation Ming Lei
@ 2026-02-24 5:00 ` Harry Yoo
2026-02-24 9:07 ` Ming Lei
2026-02-24 6:51 ` Hao Li
2026-02-24 20:27 ` Vlastimil Babka
2 siblings, 1 reply; 18+ messages in thread
From: Harry Yoo @ 2026-02-24 5:00 UTC (permalink / raw)
To: Ming Lei
Cc: Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, Hao Li, surenb
On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> Hello Vlastimil and MM guys,
Hi Ming, thanks for the report!
> The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
> 'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
> performance regression for workloads with persistent cross-CPU
> alloc/free patterns. ublk null target benchmark IOPS drops
> significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
> drop).
>
> Bisecting within the sheaves series is blocked by a kernel panic at
> 17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
> paths"), so the exact first bad commit could not be identified.
Ouch. Why did it crash?
> Reproducer
> ==========
>
> Hardware: NUMA machine with >= 32 CPUs
> Kernel: v7.0-rc (with slab/for-7.0/sheaves merged)
>
> # build kublk selftest
> make -C tools/testing/selftests/ublk/
>
> # create ublk null target device with 16 queues
> tools/testing/selftests/ublk/kublk add -t null -q 16
>
> # run fio/t/io_uring benchmark: 16 jobs, 20 seconds, non-polled
> taskset -c 0-31 fio/t/io_uring -p0 -n 16 -r 20 /dev/ublkb0
>
> # cleanup
> tools/testing/selftests/ublk/kublk del -n 0
>
> Good: v6.19 (and 41f1a08645ab, the mainline parent of the slab merge)
> Bad: 815c8e35511d (Merge branch 'slab/for-7.0/sheaves' into slab/for-next)
Thanks for such detailed steps to reproduce :)
> perf profile (bad kernel)
> =========================
>
> ~47% of CPU time is spent in bio allocation hitting the SLUB slow path,
> with massive spinlock contention on the node partial list lock:
>
> + 47.65% 1.21% io_uring [k] bio_alloc_bioset
> - 44.87% 0.45% io_uring [k] kmem_cache_alloc_noprof
> - 44.41% kmem_cache_alloc_noprof
> - 43.89% ___slab_alloc
> + 41.16% get_from_any_partial
> 0.91% get_from_partial_node
> + 0.87% alloc_from_new_slab
> + 0.65% allocate_slab
> - 44.70% 0.21% io_uring [k] mempool_alloc_noprof
> - 44.49% mempool_alloc_noprof
> - 44.43% kmem_cache_alloc_noprof
> - 43.90% ___slab_alloc
> + 41.18% get_from_any_partial
> 0.90% get_from_partial_node
> + 0.87% alloc_from_new_slab
> + 0.65% allocate_slab
> + 41.23% 0.10% io_uring [k] get_from_any_partial
> + 40.82% 0.48% io_uring [k] __raw_spin_lock_irqsave
> - 40.75% 0.20% io_uring [k] get_from_partial_node
> - 40.56% get_from_partial_node
> - 38.83% __raw_spin_lock_irqsave
> 38.65% native_queued_spin_lock_slowpath
That's pretty severe contention. Interestingly, the profile shows
severe contention on the alloc path, but I don't see the free path here.
Wondering why only the alloc path is suffering, hmm...
Anyway, I think there may be two pieces contributing to this contention:
Part 1) We probably made the portion of slowpath bigger,
by caching a smaller number of objects per CPU
after transitioning to sheaves.
Part 2) We probably made the slowpath much slower.
We need to investigate those parts separately.
Regarding Part 1:
# Point 1. The CPU slab was not considered in the sheaf capacity calculation
calculate_sheaf_capacity() does not take into account that the CPU slab
was also cached per CPU. Shouldn't we add oo_objects(s->oo) to the existing
calculation to cache a number of objects similar to the CPU slab + percpu
partial slab list layers that SLUB previously had?
# Point 2. SLUB no longer relies on the "slabs are half-full" assumption,
# and that probably means we're caching fewer objects per CPU.
Because SLUB previously assumed "slabs are half-full" when calculating
the number of slabs to cache per CPU, it could actually cache twice
as many objects as intended when slabs are mostly empty.
Because sheaves track the number of objects precisely, that inaccuracy
is gone. If the workload was previously benefiting from the inaccuracy,
sheaves can make CPUs cache fewer objects per CPU than the percpu
slab caching layer did.
Anyway, I guess we need to check how many objects are actually
cached per CPU w/ and w/o sheaves, during the benchmark.
After making sure the number of objects cached per CPU is the same as
before, we could further investigate how much Part 2 plays into it.
Slightly off-topic, by the way: slab currently doesn't let system admins
set a custom sheaf_capacity; calculate_sheaf_capacity() sets the default
capacity. I think we need to allow sysadmins to set a custom
sheaf_capacity in the very near future.
> Analysis
> ========
>
> The ublk null target workload exposes a cross-CPU slab allocation
> pattern: bios are allocated on the io_uring submitter CPU during block
> layer submission, but freed on a different CPU — the ublk daemon thread
> that runs the completion via io_uring_cmd_complete_in_task() task work.
> The completion CPU stays in the same LLC or NUMA node as the submission CPU.
Ok, so a submitter CPU keeps allocating objects, while a completion CPU
keeps freeing objects.
> This cross-CPU alloc/free pattern is not unique to ublk. The block
> layer's default rq_affinity=1 setting completes requests on a CPU
> sharing LLC with the submission CPU, which similarly causes bio freeing
> on a different CPU than allocation. The ublk null target simply makes
> this pattern more pronounced and measurable because all overhead is in
> the bio alloc/free path with no actual I/O.
>
> **The following is from AI, just for reference**
>
> The result is that the allocating CPU's per-CPU slab caches are
> continuously drained without being replenished by local frees. The bio
> layer's own per-CPU cache (bio_alloc_cache) suffers the same mismatch:
> freed bios go to the completion CPU's cache via bio_put_percpu_cache(),
> leaving the submitter CPUs' caches empty and falling through to
> mempool_alloc() -> kmem_cache_alloc() -> SLUB slow path.
Ok.
> In v6.19, SLUB handled this with a 3-tier allocation hierarchy:
>
> Tier 1: CPU slab freelist lock-free (cmpxchg)
> Tier 2: CPU partial slab list lock-free (per-CPU local_lock)
> Tier 3: Node partial list kmem_cache_node->list_lock
>
> The CPU partial slab list (Tier 2) was the critical buffer. It was
> populated during __slab_free() -> put_cpu_partial() and provided a
> lock-free pool of partial slabs per CPU. Even when the CPU slab was
> exhausted, the CPU partial list could supply more slabs without
> touching any shared lock.
Well, the sheaves layer is supposed to provide a similar lock-free pool
of objects per CPU. The percpu slab layer was supposed to cache a certain
number of objects (from multiple slabs), which is translated to the
sheaf capacity now.
> The sheaves architecture replaces this with a 2-tier hierarchy:
>
> Tier 1: Per-CPU sheaf lock-free (local_lock)
> Tier 2: Node partial list kmem_cache_node->list_lock
>
> The intermediate lock-free tier is gone. When the per-CPU sheaf is
> empty and the spare sheaf is also empty, every refill must go through
> the node partial list, requiring kmem_cache_node->list_lock. With 16
> CPUs simultaneously allocating bios and all hitting empty sheaves, this
> creates a thundering herd on the node list_lock.
>
> When the local node's partial list is also depleted (objects freed on
> remote nodes accumulate there instead), get_from_any_partial() kicks in
> to search other NUMA nodes, compounding the contention with cross-NUMA
> list_lock acquisition — explaining the 41% in get_from_any_partial ->
> native_queued_spin_lock_slowpath seen in the profile.
Again, the sheaves layer is supposed to cache a similar number of
objects previously covered by Tier 1 + Tier 2... oh, wait.
The sheaf capacity calculation logic does not take "Tier 1 CPU slab
freelist" into account.
> The mitigation in 40fd0acc45d0 ("slub: avoid list_lock contention from
> __refill_objects_any()") uses spin_trylock for cross-NUMA refill, but
> does not address the fundamental architectural issue: the missing
> lock-free intermediate caching tier that the CPU partial list provided.
>
> Thanks,
> Ming
--
Cheers,
Harry / Hyeonggon
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-24 2:52 [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation Ming Lei
2026-02-24 5:00 ` Harry Yoo
@ 2026-02-24 6:51 ` Hao Li
2026-02-24 7:10 ` Harry Yoo
2026-02-24 20:27 ` Vlastimil Babka
2 siblings, 1 reply; 18+ messages in thread
From: Hao Li @ 2026-02-24 6:51 UTC (permalink / raw)
To: Ming Lei
Cc: Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, Harry Yoo
On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> Hello Vlastimil and MM guys,
>
> The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
> 'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
> performance regression for workloads with persistent cross-CPU
> alloc/free patterns. ublk null target benchmark IOPS drops
> significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
> drop).
Thanks for testing.
>
> Bisecting within the sheaves series is blocked by a kernel panic at
> 17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
> paths"),
As Harry said, this is odd. Could you post crash logs?
> so the exact first bad commit could not be identified.
Based on my earlier test results, this performance regression (more
precisely, I suspect it is an expected return to the previous baseline -
see below) was likely introduced by two patches:
slab: add optimized sheaf refill from partial list
slab: remove SLUB_CPU_PARTIAL
https://lore.kernel.org/linux-mm/imzzlzuzjmlkhxc7hszxh5ba7jksvqcieg5rzyryijkkdhai5q@l2t4ye5quozb/
>
> Reproducer
> ==========
>
[...]
>
> The result is that the allocating CPU's per-CPU slab caches are
> continuously drained without being replenished by local frees. The bio
> layer's own per-CPU cache (bio_alloc_cache) suffers the same mismatch:
> freed bios go to the completion CPU's cache via bio_put_percpu_cache(),
> leaving the submitter CPUs' caches empty and falling through to
> mempool_alloc() -> kmem_cache_alloc() -> SLUB slow path.
>
> In v6.19, SLUB handled this with a 3-tier allocation hierarchy:
>
> Tier 1: CPU slab freelist lock-free (cmpxchg)
> Tier 2: CPU partial slab list lock-free (per-CPU local_lock)
> Tier 3: Node partial list kmem_cache_node->list_lock
>
> The CPU partial slab list (Tier 2) was the critical buffer. It was
> populated during __slab_free() -> put_cpu_partial() and provided a
> lock-free pool of partial slabs per CPU. Even when the CPU slab was
> exhausted, the CPU partial list could supply more slabs without
> touching any shared lock.
>
> The sheaves architecture replaces this with a 2-tier hierarchy:
>
> Tier 1: Per-CPU sheaf lock-free (local_lock)
> Tier 2: Node partial list kmem_cache_node->list_lock
>
> The intermediate lock-free tier is gone. When the per-CPU sheaf is
> empty and the spare sheaf is also empty, every refill must go through
> the node partial list, requiring kmem_cache_node->list_lock. With 16
> CPUs simultaneously allocating bios and all hitting empty sheaves, this
> creates a thundering herd on the node list_lock.
>
> When the local node's partial list is also depleted (objects freed on
> remote nodes accumulate there instead), get_from_any_partial() kicks in
> to search other NUMA nodes, compounding the contention with cross-NUMA
> list_lock acquisition — explaining the 41% in get_from_any_partial ->
> native_queued_spin_lock_slowpath seen in the profile.
The purpose of introducing sheaves was to fully replace the percpu partial slabs
mechanism with sheaves. During this process, we first added the sheaves caching
layer and only later removed the percpu partial slabs layer, so it's expected
that performance could first improve and then return to the previous level.
Would you mind also comparing against a baseline with "no sheaves at all" (e.g.
commit `9d4e6ab865c4`) versus "only the sheaves layer exists" (i.e. commit
`815c8e35511d`)? If those two results are close, then the ~64% performance
regression we're currently discussing might be better interpreted as returning
to the previous baseline (i.e. a reversion), rather than a true regression.
The link below contains my previous test results. According to will-it-scale,
the performance of "no sheaves at all" and "only the sheaves layer exists" is
close:
https://lore.kernel.org/linux-mm/pdmjsvpkl5nsntiwfwguplajq27ak3xpboq3ab77zrbu763pq7@la3hyiqigpir/
--
Thanks,
Hao
>
> The mitigation in 40fd0acc45d0 ("slub: avoid list_lock contention from
> __refill_objects_any()") uses spin_trylock for cross-NUMA refill, but
> does not address the fundamental architectural issue: the missing
> lock-free intermediate caching tier that the CPU partial list provided.
>
> Thanks,
> Ming
>
>
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-24 6:51 ` Hao Li
@ 2026-02-24 7:10 ` Harry Yoo
2026-02-24 7:41 ` Hao Li
0 siblings, 1 reply; 18+ messages in thread
From: Harry Yoo @ 2026-02-24 7:10 UTC (permalink / raw)
To: Hao Li
Cc: Ming Lei, Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block
On Tue, Feb 24, 2026 at 02:51:26PM +0800, Hao Li wrote:
> On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> > Reproducer
> > ==========
> >
> [...]
> >
> > The result is that the allocating CPU's per-CPU slab caches are
> > continuously drained without being replenished by local frees. The bio
> > layer's own per-CPU cache (bio_alloc_cache) suffers the same mismatch:
> > freed bios go to the completion CPU's cache via bio_put_percpu_cache(),
> > leaving the submitter CPUs' caches empty and falling through to
> > mempool_alloc() -> kmem_cache_alloc() -> SLUB slow path.
> >
> > In v6.19, SLUB handled this with a 3-tier allocation hierarchy:
> >
> > Tier 1: CPU slab freelist lock-free (cmpxchg)
> > Tier 2: CPU partial slab list lock-free (per-CPU local_lock)
> > Tier 3: Node partial list kmem_cache_node->list_lock
> >
> > The CPU partial slab list (Tier 2) was the critical buffer. It was
> > populated during __slab_free() -> put_cpu_partial() and provided a
> > lock-free pool of partial slabs per CPU. Even when the CPU slab was
> > exhausted, the CPU partial list could supply more slabs without
> > touching any shared lock.
> >
> > The sheaves architecture replaces this with a 2-tier hierarchy:
> >
> > Tier 1: Per-CPU sheaf lock-free (local_lock)
> > Tier 2: Node partial list kmem_cache_node->list_lock
> >
> > The intermediate lock-free tier is gone. When the per-CPU sheaf is
> > empty and the spare sheaf is also empty, every refill must go through
> > the node partial list, requiring kmem_cache_node->list_lock. With 16
> > CPUs simultaneously allocating bios and all hitting empty sheaves, this
> > creates a thundering herd on the node list_lock.
> >
> > When the local node's partial list is also depleted (objects freed on
> > remote nodes accumulate there instead), get_from_any_partial() kicks in
> > to search other NUMA nodes, compounding the contention with cross-NUMA
> > list_lock acquisition — explaining the 41% in get_from_any_partial ->
> > native_queued_spin_lock_slowpath seen in the profile.
>
> The purpose of introducing sheaves was to fully replace the percpu partial slabs
> mechanism with sheaves. During this process, we first added the sheaves caching
> layer and only later removed the percpu partial slabs layer, so it's expected
> that performance could first improve and then return to the previous level.
There's one difference here: you used the will-it-scale mmap2 test case,
which involves the maple tree node and vm_area_struct caches that
already have sheaves enabled in v6.19.
And Ming's benchmark stresses bio-<size> caches.
Since other caches don't have sheaves in v6.19, they're not expected to
gain performance from an additional sheaves layer on top of the CPU
slab + percpu partial slab list.
> Would you mind also comparing against a baseline with "no sheaves at all" (e.g.
> commit `9d4e6ab865c4`) versus "only the sheaves layer exists" (i.e. commit
> `815c8e35511d`)? If those two results are close, then the ~64% performance
> regression we're currently discussing might be better interpreted as returning
> to the previous baseline (i.e. a reversion), rather than a true regression.
>
> The link below contains my previous test results. According to will-it-scale,
> the performance of "no sheaves at all" and "only the sheaves layer exists" is
> close:
> https://lore.kernel.org/linux-mm/pdmjsvpkl5nsntiwfwguplajq27ak3xpboq3ab77zrbu763pq7@la3hyiqigpir/
--
Cheers,
Harry / Hyeonggon
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-24 7:10 ` Harry Yoo
@ 2026-02-24 7:41 ` Hao Li
0 siblings, 0 replies; 18+ messages in thread
From: Hao Li @ 2026-02-24 7:41 UTC (permalink / raw)
To: Harry Yoo
Cc: Ming Lei, Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block
On Tue, Feb 24, 2026 at 04:10:43PM +0900, Harry Yoo wrote:
> On Tue, Feb 24, 2026 at 02:51:26PM +0800, Hao Li wrote:
> > On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> > > Reproducer
> > > ==========
> > >
> > [...]
> > >
> > > The result is that the allocating CPU's per-CPU slab caches are
> > > continuously drained without being replenished by local frees. The bio
> > > layer's own per-CPU cache (bio_alloc_cache) suffers the same mismatch:
> > > freed bios go to the completion CPU's cache via bio_put_percpu_cache(),
> > > leaving the submitter CPUs' caches empty and falling through to
> > > mempool_alloc() -> kmem_cache_alloc() -> SLUB slow path.
> > >
> > > In v6.19, SLUB handled this with a 3-tier allocation hierarchy:
> > >
> > > Tier 1: CPU slab freelist lock-free (cmpxchg)
> > > Tier 2: CPU partial slab list lock-free (per-CPU local_lock)
> > > Tier 3: Node partial list kmem_cache_node->list_lock
> > >
> > > The CPU partial slab list (Tier 2) was the critical buffer. It was
> > > populated during __slab_free() -> put_cpu_partial() and provided a
> > > lock-free pool of partial slabs per CPU. Even when the CPU slab was
> > > exhausted, the CPU partial list could supply more slabs without
> > > touching any shared lock.
> > >
> > > The sheaves architecture replaces this with a 2-tier hierarchy:
> > >
> > > Tier 1: Per-CPU sheaf lock-free (local_lock)
> > > Tier 2: Node partial list kmem_cache_node->list_lock
> > >
> > > The intermediate lock-free tier is gone. When the per-CPU sheaf is
> > > empty and the spare sheaf is also empty, every refill must go through
> > > the node partial list, requiring kmem_cache_node->list_lock. With 16
> > > CPUs simultaneously allocating bios and all hitting empty sheaves, this
> > > creates a thundering herd on the node list_lock.
> > >
> > > When the local node's partial list is also depleted (objects freed on
> > > remote nodes accumulate there instead), get_from_any_partial() kicks in
> > > to search other NUMA nodes, compounding the contention with cross-NUMA
> > > list_lock acquisition — explaining the 41% in get_from_any_partial ->
> > > native_queued_spin_lock_slowpath seen in the profile.
> >
> > The purpose of introducing sheaves was to fully replace the percpu partial slabs
> > mechanism with sheaves. During this process, we first added the sheaves caching
> > layer and only later removed the percpu partial slabs layer, so it's expected
> > that performance could first improve and then return to the previous level.
>
> There's one difference here: you used the will-it-scale mmap2 test case,
> which involves the maple tree node and vm_area_struct caches that
> already have sheaves enabled in v6.19.
>
> And Ming's benchmark stresses bio-<size> caches.
>
> Since other caches don't have sheaves in v6.19, they're not expected to
> gain performance from an additional sheaves layer on top of the CPU
> slab + percpu partial slab list.
Oh, yes, you're right. That distinction is important!
I think I've gotten a bit stuck in a fixed way of thinking...
Thanks for pointing it out!
>
> > Would you mind also comparing against a baseline with "no sheaves at all" (e.g.
> > commit `9d4e6ab865c4`) versus "only the sheaves layer exists" (i.e. commit
> > `815c8e35511d`)? If those two results are close, then the ~64% performance
> > regression we're currently discussing might be better interpreted as returning
> > to the previous baseline (i.e. a reversion), rather than a true regression.
> >
> > The link below contains my previous test results. According to will-it-scale,
> > the performance of "no sheaves at all" and "only the sheaves layer exists" is
> > close:
> > https://lore.kernel.org/linux-mm/pdmjsvpkl5nsntiwfwguplajq27ak3xpboq3ab77zrbu763pq7@la3hyiqigpir/
>
> --
> Cheers,
> Harry / Hyeonggon
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-24 5:00 ` Harry Yoo
@ 2026-02-24 9:07 ` Ming Lei
2026-02-25 5:32 ` Hao Li
0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2026-02-24 9:07 UTC (permalink / raw)
To: Harry Yoo
Cc: Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, Hao Li, surenb
Hi Harry,
On Tue, Feb 24, 2026 at 02:00:15PM +0900, Harry Yoo wrote:
> On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> > Hello Vlastimil and MM guys,
>
> Hi Ming, thanks for the report!
>
> > The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
> > 'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
> > performance regression for workloads with persistent cross-CPU
> > alloc/free patterns. ublk null target benchmark IOPS drops
> > significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
> > drop).
> >
> > Bisecting within the sheaves series is blocked by a kernel panic at
> > 17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
> > paths"), so the exact first bad commit could not be identified.
>
> Ouch. Why did it crash?
[ 16.162422] Oops: general protection fault, probably for non-canonical address 0xdead000000000110: 0000 [#1] SMP NOPTI
[ 16.162426] CPU: 44 UID: 0 PID: 908 Comm: (udev-worker) Not tainted 6.19.0-rc5_master+ #116 PREEMPT(lazy)
[ 16.162429] Hardware name: Giga Computing MZ73-LM2-000/MZ73-LM2-000, BIOS R19_F40 05/12/2025
[ 16.162430] RIP: 0010:__put_partials+0x2f/0x140
[ 16.162437] Code: 41 57 41 56 49 89 f6 41 55 49 89 fd 31 ff 41 54 45 31 e4 55 53 48 83 ec 18 48 c7 44 24 10 00 00 00 00 eb 03 48 89 df 4c9
[ 16.162438] RSP: 0018:ff5117c0ca2dfa60 EFLAGS: 00010086
[ 16.162441] RAX: 0000000000000001 RBX: ff1b266981200d80 RCX: 0000000000000246
[ 16.162442] RDX: ff1b266981200d90 RSI: ff1b266981200d90 RDI: ff1b266981200d80
[ 16.162442] RBP: dead000000000100 R08: 0000000000000000 R09: ffffffffa761bf5e
[ 16.162443] R10: ffb6d4b7841d5400 R11: ff1b2669800575c0 R12: 0000000000000000
[ 16.162444] R13: ff1b2669800575c0 R14: dead000000000100 R15: ffb6d4b7846be410
[ 16.162445] FS: 00007f5fdccc23c0(0000) GS:ff1b267902427000(0000) knlGS:0000000000000000
[ 16.162446] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 16.162446] CR2: 0000559824c6c058 CR3: 000000011fb49001 CR4: 0000000000f71ef0
[ 16.162447] PKRU: 55555554
[ 16.162448] Call Trace:
[ 16.162450] <TASK>
[ 16.162452] kmem_cache_free+0x410/0x490
[ 16.162454] do_readlinkat+0x14e/0x180
[ 16.162459] __x64_sys_readlinkat+0x1c/0x30
[ 16.162461] do_syscall_64+0x7e/0x6b0
[ 16.162465] ? post_alloc_hook+0xb9/0x140
[ 16.162468] ? get_page_from_freelist+0x478/0x720
[ 16.162470] ? path_openat+0xb3/0x2a0
[ 16.162472] ? __alloc_frozen_pages_noprof+0x192/0x350
[ 16.162474] ? count_memcg_events+0xd6/0x210
[ 16.162476] ? memcg1_commit_charge+0x7a/0xa0
[ 16.162479] ? mod_memcg_lruvec_state+0xe7/0x2d0
[ 16.162481] ? charge_memcg+0x48/0x80
[ 16.162482] ? lruvec_stat_mod_folio+0x85/0xd0
[ 16.162484] ? __folio_mod_stat+0x2d/0x90
[ 16.162487] ? set_ptes.isra.0+0x36/0x80
[ 16.162490] ? do_anonymous_page+0x100/0x4a0
[ 16.162492] ? __handle_mm_fault+0x45d/0x6f0
[ 16.162493] ? count_memcg_events+0xd6/0x210
[ 16.162494] ? handle_mm_fault+0x212/0x340
[ 16.162495] ? do_user_addr_fault+0x2b4/0x7b0
[ 16.162500] ? irqentry_exit+0x6d/0x540
[ 16.162502] ? exc_page_fault+0x7e/0x1a0
[ 16.162503] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> > Reproducer
> > ==========
> >
> > Hardware: NUMA machine with >= 32 CPUs
> > Kernel: v7.0-rc (with slab/for-7.0/sheaves merged)
> >
> > # build kublk selftest
> > make -C tools/testing/selftests/ublk/
> >
> > # create ublk null target device with 16 queues
> > tools/testing/selftests/ublk/kublk add -t null -q 16
> >
> > # run fio/t/io_uring benchmark: 16 jobs, 20 seconds, non-polled
> > taskset -c 0-31 fio/t/io_uring -p0 -n 16 -r 20 /dev/ublkb0
> >
> > # cleanup
> > tools/testing/selftests/ublk/kublk del -n 0
> >
> > Good: v6.19 (and 41f1a08645ab, the mainline parent of the slab merge)
> > Bad: 815c8e35511d (Merge branch 'slab/for-7.0/sheaves' into slab/for-next)
>
> Thanks for such detailed steps to reproduce :)
>
> > perf profile (bad kernel)
> > =========================
> >
> > ~47% of CPU time is spent in bio allocation hitting the SLUB slow path,
> > with massive spinlock contention on the node partial list lock:
> >
> > + 47.65% 1.21% io_uring [k] bio_alloc_bioset
> > - 44.87% 0.45% io_uring [k] kmem_cache_alloc_noprof
> > - 44.41% kmem_cache_alloc_noprof
> > - 43.89% ___slab_alloc
> > + 41.16% get_from_any_partial
> > 0.91% get_from_partial_node
> > + 0.87% alloc_from_new_slab
> > + 0.65% allocate_slab
> > - 44.70% 0.21% io_uring [k] mempool_alloc_noprof
> > - 44.49% mempool_alloc_noprof
> > - 44.43% kmem_cache_alloc_noprof
> > - 43.90% ___slab_alloc
> > + 41.18% get_from_any_partial
> > 0.90% get_from_partial_node
> > + 0.87% alloc_from_new_slab
> > + 0.65% allocate_slab
> > + 41.23% 0.10% io_uring [k] get_from_any_partial
> > + 40.82% 0.48% io_uring [k] __raw_spin_lock_irqsave
> > - 40.75% 0.20% io_uring [k] get_from_partial_node
> > - 40.56% get_from_partial_node
> > - 38.83% __raw_spin_lock_irqsave
> > 38.65% native_queued_spin_lock_slowpath
>
> That's pretty severe contention. Interestingly, the profile shows
> severe contention on the alloc path, but I don't see the free path here.
> Wondering why only the alloc path is suffering, hmm...
The free path looks fine.
+ 2.84% 0.16% kublk [kernel.kallsyms] [k] mempool_free
+ 2.66% 0.17% kublk [kernel.kallsyms] [k] security_uring_cmd
+ 2.57% 0.36% kublk [kernel.kallsyms] [k] __slab_free
>
> Anyway, I think there may be two pieces contributing to this contention:
>
> Part 1) We probably made the portion of slowpath bigger,
> by caching a smaller number of objects per CPU
> after transitioning to sheaves.
>
> Part 2) We probably made the slowpath much slower.
>
> We need to investigate those parts separately.
>
> Regarding Part 1:
>
> # Point 1. The CPU slab was not considered in the sheaf capacity calculation
>
> calculate_sheaf_capacity() does not take into account that the CPU slab
> was also cached per CPU. Shouldn't we add oo_objects(s->oo) to the existing
> calculation to cache a number of objects similar to the CPU slab + percpu
> partial slab list layers that SLUB previously had?
>
> # Point 2. SLUB no longer relies on "Slabs are half-full" assumption,
> # and that probably means we're caching less objects per CPU.
>
> Because SLUB previously assumed "slabs are half-full" when calculating
> the number of slabs to cache per CPU, it could actually cache twice
> as many objects as intended when slabs are mostly empty.
>
> Because sheaves track the number of objects precisely, that inaccuracy
> is gone. If the workload was previously benefiting from the inaccuracy,
> sheaves can make CPUs cache a smaller number of objects per CPU compared
> to the percpu slab caching layer.
>
> Anyway, I guess we need to check how many objects are actually
> cached per CPU w/ and w/o sheaves, during the benchmark.
In the workload `fio/t/io_uring -p0 -n 16 -r 20 /dev/ublkb0`, queue depth
is 128, so there should be 128 inflight bios on these 16 tasks/cpus.
>
> After making sure the number of objects cached per CPU is the same as
> before, we could further investigate how much Part 2 plays into it.
>
> Slightly off-topic, by the way, slab currently doesn't let system admins
> set custom sheaf_capacity. Instead, calculate_sheaf_capacity() sets
> the default capacity. I think we need to allow sys admins to set a custom
> sheaf_capacity in the very near future.
>
> > Analysis
> > ========
> >
> > The ublk null target workload exposes a cross-CPU slab allocation
> > pattern: bios are allocated on the io_uring submitter CPU during block
> > layer submission, but freed on a different CPU — the ublk daemon thread
> > that runs the completion via io_uring_cmd_complete_in_task() task work.
> > And the completion CPU stays in the same LLC or NUMA node as the submission CPU.
>
> Ok, so a submitter CPU keeps allocating objects, while a completion CPU
> keeps freeing objects.
Yes.
Thanks,
Ming
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-24 2:52 [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation Ming Lei
2026-02-24 5:00 ` Harry Yoo
2026-02-24 6:51 ` Hao Li
@ 2026-02-24 20:27 ` Vlastimil Babka
2026-02-25 5:24 ` Harry Yoo
2026-02-25 8:45 ` Vlastimil Babka (SUSE)
2 siblings, 2 replies; 18+ messages in thread
From: Vlastimil Babka @ 2026-02-24 20:27 UTC (permalink / raw)
To: Ming Lei, Vlastimil Babka, Andrew Morton
Cc: linux-mm, linux-kernel, linux-block, Harry Yoo, Hao Li,
Christoph Hellwig
On 2/24/26 3:52 AM, Ming Lei wrote:
> Hello Vlastimil and MM guys,
>
> The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
> 'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
> performance regression for workloads with persistent cross-CPU
> alloc/free patterns. ublk null target benchmark IOPS drops
> significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
> drop).
>
> Bisecting within the sheaves series is blocked by a kernel panic at
> 17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
> paths"), so the exact first bad commit could not be identified.
>
> Reproducer
> ==========
>
> Hardware: NUMA machine with >= 32 CPUs
> Kernel: v7.0-rc (with slab/for-7.0/sheaves merged)
>
> # build kublk selftest
> make -C tools/testing/selftests/ublk/
>
> # create ublk null target device with 16 queues
> tools/testing/selftests/ublk/kublk add -t null -q 16
>
> # run fio/t/io_uring benchmark: 16 jobs, 20 seconds, non-polled
> taskset -c 0-31 fio/t/io_uring -p0 -n 16 -r 20 /dev/ublkb0
>
> # cleanup
> tools/testing/selftests/ublk/kublk del -n 0
>
> Good: v6.19 (and 41f1a08645ab, the mainline parent of the slab merge)
> Bad: 815c8e35511d (Merge branch 'slab/for-7.0/sheaves' into slab/for-next)
>
> perf profile (bad kernel)
> =========================
>
> ~47% of CPU time is spent in bio allocation hitting the SLUB slow path,
> with massive spinlock contention on the node partial list lock:
>
> + 47.65% 1.21% io_uring [k] bio_alloc_bioset
> - 44.87% 0.45% io_uring [k] kmem_cache_alloc_noprof
> - 44.41% kmem_cache_alloc_noprof
> - 43.89% ___slab_alloc
> + 41.16% get_from_any_partial
So this function is not used in the sheaf refill path, but in the
fallback slowpath when alloc_from_pcs() fastpath fails.
> 0.91% get_from_partial_node
> + 0.87% alloc_from_new_slab
> + 0.65% allocate_slab
> - 44.70% 0.21% io_uring [k] mempool_alloc_noprof
> - 44.49% mempool_alloc_noprof
> - 44.43% kmem_cache_alloc_noprof
And I'd guess alloc_from_pcs() fails because in
__pcs_replace_empty_main() we have gfpflags_allow_blocking() false,
because mempool_alloc_noprof() tries the first attempt without
__GFP_DIRECT_RECLAIM. The allocation itself will still succeed, but we
end up relying on the slowpath all the time and performance will drop.
It made sense to me not to refill sheaves when we can't reclaim, but I
didn't anticipate this interaction with mempools. We could change them
but there might be others using a similar pattern. Maybe it would be for
the best to just drop that heuristic from __pcs_replace_empty_main()
(but carefully as some deadlock avoidance depends on it, we might need
to e.g. replace it with gfpflags_allow_spinning()). I'll send a patch
tomorrow to test this theory, unless someone beats me to it (feel free to).
Until then IMHO we can dismiss the AI explanation and also the
insufficient sheaf capacity theories.
> - 43.90% ___slab_alloc
> + 41.18% get_from_any_partial
> 0.90% get_from_partial_node
> + 0.87% alloc_from_new_slab
> + 0.65% allocate_slab
> + 41.23% 0.10% io_uring [k] get_from_any_partial
> + 40.82% 0.48% io_uring [k] __raw_spin_lock_irqsave
> - 40.75% 0.20% io_uring [k] get_from_partial_node
> - 40.56% get_from_partial_node
> - 38.83% __raw_spin_lock_irqsave
> 38.65% native_queued_spin_lock_slowpath
>
> Analysis
> ========
>
> The ublk null target workload exposes a cross-CPU slab allocation
> pattern: bios are allocated on the io_uring submitter CPU during block
> layer submission, but freed on a different CPU — the ublk daemon thread
> that runs the completion via io_uring_cmd_complete_in_task() task work.
> And the completion CPU stays in the same LLC or NUMA node as the submission CPU.
>
> This cross-CPU alloc/free pattern is not unique to ublk. The block
> layer's default rq_affinity=1 setting completes requests on a CPU
> sharing LLC with the submission CPU, which similarly causes bio freeing
> on a different CPU than allocation. The ublk null target simply makes
> this pattern more pronounced and measurable because all overhead is in
> the bio alloc/free path with no actual I/O.
>
> **The following is from AI, just for reference**
>
> The result is that the allocating CPU's per-CPU slab caches are
> continuously drained without being replenished by local frees. The bio
> layer's own per-CPU cache (bio_alloc_cache) suffers the same mismatch:
> freed bios go to the completion CPU's cache via bio_put_percpu_cache(),
> leaving the submitter CPUs' caches empty and falling through to
> mempool_alloc() -> kmem_cache_alloc() -> SLUB slow path.
>
> In v6.19, SLUB handled this with a 3-tier allocation hierarchy:
>
> Tier 1: CPU slab freelist lock-free (cmpxchg)
> Tier 2: CPU partial slab list lock-free (per-CPU local_lock)
> Tier 3: Node partial list kmem_cache_node->list_lock
>
> The CPU partial slab list (Tier 2) was the critical buffer. It was
> populated during __slab_free() -> put_cpu_partial() and provided a
> lock-free pool of partial slabs per CPU. Even when the CPU slab was
> exhausted, the CPU partial list could supply more slabs without
> touching any shared lock.
>
> The sheaves architecture replaces this with a 2-tier hierarchy:
>
> Tier 1: Per-CPU sheaf lock-free (local_lock)
> Tier 2: Node partial list kmem_cache_node->list_lock
>
> The intermediate lock-free tier is gone. When the per-CPU sheaf is
> empty and the spare sheaf is also empty, every refill must go through
> the node partial list, requiring kmem_cache_node->list_lock. With 16
> CPUs simultaneously allocating bios and all hitting empty sheaves, this
> creates a thundering herd on the node list_lock.
>
> When the local node's partial list is also depleted (objects freed on
> remote nodes accumulate there instead), get_from_any_partial() kicks in
> to search other NUMA nodes, compounding the contention with cross-NUMA
> list_lock acquisition — explaining the 41% in get_from_any_partial ->
> native_queued_spin_lock_slowpath seen in the profile.
>
> The mitigation in 40fd0acc45d0 ("slub: avoid list_lock contention from
> __refill_objects_any()") uses spin_trylock for cross-NUMA refill, but
> does not address the fundamental architectural issue: the missing
> lock-free intermediate caching tier that the CPU partial list provided.
>
> Thanks,
> Ming
>
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-24 20:27 ` Vlastimil Babka
@ 2026-02-25 5:24 ` Harry Yoo
2026-02-25 8:45 ` Vlastimil Babka (SUSE)
1 sibling, 0 replies; 18+ messages in thread
From: Harry Yoo @ 2026-02-25 5:24 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Ming Lei, Andrew Morton, linux-mm, linux-kernel, linux-block,
Hao Li, Christoph Hellwig
On Tue, Feb 24, 2026 at 09:27:40PM +0100, Vlastimil Babka wrote:
> On 2/24/26 3:52 AM, Ming Lei wrote:
> > Hello Vlastimil and MM guys,
> >
> > The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
> > 'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
> > performance regression for workloads with persistent cross-CPU
> > alloc/free patterns. ublk null target benchmark IOPS drops
> > significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
> > drop).
> >
> > Bisecting within the sheaves series is blocked by a kernel panic at
> > 17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
> > paths"), so the exact first bad commit could not be identified.
> >
> > Reproducer
> > ==========
> >
> > Hardware: NUMA machine with >= 32 CPUs
> > Kernel: v7.0-rc (with slab/for-7.0/sheaves merged)
> >
> > # build kublk selftest
> > make -C tools/testing/selftests/ublk/
> >
> > # create ublk null target device with 16 queues
> > tools/testing/selftests/ublk/kublk add -t null -q 16
> >
> > # run fio/t/io_uring benchmark: 16 jobs, 20 seconds, non-polled
> > taskset -c 0-31 fio/t/io_uring -p0 -n 16 -r 20 /dev/ublkb0
> >
> > # cleanup
> > tools/testing/selftests/ublk/kublk del -n 0
> >
> > Good: v6.19 (and 41f1a08645ab, the mainline parent of the slab merge)
> > Bad: 815c8e35511d (Merge branch 'slab/for-7.0/sheaves' into slab/for-next)
> >
> > perf profile (bad kernel)
> > =========================
> >
> > ~47% of CPU time is spent in bio allocation hitting the SLUB slow path,
> > with massive spinlock contention on the node partial list lock:
> >
> > + 47.65% 1.21% io_uring [k] bio_alloc_bioset
> > - 44.87% 0.45% io_uring [k] kmem_cache_alloc_noprof
> > - 44.41% kmem_cache_alloc_noprof
> > - 43.89% ___slab_alloc
> > + 41.16% get_from_any_partial
>
> So this function is not used in the sheaf refill path, but in the
> fallback slowpath when alloc_from_pcs() fastpath fails.
Good point.
> > 0.91% get_from_partial_node
> > + 0.87% alloc_from_new_slab
> > + 0.65% allocate_slab
> > - 44.70% 0.21% io_uring [k] mempool_alloc_noprof
> > - 44.49% mempool_alloc_noprof
> > - 44.43% kmem_cache_alloc_noprof
>
> And I'd guess alloc_from_pcs() fails because in
> __pcs_replace_empty_main() we have gfpflags_allow_blocking() false,
> because mempool_alloc_noprof() tries the first attempt without
> __GFP_DIRECT_RECLAIM. The allocation itself will still succeed, but we
> end up relying on the slowpath all the time and performance will drop.
That's a very good point. I was missing that aspect.
> It made sense to me not to refill sheaves when we can't reclaim, but I
> didn't anticipate this interaction with mempools.
Me neither :)
> We could change them but there might be others using a similar pattern.
Probably, yes.
> Maybe it would be for the best to just drop that heuristic from
> __pcs_replace_empty_main()
Sounds fair.
> (but carefully as some deadlock avoidance depends on it, we might need
> to e.g. replace it with gfpflags_allow_spinning()). I'll send a patch
> tomorrow to test this theory, unless someone beats me to it (feel free to).
I think your point is valid. Let's give it a try.
> Until then IMHO we can dismiss the AI explanation and also the
> insufficient sheaf capacity theories.
Yeah :) let's first see how it performs after addressing your point.
--
Cheers,
Harry / Hyeonggon
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-24 9:07 ` Ming Lei
@ 2026-02-25 5:32 ` Hao Li
2026-02-25 6:54 ` Harry Yoo
0 siblings, 1 reply; 18+ messages in thread
From: Hao Li @ 2026-02-25 5:32 UTC (permalink / raw)
To: Ming Lei
Cc: Harry Yoo, Vlastimil Babka, Andrew Morton, linux-mm,
linux-kernel, linux-block, surenb
On Tue, Feb 24, 2026 at 05:07:18PM +0800, Ming Lei wrote:
> Hi Harry,
>
> On Tue, Feb 24, 2026 at 02:00:15PM +0900, Harry Yoo wrote:
> > On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> > > Hello Vlastimil and MM guys,
> >
> > Hi Ming, thanks for the report!
> >
> > > The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
> > > 'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
> > > performance regression for workloads with persistent cross-CPU
> > > alloc/free patterns. ublk null target benchmark IOPS drops
> > > significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
> > > drop).
> > >
> > > Bisecting within the sheaves series is blocked by a kernel panic at
> > > 17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
> > > paths"), so the exact first bad commit could not be identified.
> >
> > Ouch. Why did it crash?
>
> [ 16.162422] Oops: general protection fault, probably for non-canonical address 0xdead000000000110: 0000 [#1] SMP NOPTI
> [ 16.162426] CPU: 44 UID: 0 PID: 908 Comm: (udev-worker) Not tainted 6.19.0-rc5_master+ #116 PREEMPT(lazy)
> [ 16.162429] Hardware name: Giga Computing MZ73-LM2-000/MZ73-LM2-000, BIOS R19_F40 05/12/2025
> [ 16.162430] RIP: 0010:__put_partials+0x2f/0x140
> [ 16.162437] Code: 41 57 41 56 49 89 f6 41 55 49 89 fd 31 ff 41 54 45 31 e4 55 53 48 83 ec 18 48 c7 44 24 10 00 00 00 00 eb 03 48 89 df 4c9
> [ 16.162438] RSP: 0018:ff5117c0ca2dfa60 EFLAGS: 00010086
> [ 16.162441] RAX: 0000000000000001 RBX: ff1b266981200d80 RCX: 0000000000000246
> [ 16.162442] RDX: ff1b266981200d90 RSI: ff1b266981200d90 RDI: ff1b266981200d80
> [ 16.162442] RBP: dead000000000100 R08: 0000000000000000 R09: ffffffffa761bf5e
> [ 16.162443] R10: ffb6d4b7841d5400 R11: ff1b2669800575c0 R12: 0000000000000000
> [ 16.162444] R13: ff1b2669800575c0 R14: dead000000000100 R15: ffb6d4b7846be410
> [ 16.162445] FS: 00007f5fdccc23c0(0000) GS:ff1b267902427000(0000) knlGS:0000000000000000
> [ 16.162446] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 16.162446] CR2: 0000559824c6c058 CR3: 000000011fb49001 CR4: 0000000000f71ef0
> [ 16.162447] PKRU: 55555554
> [ 16.162448] Call Trace:
> [ 16.162450] <TASK>
> [ 16.162452] kmem_cache_free+0x410/0x490
> [ 16.162454] do_readlinkat+0x14e/0x180
> [ 16.162459] __x64_sys_readlinkat+0x1c/0x30
> [ 16.162461] do_syscall_64+0x7e/0x6b0
> [ 16.162465] ? post_alloc_hook+0xb9/0x140
> [ 16.162468] ? get_page_from_freelist+0x478/0x720
> [ 16.162470] ? path_openat+0xb3/0x2a0
> [ 16.162472] ? __alloc_frozen_pages_noprof+0x192/0x350
> [ 16.162474] ? count_memcg_events+0xd6/0x210
> [ 16.162476] ? memcg1_commit_charge+0x7a/0xa0
> [ 16.162479] ? mod_memcg_lruvec_state+0xe7/0x2d0
> [ 16.162481] ? charge_memcg+0x48/0x80
> [ 16.162482] ? lruvec_stat_mod_folio+0x85/0xd0
> [ 16.162484] ? __folio_mod_stat+0x2d/0x90
> [ 16.162487] ? set_ptes.isra.0+0x36/0x80
> [ 16.162490] ? do_anonymous_page+0x100/0x4a0
> [ 16.162492] ? __handle_mm_fault+0x45d/0x6f0
> [ 16.162493] ? count_memcg_events+0xd6/0x210
> [ 16.162494] ? handle_mm_fault+0x212/0x340
> [ 16.162495] ? do_user_addr_fault+0x2b4/0x7b0
> [ 16.162500] ? irqentry_exit+0x6d/0x540
> [ 16.162502] ? exc_page_fault+0x7e/0x1a0
> [ 16.162503] entry_SYSCALL_64_after_hwframe+0x76/0x7e
For this problem, I have a hypothesis which is inspired by a comment in the
patch "slab: remove cpu (partial) slabs usage from allocation paths":
/*
* get a single object from the slab. This might race against __slab_free(),
* which however has to take the list_lock if it's about to make the slab fully
* free.
*/
My understanding is that this comment is pointing out a possible race between
__slab_free() and get_from_partial_node(). Since __slab_free() takes
n->list_lock when it is about to make the slab fully free, and
get_from_partial_node() also takes the same lock, the two paths should be
mutually excluded by the lock and thus safe.
However, I'm wondering if there could be another race window. Suppose CPU0's
get_from_partial_node() has already finished __slab_update_freelist(), but has
not yet reached remove_partial(). In that gap, another CPU (CPU1) could free an
object to the same slab via __slab_free(). CPU1 would observe was_full == 1
(due to the previous get_from_partial_node()->__slab_update_freelist() on
CPU0), and then __slab_free() would call put_cpu_partial(s, slab, 1) without
holding n->list_lock, trying to add this slab to the CPU partial list. In that
case, both paths would operate on the same union field in struct slab, which
might lead to list corruption.
>
> >
> > > Reproducer
> > > ==========
> > >
> > > Hardware: NUMA machine with >= 32 CPUs
> > > Kernel: v7.0-rc (with slab/for-7.0/sheaves merged)
> > >
> > > # build kublk selftest
> > > make -C tools/testing/selftests/ublk/
> > >
> > > # create ublk null target device with 16 queues
> > > tools/testing/selftests/ublk/kublk add -t null -q 16
> > >
> > > # run fio/t/io_uring benchmark: 16 jobs, 20 seconds, non-polled
> > > taskset -c 0-31 fio/t/io_uring -p0 -n 16 -r 20 /dev/ublkb0
> > >
> > > # cleanup
> > > tools/testing/selftests/ublk/kublk del -n 0
> > >
> > > Good: v6.19 (and 41f1a08645ab, the mainline parent of the slab merge)
> > > Bad: 815c8e35511d (Merge branch 'slab/for-7.0/sheaves' into slab/for-next)
> >
> > Thanks for such detailed steps to reproduce :)
> >
> > > perf profile (bad kernel)
> > > =========================
> > >
> > > ~47% of CPU time is spent in bio allocation hitting the SLUB slow path,
> > > with massive spinlock contention on the node partial list lock:
> > >
> > > + 47.65% 1.21% io_uring [k] bio_alloc_bioset
> > > - 44.87% 0.45% io_uring [k] kmem_cache_alloc_noprof
> > > - 44.41% kmem_cache_alloc_noprof
> > > - 43.89% ___slab_alloc
> > > + 41.16% get_from_any_partial
> > > 0.91% get_from_partial_node
> > > + 0.87% alloc_from_new_slab
> > > + 0.65% allocate_slab
> > > - 44.70% 0.21% io_uring [k] mempool_alloc_noprof
> > > - 44.49% mempool_alloc_noprof
> > > - 44.43% kmem_cache_alloc_noprof
> > > - 43.90% ___slab_alloc
> > > + 41.18% get_from_any_partial
> > > 0.90% get_from_partial_node
> > > + 0.87% alloc_from_new_slab
> > > + 0.65% allocate_slab
> > > + 41.23% 0.10% io_uring [k] get_from_any_partial
> > > + 40.82% 0.48% io_uring [k] __raw_spin_lock_irqsave
> > > - 40.75% 0.20% io_uring [k] get_from_partial_node
> > > - 40.56% get_from_partial_node
> > > - 38.83% __raw_spin_lock_irqsave
> > > 38.65% native_queued_spin_lock_slowpath
> >
> > That's pretty severe contention. Interestingly, the profile shows
> > severe contention on the alloc path, but I don't see the free path here.
> > I'm wondering why only the alloc path is suffering, hmm...
>
> The free path looks fine.
>
> + 2.84% 0.16% kublk [kernel.kallsyms] [k] mempool_free
> + 2.66% 0.17% kublk [kernel.kallsyms] [k] security_uring_cmd
> + 2.57% 0.36% kublk [kernel.kallsyms] [k] __slab_free
>
> >
> > Anyway, I think there may be two pieces contributing to this contention:
> >
> > Part 1) We probably made the portion of slowpath bigger,
> > by caching a smaller number of objects per CPU
> > after transitioning to sheaves.
> >
> > Part 2) We probably made the slowpath much slower.
> >
> > We need to investigate those parts separately.
> >
> > Regarding Part 1:
> >
> > # Point 1. The CPU slab was not considered in the sheaf capacity calculation
> >
> > calculate_sheaf_capacity() does not take into account that the CPU slab
> > was also cached per CPU. Shouldn't we add oo_objects(s->oo) to the existing
> > calculation to cache a number of objects similar to the CPU slab + percpu
> > partial slab list layers that SLUB previously had?
> >
> > # Point 2. SLUB no longer relies on "Slabs are half-full" assumption,
> > # and that probably means we're caching less objects per CPU.
> >
> > Because SLUB previously assumed "slabs are half-full" when calculating
> > the number of slabs to cache per CPU, it could actually cache twice
> > as many objects as intended when slabs are mostly empty.
> >
> > Because sheaves track the number of objects precisely, that inaccuracy
> > is gone. If the workload was previously benefiting from the inaccuracy,
> > sheaves can make CPUs cache a smaller number of objects per CPU compared
> > to the percpu slab caching layer.
> >
> > Anyway, I guess we need to check how many objects are actually
> > cached per CPU w/ and w/o sheaves, during the benchmark.
>
> In the workload `fio/t/io_uring -p0 -n 16 -r 20 /dev/ublkb0`, queue depth
> is 128, so there should be 128 inflight bios on these 16 tasks/cpus.
>
> >
> > After making sure the number of objects cached per CPU is the same as
> > before, we could further investigate how much Part 2 plays into it.
> >
> > Slightly off-topic, by the way, slab currently doesn't let system admins
> > set custom sheaf_capacity. Instead, calculate_sheaf_capacity() sets
> > the default capacity. I think we need to allow sys admins to set a custom
> > sheaf_capacity in the very near future.
> >
> > > Analysis
> > > ========
> > >
> > > The ublk null target workload exposes a cross-CPU slab allocation
> > > pattern: bios are allocated on the io_uring submitter CPU during block
> > > layer submission, but freed on a different CPU — the ublk daemon thread
> > > that runs the completion via io_uring_cmd_complete_in_task() task work.
> > > And the completion CPU stays in the same LLC or NUMA node as the submission CPU.
> >
> > Ok, so a submitter CPU keeps allocating objects, while a completion CPU
> > keeps freeing objects.
>
> Yes.
>
>
> Thanks,
> Ming
>
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-25 5:32 ` Hao Li
@ 2026-02-25 6:54 ` Harry Yoo
2026-02-25 7:06 ` Hao Li
0 siblings, 1 reply; 18+ messages in thread
From: Harry Yoo @ 2026-02-25 6:54 UTC (permalink / raw)
To: Hao Li
Cc: Ming Lei, Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, surenb
On Wed, Feb 25, 2026 at 01:32:36PM +0800, Hao Li wrote:
> On Tue, Feb 24, 2026 at 05:07:18PM +0800, Ming Lei wrote:
> > Hi Harry,
> >
> > On Tue, Feb 24, 2026 at 02:00:15PM +0900, Harry Yoo wrote:
> > > On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> > > > Hello Vlastimil and MM guys,
> > >
> > > Hi Ming, thanks for the report!
> > >
> > > > The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
> > > > 'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
> > > > performance regression for workloads with persistent cross-CPU
> > > > alloc/free patterns. ublk null target benchmark IOPS drops
> > > > significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
> > > > drop).
> > > >
> > > > Bisecting within the sheaves series is blocked by a kernel panic at
> > > > 17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
> > > > paths"), so the exact first bad commit could not be identified.
> > >
> > > Ouch. Why did it crash?
> >
> > [ 16.162422] Oops: general protection fault, probably for non-canonical address 0xdead000000000110: 0000 [#1] SMP NOPTI
> > [ 16.162426] CPU: 44 UID: 0 PID: 908 Comm: (udev-worker) Not tainted 6.19.0-rc5_master+ #116 PREEMPT(lazy)
> > [ 16.162429] Hardware name: Giga Computing MZ73-LM2-000/MZ73-LM2-000, BIOS R19_F40 05/12/2025
> > [ 16.162430] RIP: 0010:__put_partials+0x2f/0x140
> > [ 16.162437] Code: 41 57 41 56 49 89 f6 41 55 49 89 fd 31 ff 41 54 45 31 e4 55 53 48 83 ec 18 48 c7 44 24 10 00 00 00 00 eb 03 48 89 df 4c9
> > [ 16.162438] RSP: 0018:ff5117c0ca2dfa60 EFLAGS: 00010086
> > [ 16.162441] RAX: 0000000000000001 RBX: ff1b266981200d80 RCX: 0000000000000246
> > [ 16.162442] RDX: ff1b266981200d90 RSI: ff1b266981200d90 RDI: ff1b266981200d80
> > [ 16.162442] RBP: dead000000000100 R08: 0000000000000000 R09: ffffffffa761bf5e
> > [ 16.162443] R10: ffb6d4b7841d5400 R11: ff1b2669800575c0 R12: 0000000000000000
> > [ 16.162444] R13: ff1b2669800575c0 R14: dead000000000100 R15: ffb6d4b7846be410
> > [ 16.162445] FS: 00007f5fdccc23c0(0000) GS:ff1b267902427000(0000) knlGS:0000000000000000
> > [ 16.162446] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 16.162446] CR2: 0000559824c6c058 CR3: 000000011fb49001 CR4: 0000000000f71ef0
> > [ 16.162447] PKRU: 55555554
> > [ 16.162448] Call Trace:
> > [ 16.162450] <TASK>
> > [ 16.162452] kmem_cache_free+0x410/0x490
> > [ 16.162454] do_readlinkat+0x14e/0x180
> > [ 16.162459] __x64_sys_readlinkat+0x1c/0x30
> > [ 16.162461] do_syscall_64+0x7e/0x6b0
> > [ 16.162465] ? post_alloc_hook+0xb9/0x140
> > [ 16.162468] ? get_page_from_freelist+0x478/0x720
> > [ 16.162470] ? path_openat+0xb3/0x2a0
> > [ 16.162472] ? __alloc_frozen_pages_noprof+0x192/0x350
> > [ 16.162474] ? count_memcg_events+0xd6/0x210
> > [ 16.162476] ? memcg1_commit_charge+0x7a/0xa0
> > [ 16.162479] ? mod_memcg_lruvec_state+0xe7/0x2d0
> > [ 16.162481] ? charge_memcg+0x48/0x80
> > [ 16.162482] ? lruvec_stat_mod_folio+0x85/0xd0
> > [ 16.162484] ? __folio_mod_stat+0x2d/0x90
> > [ 16.162487] ? set_ptes.isra.0+0x36/0x80
> > [ 16.162490] ? do_anonymous_page+0x100/0x4a0
> > [ 16.162492] ? __handle_mm_fault+0x45d/0x6f0
> > [ 16.162493] ? count_memcg_events+0xd6/0x210
> > [ 16.162494] ? handle_mm_fault+0x212/0x340
> > [ 16.162495] ? do_user_addr_fault+0x2b4/0x7b0
> > [ 16.162500] ? irqentry_exit+0x6d/0x540
> > [ 16.162502] ? exc_page_fault+0x7e/0x1a0
> > [ 16.162503] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> For this problem, I have a hypothesis which is inspired by a comment in the
> patch "slab: remove cpu (partial) slabs usage from allocation paths":
>
> /*
> * get a single object from the slab. This might race against __slab_free(),
> * which however has to take the list_lock if it's about to make the slab fully
> * free.
> */
>
> My understanding is that this comment is pointing out a possible race between
> __slab_free() and get_from_partial_node(). Since __slab_free() takes
> n->list_lock when it is about to make the slab fully free, and
> get_from_partial_node() also takes the same lock, the two paths should be
> mutually excluded by the lock and thus safe.
>
> However, I'm wondering if there could be another race window. Suppose CPU0's
> get_from_partial_node() has already finished __slab_update_freelist(), but has
> not yet reached remove_partial(). In that gap, another CPU1 could free an object
> to the same slab via __slab_free(). CPU1 would observe was_full == 1 (due to the
> previous get_from_partial_node()->__slab_update_freelist() on CPU0), and then
>
> __slab_free() will call put_cpu_partial(s, slab, 1) without holding
> n->list_lock, trying to add this slab to the CPU partial list.
If CPU1 observes was_full == 1, it should spin on n->list_lock and wait
for CPU0 to release the lock. And CPU0 will remove the slab from the
partial list before releasing the lock. Or am I missing something?
> In that case,
> both paths would operate on the same union field in struct slab, which might
> lead to list corruption.
I'm not sure how the scenario you describe could happen:

CPU 0                                  CPU 1
- get_from_partial_node()
  -> spin_lock(&n->list_lock)
                                       - __slab_free()
                                         -> __slab_update_freelist(),
                                            slab becomes full
                                         -> was_full == 1
                                         -> spin_lock(&n->list_lock)
                                            // starts spinning
  -> if (!new.freelist)
       -> remove_partial()
  -> spin_unlock()
                                         -> spin_lock(&n->list_lock)
                                            // acquired!
                                         -> slab_update_freelist()
                                         -> spin_unlock(&n->list_lock)
--
Cheers,
Harry / Hyeonggon
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-25 6:54 ` Harry Yoo
@ 2026-02-25 7:06 ` Hao Li
2026-02-25 7:19 ` Harry Yoo
0 siblings, 1 reply; 18+ messages in thread
From: Hao Li @ 2026-02-25 7:06 UTC (permalink / raw)
To: Harry Yoo
Cc: Ming Lei, Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, surenb
On Wed, Feb 25, 2026 at 03:54:06PM +0900, Harry Yoo wrote:
> On Wed, Feb 25, 2026 at 01:32:36PM +0800, Hao Li wrote:
> > On Tue, Feb 24, 2026 at 05:07:18PM +0800, Ming Lei wrote:
> > > Hi Harry,
> > >
> > > On Tue, Feb 24, 2026 at 02:00:15PM +0900, Harry Yoo wrote:
> > > > On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> > > > > Hello Vlastimil and MM guys,
> > > >
> > > > Hi Ming, thanks for the report!
> > > >
> > > > > The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
> > > > > 'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
> > > > > performance regression for workloads with persistent cross-CPU
> > > > > alloc/free patterns. ublk null target benchmark IOPS drops
> > > > > significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
> > > > > drop).
> > > > >
> > > > > Bisecting within the sheaves series is blocked by a kernel panic at
> > > > > 17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
> > > > > paths"), so the exact first bad commit could not be identified.
> > > >
> > > > Ouch. Why did it crash?
> > >
> > > [ 16.162422] Oops: general protection fault, probably for non-canonical address 0xdead000000000110: 0000 [#1] SMP NOPTI
> > > [ 16.162426] CPU: 44 UID: 0 PID: 908 Comm: (udev-worker) Not tainted 6.19.0-rc5_master+ #116 PREEMPT(lazy)
> > > [ 16.162429] Hardware name: Giga Computing MZ73-LM2-000/MZ73-LM2-000, BIOS R19_F40 05/12/2025
> > > [ 16.162430] RIP: 0010:__put_partials+0x2f/0x140
> > > [ 16.162437] Code: 41 57 41 56 49 89 f6 41 55 49 89 fd 31 ff 41 54 45 31 e4 55 53 48 83 ec 18 48 c7 44 24 10 00 00 00 00 eb 03 48 89 df 4c9
> > > [ 16.162438] RSP: 0018:ff5117c0ca2dfa60 EFLAGS: 00010086
> > > [ 16.162441] RAX: 0000000000000001 RBX: ff1b266981200d80 RCX: 0000000000000246
> > > [ 16.162442] RDX: ff1b266981200d90 RSI: ff1b266981200d90 RDI: ff1b266981200d80
> > > [ 16.162442] RBP: dead000000000100 R08: 0000000000000000 R09: ffffffffa761bf5e
> > > [ 16.162443] R10: ffb6d4b7841d5400 R11: ff1b2669800575c0 R12: 0000000000000000
> > > [ 16.162444] R13: ff1b2669800575c0 R14: dead000000000100 R15: ffb6d4b7846be410
> > > [ 16.162445] FS: 00007f5fdccc23c0(0000) GS:ff1b267902427000(0000) knlGS:0000000000000000
> > > [ 16.162446] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [ 16.162446] CR2: 0000559824c6c058 CR3: 000000011fb49001 CR4: 0000000000f71ef0
> > > [ 16.162447] PKRU: 55555554
> > > [ 16.162448] Call Trace:
> > > [ 16.162450] <TASK>
> > > [ 16.162452] kmem_cache_free+0x410/0x490
> > > [ 16.162454] do_readlinkat+0x14e/0x180
> > > [ 16.162459] __x64_sys_readlinkat+0x1c/0x30
> > > [ 16.162461] do_syscall_64+0x7e/0x6b0
> > > [ 16.162465] ? post_alloc_hook+0xb9/0x140
> > > [ 16.162468] ? get_page_from_freelist+0x478/0x720
> > > [ 16.162470] ? path_openat+0xb3/0x2a0
> > > [ 16.162472] ? __alloc_frozen_pages_noprof+0x192/0x350
> > > [ 16.162474] ? count_memcg_events+0xd6/0x210
> > > [ 16.162476] ? memcg1_commit_charge+0x7a/0xa0
> > > [ 16.162479] ? mod_memcg_lruvec_state+0xe7/0x2d0
> > > [ 16.162481] ? charge_memcg+0x48/0x80
> > > [ 16.162482] ? lruvec_stat_mod_folio+0x85/0xd0
> > > [ 16.162484] ? __folio_mod_stat+0x2d/0x90
> > > [ 16.162487] ? set_ptes.isra.0+0x36/0x80
> > > [ 16.162490] ? do_anonymous_page+0x100/0x4a0
> > > [ 16.162492] ? __handle_mm_fault+0x45d/0x6f0
> > > [ 16.162493] ? count_memcg_events+0xd6/0x210
> > > [ 16.162494] ? handle_mm_fault+0x212/0x340
> > > [ 16.162495] ? do_user_addr_fault+0x2b4/0x7b0
> > > [ 16.162500] ? irqentry_exit+0x6d/0x540
> > > [ 16.162502] ? exc_page_fault+0x7e/0x1a0
> > > [ 16.162503] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >
> > For this problem, I have a hypothesis which is inspired by a comment in the
> > patch "slab: remove cpu (partial) slabs usage from allocation paths":
> >
> > /*
> > * get a single object from the slab. This might race against __slab_free(),
> > * which however has to take the list_lock if it's about to make the slab fully
> > * free.
> > */
> >
> > My understanding is that this comment is pointing out a possible race between
> > __slab_free() and get_from_partial_node(). Since __slab_free() takes
> > n->list_lock when it is about to make the slab fully free, and
> > get_from_partial_node() also takes the same lock, the two paths should be
> > mutually excluded by the lock and thus safe.
> >
> > However, I'm wondering if there could be another race window. Suppose CPU0's
> > get_from_partial_node() has already finished __slab_update_freelist(), but has
> > not yet reached remove_partial(). In that gap, another CPU1 could free an object
> > to the same slab via __slab_free(). CPU1 would observe was_full == 1 (due to the
> > previous get_from_partial_node()->__slab_update_freelist() on CPU0), and then
> >
> > __slab_free() will call put_cpu_partial(s, slab, 1) without holding
> > n->list_lock, trying to add this slab to the CPU partial list.
>
> If CPU1 observes was_full == 1, it should spin on n->list_lock and wait
> for CPU0 to release the lock. And CPU0 will remove the slab from the
> partial list before releasing the lock. Or am I missing something?
>
> > In that case,
> > both paths would operate on the same union field in struct slab, which might
> > lead to list corruption.
>
> Not sure how the scenario you describe could happen:
>
> CPU 0                                CPU1
> - get_from_partial_node()
>   -> spin_lock(&n->list_lock)
>                                      - __slab_free()
>   -> __slab_update_freelist(),
>      slab becomes full
>                                        -> was_full == 1
>                                        -> spin_lock(&n->list_lock)
In __slab_free, if was_full == 1, then the condition
!(IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) && was_full) becomes false, so it won't
enter the "if" block and therefore n->list_lock is not acquired.
Does that sound right?
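The branch structure being discussed can be modeled outside the kernel as a
small sketch (illustrative C, not the actual slub.c code; here
`slub_cpu_partial_enabled` stands in for IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL)
and `slab_becomes_free` for the "slab would become fully free" checks on the
real path):

```c
#include <stdbool.h>

/*
 * Simplified model (not actual slub.c code) of the __slab_free()
 * branch discussed above: when was_full is set and CPU partial lists
 * are enabled, the condition guarding the locked block is false, so
 * n->list_lock is never taken on this path.
 */
static bool slub_cpu_partial_enabled = true;

static bool takes_list_lock(bool was_full, bool slab_becomes_free)
{
	/* the locked slow branch is only entered when the guard holds */
	if (!(slub_cpu_partial_enabled && was_full) && slab_becomes_free)
		return true;	/* must serialize against the partial list */
	/* was_full == 1: put_cpu_partial() runs without n->list_lock */
	return false;
}
```

So a freer that observes was_full never contends on the lock, which is
exactly the window exploited by the race below.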
--
Thanks,
Hao
>                                        // starts spinning
>   -> if (!new.freelist)
>     -> remove_partial()
>   -> spin_unlock()
>                                        -> spin_lock(&n->list_lock)
>                                        // acquired!
>                                        -> slab_update_freelist()
>                                        -> spin_unlock(&n->list_lock)
>
> --
> Cheers,
> Harry / Hyeonggon
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-25 7:06 ` Hao Li
@ 2026-02-25 7:19 ` Harry Yoo
2026-02-25 8:19 ` Hao Li
2026-02-25 8:21 ` Harry Yoo
0 siblings, 2 replies; 18+ messages in thread
From: Harry Yoo @ 2026-02-25 7:19 UTC (permalink / raw)
To: Hao Li
Cc: Ming Lei, Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, surenb
On Wed, Feb 25, 2026 at 03:06:46PM +0800, Hao Li wrote:
> On Wed, Feb 25, 2026 at 03:54:06PM +0900, Harry Yoo wrote:
> > On Wed, Feb 25, 2026 at 01:32:36PM +0800, Hao Li wrote:
> > > On Tue, Feb 24, 2026 at 05:07:18PM +0800, Ming Lei wrote:
> > > > Hi Harry,
> > > >
> > > > On Tue, Feb 24, 2026 at 02:00:15PM +0900, Harry Yoo wrote:
> > > > > On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> > > > > > Hello Vlastimil and MM guys,
> > > > >
> > > > > Hi Ming, thanks for the report!
> > > > >
> > > > > > The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
> > > > > > 'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
> > > > > > performance regression for workloads with persistent cross-CPU
> > > > > > alloc/free patterns. ublk null target benchmark IOPS drops
> > > > > > significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
> > > > > > drop).
> > > > > >
> > > > > > Bisecting within the sheaves series is blocked by a kernel panic at
> > > > > > 17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
> > > > > > paths"), so the exact first bad commit could not be identified.
> > > > >
> > > > > Ouch. Why did it crash?
> > > >
> > > > [ ... oops and call trace snipped, quoted in full earlier in the thread ... ]
> > >
> > > For this problem, I have a hypothesis which is inspired by a comment in the
> > > patch "slab: remove cpu (partial) slabs usage from allocation paths":
> > >
> > > /*
> > > * get a single object from the slab. This might race against __slab_free(),
> > > * which however has to take the list_lock if it's about to make the slab fully
> > > * free.
> > > */
> > >
> > > My understanding is that this comment is pointing out a possible race between
> > > __slab_free() and get_from_partial_node(). Since __slab_free() takes
> > > n->list_lock when it is about to make the slab fully free, and
> > > get_from_partial_node() also takes the same lock, the two paths should be
> > > mutually excluded by the lock and thus safe.
> > >
> > > However, I'm wondering if there could be another race window. Suppose CPU0's
> > > get_from_partial_node() has already finished __slab_update_freelist(), but has
> > > not yet reached remove_partial(). In that gap, another CPU1 could free an object
> > > to the same slab via __slab_free(). CPU1 would observe was_full == 1 (due to the
> > > previous get_from_partial_node()->__slab_update_freelist() on CPU0), and then
> > >
> > > __slab_free() will call put_cpu_partial(s, slab, 1) without holding
> > > n->list_lock, trying to add this slab to the CPU partial list.
> >
> > If CPU1 observes was_full == 1, it should spin on n->list_lock and wait
> > for CPU0 to release the lock. And CPU0 will remove the slab from the
> > partial list before releasing the lock. Or am I missing something?
> >
> > > In that case,
> > > both paths would operate on the same union field in struct slab, which might
> > > lead to list corruption.
> >
> > Not sure how the scenario you describe could happen:
> >
> > CPU 0 CPU1
> > - get_from_partial_node()
> > -> spin_lock(&n->list_lock)
> > - __slab_free()
> > -> __slab_update_freelist(),
> > slab becomes full
> > -> was_full == 1
> > -> spin_lock(&n->list_lock)
>
> In __slab_free, if was_full == 1, then the condition
> !(IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) && was_full) becomes false, so it won't
> enter the "if" block and therefore n->list_lock is not acquired.
> Does that sound right?
Nah, you're right. Just slipped my mind. No need to acquire the lock
if it was full, because that means it's not on the partial list.
Hmm... but the logic has been there for a very long time.
Looks like we broke a premise for the percpu slab caching layer
to work correctly, while transitioning to sheaves.
I think the new behavior introduced during the sheaves transition is that
SLUB can now allocate objects from slabs without freezing them. Allocating
objects from a slab without freezing it seems to confuse the free path...
But not sure if we could "fix" that because the percpu partial slab
caching layer is gone anyway :)
--
Cheers,
Harry / Hyeonggon
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-25 7:19 ` Harry Yoo
@ 2026-02-25 8:19 ` Hao Li
2026-02-25 8:41 ` Harry Yoo
2026-02-25 8:21 ` Harry Yoo
1 sibling, 1 reply; 18+ messages in thread
From: Hao Li @ 2026-02-25 8:19 UTC (permalink / raw)
To: Harry Yoo
Cc: Ming Lei, Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, surenb
On Wed, Feb 25, 2026 at 04:19:41PM +0900, Harry Yoo wrote:
> On Wed, Feb 25, 2026 at 03:06:46PM +0800, Hao Li wrote:
> > On Wed, Feb 25, 2026 at 03:54:06PM +0900, Harry Yoo wrote:
> > > On Wed, Feb 25, 2026 at 01:32:36PM +0800, Hao Li wrote:
> > > > On Tue, Feb 24, 2026 at 05:07:18PM +0800, Ming Lei wrote:
> > > > > Hi Harry,
> > > > >
> > > > > On Tue, Feb 24, 2026 at 02:00:15PM +0900, Harry Yoo wrote:
> > > > > > On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> > > > > > > Hello Vlastimil and MM guys,
> > > > > >
> > > > > > Hi Ming, thanks for the report!
> > > > > >
> > > > > > > The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
> > > > > > > 'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
> > > > > > > performance regression for workloads with persistent cross-CPU
> > > > > > > alloc/free patterns. ublk null target benchmark IOPS drops
> > > > > > > significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
> > > > > > > drop).
> > > > > > >
> > > > > > > Bisecting within the sheaves series is blocked by a kernel panic at
> > > > > > > 17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
> > > > > > > paths"), so the exact first bad commit could not be identified.
> > > > > >
> > > > > > Ouch. Why did it crash?
> > > > >
> > > > > [ ... oops and call trace snipped, quoted in full earlier in the thread ... ]
> > > >
> > > > For this problem, I have a hypothesis which is inspired by a comment in the
> > > > patch "slab: remove cpu (partial) slabs usage from allocation paths":
> > > >
> > > > /*
> > > > * get a single object from the slab. This might race against __slab_free(),
> > > > * which however has to take the list_lock if it's about to make the slab fully
> > > > * free.
> > > > */
> > > >
> > > > My understanding is that this comment is pointing out a possible race between
> > > > __slab_free() and get_from_partial_node(). Since __slab_free() takes
> > > > n->list_lock when it is about to make the slab fully free, and
> > > > get_from_partial_node() also takes the same lock, the two paths should be
> > > > mutually excluded by the lock and thus safe.
> > > >
> > > > However, I'm wondering if there could be another race window. Suppose CPU0's
> > > > get_from_partial_node() has already finished __slab_update_freelist(), but has
> > > > not yet reached remove_partial(). In that gap, another CPU1 could free an object
> > > > to the same slab via __slab_free(). CPU1 would observe was_full == 1 (due to the
> > > > previous get_from_partial_node()->__slab_update_freelist() on CPU0), and then
> > > >
> > > > __slab_free() will call put_cpu_partial(s, slab, 1) without holding
> > > > n->list_lock, trying to add this slab to the CPU partial list.
> > >
> > > If CPU1 observes was_full == 1, it should spin on n->list_lock and wait
> > > for CPU0 to release the lock. And CPU0 will remove the slab from the
> > > partial list before releasing the lock. Or am I missing something?
> > >
> > > > In that case,
> > > > both paths would operate on the same union field in struct slab, which might
> > > > lead to list corruption.
> > >
> > > Not sure how the scenario you describe could happen:
> > >
> > > CPU 0 CPU1
> > > - get_from_partial_node()
> > > -> spin_lock(&n->list_lock)
> > > - __slab_free()
> > > -> __slab_update_freelist(),
> > > slab becomes full
> > > -> was_full == 1
> > > -> spin_lock(&n->list_lock)
> >
> > In __slab_free, if was_full == 1, then the condition
> > !(IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) && was_full) becomes false, so it won't
> > enter the "if" block and therefore n->list_lock is not acquired.
> > Does that sound right?
>
> Nah, you're right. Just slipped my mind. No need to acquire the lock
> if it was full, because that means it's not on the partial list.
Exactly.
>
> Hmm... but the logic has been there for very long time.
Yes.
>
> Looks like we broke a premise for the percpu slab caching layer
> to work correctly, while transitioning to sheaves.
>
> I think the new behavior introduced during the sheaves transition is that
> SLUB can now allocate objects from slabs without freezing it. Allocating
> objects from slab without freezing it seems to confuse the free path...
I feel it's not a big issue.
I think the root cause of this issue is as follows:
Before this commit, get_partial_node would first remove the slab from the node
list and then return the slab to the upper layer for freezing and object
allocation. Therefore, when __slab_free encounters a slab marked as was_full,
that slab would no longer be on the node list, avoiding race conditions with
list operations.
However, after this commit, get_from_partial_node first allocates an object
from the slab and then removes the slab from the node list. In the window
between these two steps, __slab_free might encounter a slab marked as
was_full and try to add it to the CPU partial list, while at the same time
another CPU is removing the same slab from the node list, leading to a race
condition.
>
> But not sure if we could "fix" that because the percpu partial slab
> caching layer is gone anyway :)
Yes, this bug has already disappeared with subsequent patches...
By the way, to allow Ming Lei to continue the bisect process, maybe we should
come up with a temporary workaround, such as:
	} else if (IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) && was_full) {
		spin_lock_irqsave(&n->list_lock, flags);
		/*
		 * Let this empty critical section push back put_cpu_partial(),
		 * ensuring its execution happens after the critical section of
		 * get_from_partial_node() running in parallel.
		 */
		spin_unlock_irqrestore(&n->list_lock, flags);
		/*
		 * If we started with a full slab then put it onto the
		 * per cpu partial list.
		 */
		put_cpu_partial(s, slab, 1);
		stat(s, CPU_PARTIAL_FREE);
	}
--
Thanks,
Hao
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-25 7:19 ` Harry Yoo
2026-02-25 8:19 ` Hao Li
@ 2026-02-25 8:21 ` Harry Yoo
1 sibling, 0 replies; 18+ messages in thread
From: Harry Yoo @ 2026-02-25 8:21 UTC (permalink / raw)
To: Hao Li
Cc: Ming Lei, Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, surenb
On Wed, Feb 25, 2026 at 04:19:41PM +0900, Harry Yoo wrote:
> On Wed, Feb 25, 2026 at 03:06:46PM +0800, Hao Li wrote:
> > On Wed, Feb 25, 2026 at 03:54:06PM +0900, Harry Yoo wrote:
> > > On Wed, Feb 25, 2026 at 01:32:36PM +0800, Hao Li wrote:
> > > > On Tue, Feb 24, 2026 at 05:07:18PM +0800, Ming Lei wrote:
> > > > > [ ... oops and call trace snipped, quoted in full earlier in the thread ... ]
> > > >
> > > > For this problem, I have a hypothesis which is inspired by a comment in the
> > > > patch "slab: remove cpu (partial) slabs usage from allocation paths":
> > > >
> > > > /*
> > > > * get a single object from the slab. This might race against __slab_free(),
> > > > * which however has to take the list_lock if it's about to make the slab fully
> > > > * free.
> > > > */
> > > >
> > > > My understanding is that this comment is pointing out a possible race between
> > > > __slab_free() and get_from_partial_node(). Since __slab_free() takes
> > > > n->list_lock when it is about to make the slab fully free, and
> > > > get_from_partial_node() also takes the same lock, the two paths should be
> > > > mutually excluded by the lock and thus safe.
> > > >
> > > > However, I'm wondering if there could be another race window. Suppose CPU0's
> > > > get_from_partial_node() has already finished __slab_update_freelist(), but has
> > > > not yet reached remove_partial(). In that gap, another CPU1 could free an object
> > > > to the same slab via __slab_free(). CPU1 would observe was_full == 1 (due to the
> > > > previous get_from_partial_node()->__slab_update_freelist() on CPU0), and then
> > > >
> > > > __slab_free() will call put_cpu_partial(s, slab, 1) without holding
> > > > n->list_lock, trying to add this slab to the CPU partial list.
> > >
> > > If CPU1 observes was_full == 1, it should spin on n->list_lock and wait
> > > for CPU0 to release the lock. And CPU0 will remove the slab from the
> > > partial list before releasing the lock. Or am I missing something?
> > >
> > > > In that case,
> > > > both paths would operate on the same union field in struct slab, which might
> > > > lead to list corruption.
> > >
> > > Not sure how the scenario you describe could happen:
> > >
> > > CPU 0 CPU1
> > > - get_from_partial_node()
> > > -> spin_lock(&n->list_lock)
> > > - __slab_free()
> > > -> __slab_update_freelist(),
> > > slab becomes full
> > > -> was_full == 1
> > > -> spin_lock(&n->list_lock)
> >
> > In __slab_free, if was_full == 1, then the condition
> > !(IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) && was_full) becomes false, so it won't
> > enter the "if" block and therefore n->list_lock is not acquired.
> > Does that sound right?
>
> Nah, you're right. Just slipped my mind. No need to acquire the lock
> if it was full, because that means it's not on the partial list.
"because it's not on the partial list, and SLUB is going to add it
to the percpu partial slab list (to avoid acquiring the lock)"
> Hmm... but the logic has been there for very long time.
>
> Looks like we broke a premise for the percpu slab caching layer
> to work correctly, while transitioning to sheaves.
>
> I think the new behavior introduced during the sheaves transition is that
> SLUB can now allocate objects from slabs without freezing it. Allocating
> objects from slab without freezing it seems to confuse the free path...
Just elaborating the analysis a bit:
Hao Li (thankfully!) analyzed that there's a race between 1) the alloc path
removing a slab from the partial list when it transitions from partial to
full, and 2) the free path adding the slab to the percpu partial slab list
when it transitions from full to partial.
The following race could occur:
CPU 0                                  CPU1
- get_from_partial_node()
  -> spin_lock(&n->list_lock)
                                       - __slab_free()
  -> __slab_update_freelist()
     // slab becomes full
                                         -> was_full == 1,
                                            no lock acquired
                                         -> slab_update_freelist()
                                         -> if (was_frozen) // not frozen!
                                         -> else if (was_full)
                                            -> put_cpu_partial(slab)
                                               // add the slab to percpu
                                               // partial slabs
  -> if (!new.freelist)
     -> remove_partial(slab)
                                       // CPU1's percpu partial slab list
                                       // is now corrupted
And later when CPU1 calls __put_partials(), it crashes while
iterating over the percpu partial slab list.
The race condition did not exist before sheaves, because 1) slabs were
not on the partial list while the alloc path allocated objects, and
2) the alloc path froze them before allocating objects. When a slab is
frozen, the free path doesn't call put_cpu_partial().
Commit 17c38c88294d ("slab: remove cpu (partial) slabs usage from
allocation paths") changed both 1) and 2) and introduced the race
described above. Now, 1) slabs are on the partial list while the alloc
path allocates objects, and 2) the alloc path does not freeze slabs.
Because the alloc path does not freeze slabs, the free path assumes it
can always safely add slabs to the percpu partial slab list, but that
is now racy because there's a window between when a slab becomes full
and when it is removed from the partial list.
This should have been fixed by the later removal of the cpu partial
slabs layer from the free path, though.
--
Cheers,
Harry / Hyeonggon
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-25 8:19 ` Hao Li
@ 2026-02-25 8:41 ` Harry Yoo
2026-02-25 8:54 ` Hao Li
0 siblings, 1 reply; 18+ messages in thread
From: Harry Yoo @ 2026-02-25 8:41 UTC (permalink / raw)
To: Hao Li
Cc: Ming Lei, Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, surenb
On Wed, Feb 25, 2026 at 04:19:49PM +0800, Hao Li wrote:
> On Wed, Feb 25, 2026 at 04:19:41PM +0900, Harry Yoo wrote:
> > On Wed, Feb 25, 2026 at 03:06:46PM +0800, Hao Li wrote:
> > > On Wed, Feb 25, 2026 at 03:54:06PM +0900, Harry Yoo wrote:
> > > > On Wed, Feb 25, 2026 at 01:32:36PM +0800, Hao Li wrote:
> > > > > On Tue, Feb 24, 2026 at 05:07:18PM +0800, Ming Lei wrote:
> > > > > > Hi Harry,
> > > > > >
> > > > > > On Tue, Feb 24, 2026 at 02:00:15PM +0900, Harry Yoo wrote:
> > > > > > > On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> > > > > > > > Hello Vlastimil and MM guys,
> > > > > > >
> > > > > > > Hi Ming, thanks for the report!
> > > > > > >
> > > > > > > > The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
> > > > > > > > 'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
> > > > > > > > performance regression for workloads with persistent cross-CPU
> > > > > > > > alloc/free patterns. ublk null target benchmark IOPS drops
> > > > > > > > significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
> > > > > > > > drop).
> > > > > > > >
> > > > > > > > Bisecting within the sheaves series is blocked by a kernel panic at
> > > > > > > > 17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
> > > > > > > > paths"), so the exact first bad commit could not be identified.
> > > > > > >
> > > > > > > Ouch. Why did it crash?
> > > > > >
> > > > > > [ 16.162422] Oops: general protection fault, probably for non-canonical address 0xdead000000000110: 0000 [#1] SMP NOPTI
> > > > > > [ 16.162426] CPU: 44 UID: 0 PID: 908 Comm: (udev-worker) Not tainted 6.19.0-rc5_master+ #116 PREEMPT(lazy)
> > > > > > [ 16.162429] Hardware name: Giga Computing MZ73-LM2-000/MZ73-LM2-000, BIOS R19_F40 05/12/2025
> > > > > > [ 16.162430] RIP: 0010:__put_partials+0x2f/0x140
> > > > > > [ 16.162437] Code: 41 57 41 56 49 89 f6 41 55 49 89 fd 31 ff 41 54 45 31 e4 55 53 48 83 ec 18 48 c7 44 24 10 00 00 00 00 eb 03 48 89 df 4c9
> > > > > > [ 16.162438] RSP: 0018:ff5117c0ca2dfa60 EFLAGS: 00010086
> > > > > > [ 16.162441] RAX: 0000000000000001 RBX: ff1b266981200d80 RCX: 0000000000000246
> > > > > > [ 16.162442] RDX: ff1b266981200d90 RSI: ff1b266981200d90 RDI: ff1b266981200d80
> > > > > > [ 16.162442] RBP: dead000000000100 R08: 0000000000000000 R09: ffffffffa761bf5e
> > > > > > [ 16.162443] R10: ffb6d4b7841d5400 R11: ff1b2669800575c0 R12: 0000000000000000
> > > > > > [ 16.162444] R13: ff1b2669800575c0 R14: dead000000000100 R15: ffb6d4b7846be410
> > > > > > [ 16.162445] FS: 00007f5fdccc23c0(0000) GS:ff1b267902427000(0000) knlGS:0000000000000000
> > > > > > [ 16.162446] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > > [ 16.162446] CR2: 0000559824c6c058 CR3: 000000011fb49001 CR4: 0000000000f71ef0
> > > > > > [ 16.162447] PKRU: 55555554
> > > > > > [ 16.162448] Call Trace:
> > > > > > [ 16.162450] <TASK>
> > > > > > [ 16.162452] kmem_cache_free+0x410/0x490
> > > > > > [ 16.162454] do_readlinkat+0x14e/0x180
> > > > > > [ 16.162459] __x64_sys_readlinkat+0x1c/0x30
> > > > > > [ 16.162461] do_syscall_64+0x7e/0x6b0
> > > > > > [ 16.162465] ? post_alloc_hook+0xb9/0x140
> > > > > > [ 16.162468] ? get_page_from_freelist+0x478/0x720
> > > > > > [ 16.162470] ? path_openat+0xb3/0x2a0
> > > > > > [ 16.162472] ? __alloc_frozen_pages_noprof+0x192/0x350
> > > > > > [ 16.162474] ? count_memcg_events+0xd6/0x210
> > > > > > [ 16.162476] ? memcg1_commit_charge+0x7a/0xa0
> > > > > > [ 16.162479] ? mod_memcg_lruvec_state+0xe7/0x2d0
> > > > > > [ 16.162481] ? charge_memcg+0x48/0x80
> > > > > > [ 16.162482] ? lruvec_stat_mod_folio+0x85/0xd0
> > > > > > [ 16.162484] ? __folio_mod_stat+0x2d/0x90
> > > > > > [ 16.162487] ? set_ptes.isra.0+0x36/0x80
> > > > > > [ 16.162490] ? do_anonymous_page+0x100/0x4a0
> > > > > > [ 16.162492] ? __handle_mm_fault+0x45d/0x6f0
> > > > > > [ 16.162493] ? count_memcg_events+0xd6/0x210
> > > > > > [ 16.162494] ? handle_mm_fault+0x212/0x340
> > > > > > [ 16.162495] ? do_user_addr_fault+0x2b4/0x7b0
> > > > > > [ 16.162500] ? irqentry_exit+0x6d/0x540
> > > > > > [ 16.162502] ? exc_page_fault+0x7e/0x1a0
> > > > > > [ 16.162503] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > > >
> > > > > For this problem, I have a hypothesis which is inspired by a comment in the
> > > > > patch "slab: remove cpu (partial) slabs usage from allocation paths":
> > > > >
> > > > > /*
> > > > > * get a single object from the slab. This might race against __slab_free(),
> > > > > * which however has to take the list_lock if it's about to make the slab fully
> > > > > * free.
> > > > > */
> > > > >
> > > > > My understanding is that this comment is pointing out a possible race between
> > > > > __slab_free() and get_from_partial_node(). Since __slab_free() takes
> > > > > n->list_lock when it is about to make the slab fully free, and
> > > > > get_from_partial_node() also takes the same lock, the two paths should be
> > > > > mutually excluded by the lock and thus safe.
> > > > >
> > > > > However, I'm wondering if there could be another race window. Suppose CPU0's
> > > > > get_from_partial_node() has already finished __slab_update_freelist(), but has
> > > > > not yet reached remove_partial(). In that gap, another CPU1 could free an object
> > > > > to the same slab via __slab_free(). CPU1 would observe was_full == 1 (due to the
> > > > > previous get_from_partial_node()->__slab_update_freelist() on CPU0), and then
> > > > >
> > > > > __slab_free() will call put_cpu_partial(s, slab, 1) without holding
> > > > > n->list_lock, trying to add this slab to the CPU partial list.
> > > >
> > > > If CPU1 observes was_full == 1, it should spin on n->list_lock and wait
> > > > for CPU0 to release the lock. And CPU0 will remove the slab from the
> > > > partial list before releasing the lock. Or am I missing something?
> > > >
> > > > > In that case,
> > > > > both paths would operate on the same union field in struct slab, which might
> > > > > lead to list corruption.
> > > >
> > > > Not sure how the scenario you describe could happen:
> > > >
> > > > CPU 0 CPU1
> > > > - get_from_partial_node()
> > > > -> spin_lock(&n->list_lock)
> > > > - __slab_free()
> > > > -> __slab_update_freelist(),
> > > > slab becomes full
> > > > -> was_full == 1
> > > > -> spin_lock(&n->list_lock)
> > >
> > > In __slab_free, if was_full == 1, then the condition
> > > !(IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) && was_full) becomes false, so it won't
> > > enter the "if" block and therefore n->list_lock is not acquired.
> > > Does that sound right.
> >
> > Nah, you're right. Just slipped my mind. No need to acquire the lock
> > if it was full, because that means it's not on the partial list.
>
> Exactly.
>
> >
> > Hmm... but the logic has been there for very long time.
>
> Yes.
>
> >
> > Looks like we broke a premise for the percpu slab caching layer
> > to work correctly, while transitioning to sheaves.
> >
> > I think the new behavior introduced during the sheaves transition is that
> > SLUB can now allocate objects from slabs without freezing it. Allocating
> > objects from slab without freezing it seems to confuse the free path...
>
> I feel it's not a big issue.
>
> I think the root cause of this issue is as follows:
>
> Before this commit, get_partial_node would first remove the slab from the node
> list and then return the slab to the upper layer for freezing and object
> allocation. Therefore, when __slab_free encounters a slab marked as was_full,
> that slab would no longer be on the node list, avoiding race conditions with
> list operations.
Right, that's an important point. Just realized that while elaborating
the analysis :), there was a race condition between you and I!
> However, after this commit, get_from_partial_node first allocates an object
> from the slab and then removes the slab from the node list.
Right.
> During the
> interval between these two steps, __slab_free might encounter a slab marked as
> was_full and then it want to add the slab to the CPU partial list,
Right.
> while at the same time, another process is trying to remove the same slab
> from the node list, leading to a race condition.
Exactly.
> > But not sure if we could "fix" that because the percpu partial slab
> > caching layer is gone anyway :)
>
> Yes, this bug has already disappeared with subsequent patches...
>
> By the way, to allow Ming Lei to continue the bisect process, maybe we should
> come up with a temporary workaround, such as:
>
> } else if (IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) && was_full) {
> spin_lock_irqsave(&n->list_lock, flags);
> /*
> * Let this empty critical section push back put_cpu_partial, ensuring
> * its execution happens after the critical section of
> * get_from_partial_node running in parallel.
> */
> spin_unlock_irqrestore(&n->list_lock, flags);
> /*
> * If we started with a full slab then put it onto the
> * per cpu partial list.
> */
> put_cpu_partial(s, slab, 1);
> stat(s, CPU_PARTIAL_FREE);
> }
Hmm but if that affects the performance (by always acquiring
n->list_lock), the result is probably not valid anyway.
I'd rather bet that Vlastimil's analysis is correct :)
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-24 20:27 ` Vlastimil Babka
2026-02-25 5:24 ` Harry Yoo
@ 2026-02-25 8:45 ` Vlastimil Babka (SUSE)
2026-02-25 9:31 ` Ming Lei
1 sibling, 1 reply; 18+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-02-25 8:45 UTC (permalink / raw)
To: Vlastimil Babka, Ming Lei, Andrew Morton
Cc: linux-mm, linux-kernel, linux-block, Harry Yoo, Hao Li,
Christoph Hellwig
On 2/24/26 21:27, Vlastimil Babka wrote:
>
> It made sense to me not to refill sheaves when we can't reclaim, but I
> didn't anticipate this interaction with mempools. We could change them
> but there might be others using a similar pattern. Maybe it would be for
> the best to just drop that heuristic from __pcs_replace_empty_main()
> (but carefully as some deadlock avoidance depends on it, we might need
> to e.g. replace it with gfpflags_allow_spinning()). I'll send a patch
> tomorrow to test this theory, unless someone beats me to it (feel free to).
Could you try this then, please? Thanks!
----8<----
From b04dad02eb72feb1736241518dd4d3dd64aadc0e Mon Sep 17 00:00:00 2001
From: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
Date: Wed, 25 Feb 2026 09:40:22 +0100
Subject: [PATCH] mm/slab: allow sheaf refill if blocking is not allowed
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
mm/slub.c | 21 +++++++++------------
1 file changed, 9 insertions(+), 12 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 862642c165ed..258307270442 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4526,7 +4526,7 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
struct slab_sheaf *empty = NULL;
struct slab_sheaf *full;
struct node_barn *barn;
- bool can_alloc;
+ bool allow_spin;
lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
@@ -4547,8 +4547,9 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
return NULL;
}
- full = barn_replace_empty_sheaf(barn, pcs->main,
- gfpflags_allow_spinning(gfp));
+ allow_spin = gfpflags_allow_spinning(gfp);
+
+ full = barn_replace_empty_sheaf(barn, pcs->main, allow_spin);
if (full) {
stat(s, BARN_GET);
@@ -4558,9 +4559,7 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
stat(s, BARN_GET_FAIL);
- can_alloc = gfpflags_allow_blocking(gfp);
-
- if (can_alloc) {
+ if (allow_spin) {
if (pcs->spare) {
empty = pcs->spare;
pcs->spare = NULL;
@@ -4571,7 +4570,7 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
local_unlock(&s->cpu_sheaves->lock);
- if (!can_alloc)
+ if (!allow_spin)
return NULL;
if (empty) {
@@ -4591,11 +4590,8 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
if (!full)
return NULL;
- /*
- * we can reach here only when gfpflags_allow_blocking
- * so this must not be an irq
- */
- local_lock(&s->cpu_sheaves->lock);
+ if (!local_trylock(&s->cpu_sheaves->lock))
+ goto barn_put;
pcs = this_cpu_ptr(s->cpu_sheaves);
/*
@@ -4626,6 +4622,7 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
return pcs;
}
+barn_put:
barn_put_full_sheaf(barn, full);
stat(s, BARN_PUT);
--
2.53.0
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-25 8:41 ` Harry Yoo
@ 2026-02-25 8:54 ` Hao Li
0 siblings, 0 replies; 18+ messages in thread
From: Hao Li @ 2026-02-25 8:54 UTC (permalink / raw)
To: Harry Yoo
Cc: Ming Lei, Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, surenb
On Wed, Feb 25, 2026 at 05:41:15PM +0900, Harry Yoo wrote:
> On Wed, Feb 25, 2026 at 04:19:49PM +0800, Hao Li wrote:
> > On Wed, Feb 25, 2026 at 04:19:41PM +0900, Harry Yoo wrote:
> > > On Wed, Feb 25, 2026 at 03:06:46PM +0800, Hao Li wrote:
> > > > On Wed, Feb 25, 2026 at 03:54:06PM +0900, Harry Yoo wrote:
> > > > > On Wed, Feb 25, 2026 at 01:32:36PM +0800, Hao Li wrote:
> > > > > > On Tue, Feb 24, 2026 at 05:07:18PM +0800, Ming Lei wrote:
> > > > > > > Hi Harry,
> > > > > > >
> > > > > > > On Tue, Feb 24, 2026 at 02:00:15PM +0900, Harry Yoo wrote:
> > > > > > > > On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> > > > > > > > > Hello Vlastimil and MM guys,
> > > > > > > >
> > > > > > > > Hi Ming, thanks for the report!
> > > > > > > >
> > > > > > > > > The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
> > > > > > > > > 'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
> > > > > > > > > performance regression for workloads with persistent cross-CPU
> > > > > > > > > alloc/free patterns. ublk null target benchmark IOPS drops
> > > > > > > > > significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
> > > > > > > > > drop).
> > > > > > > > >
> > > > > > > > > Bisecting within the sheaves series is blocked by a kernel panic at
> > > > > > > > > 17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
> > > > > > > > > paths"), so the exact first bad commit could not be identified.
> > > > > > > >
> > > > > > > > Ouch. Why did it crash?
> > > > > > >
> > > > > > > [ 16.162422] Oops: general protection fault, probably for non-canonical address 0xdead000000000110: 0000 [#1] SMP NOPTI
> > > > > > > [ 16.162426] CPU: 44 UID: 0 PID: 908 Comm: (udev-worker) Not tainted 6.19.0-rc5_master+ #116 PREEMPT(lazy)
> > > > > > > [ 16.162429] Hardware name: Giga Computing MZ73-LM2-000/MZ73-LM2-000, BIOS R19_F40 05/12/2025
> > > > > > > [ 16.162430] RIP: 0010:__put_partials+0x2f/0x140
> > > > > > > [ 16.162437] Code: 41 57 41 56 49 89 f6 41 55 49 89 fd 31 ff 41 54 45 31 e4 55 53 48 83 ec 18 48 c7 44 24 10 00 00 00 00 eb 03 48 89 df 4c9
> > > > > > > [ 16.162438] RSP: 0018:ff5117c0ca2dfa60 EFLAGS: 00010086
> > > > > > > [ 16.162441] RAX: 0000000000000001 RBX: ff1b266981200d80 RCX: 0000000000000246
> > > > > > > [ 16.162442] RDX: ff1b266981200d90 RSI: ff1b266981200d90 RDI: ff1b266981200d80
> > > > > > > [ 16.162442] RBP: dead000000000100 R08: 0000000000000000 R09: ffffffffa761bf5e
> > > > > > > [ 16.162443] R10: ffb6d4b7841d5400 R11: ff1b2669800575c0 R12: 0000000000000000
> > > > > > > [ 16.162444] R13: ff1b2669800575c0 R14: dead000000000100 R15: ffb6d4b7846be410
> > > > > > > [ 16.162445] FS: 00007f5fdccc23c0(0000) GS:ff1b267902427000(0000) knlGS:0000000000000000
> > > > > > > [ 16.162446] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > > > [ 16.162446] CR2: 0000559824c6c058 CR3: 000000011fb49001 CR4: 0000000000f71ef0
> > > > > > > [ 16.162447] PKRU: 55555554
> > > > > > > [ 16.162448] Call Trace:
> > > > > > > [ 16.162450] <TASK>
> > > > > > > [ 16.162452] kmem_cache_free+0x410/0x490
> > > > > > > [ 16.162454] do_readlinkat+0x14e/0x180
> > > > > > > [ 16.162459] __x64_sys_readlinkat+0x1c/0x30
> > > > > > > [ 16.162461] do_syscall_64+0x7e/0x6b0
> > > > > > > [ 16.162465] ? post_alloc_hook+0xb9/0x140
> > > > > > > [ 16.162468] ? get_page_from_freelist+0x478/0x720
> > > > > > > [ 16.162470] ? path_openat+0xb3/0x2a0
> > > > > > > [ 16.162472] ? __alloc_frozen_pages_noprof+0x192/0x350
> > > > > > > [ 16.162474] ? count_memcg_events+0xd6/0x210
> > > > > > > [ 16.162476] ? memcg1_commit_charge+0x7a/0xa0
> > > > > > > [ 16.162479] ? mod_memcg_lruvec_state+0xe7/0x2d0
> > > > > > > [ 16.162481] ? charge_memcg+0x48/0x80
> > > > > > > [ 16.162482] ? lruvec_stat_mod_folio+0x85/0xd0
> > > > > > > [ 16.162484] ? __folio_mod_stat+0x2d/0x90
> > > > > > > [ 16.162487] ? set_ptes.isra.0+0x36/0x80
> > > > > > > [ 16.162490] ? do_anonymous_page+0x100/0x4a0
> > > > > > > [ 16.162492] ? __handle_mm_fault+0x45d/0x6f0
> > > > > > > [ 16.162493] ? count_memcg_events+0xd6/0x210
> > > > > > > [ 16.162494] ? handle_mm_fault+0x212/0x340
> > > > > > > [ 16.162495] ? do_user_addr_fault+0x2b4/0x7b0
> > > > > > > [ 16.162500] ? irqentry_exit+0x6d/0x540
> > > > > > > [ 16.162502] ? exc_page_fault+0x7e/0x1a0
> > > > > > > [ 16.162503] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > > > >
> > > > > > For this problem, I have a hypothesis which is inspired by a comment in the
> > > > > > patch "slab: remove cpu (partial) slabs usage from allocation paths":
> > > > > >
> > > > > > /*
> > > > > > * get a single object from the slab. This might race against __slab_free(),
> > > > > > * which however has to take the list_lock if it's about to make the slab fully
> > > > > > * free.
> > > > > > */
> > > > > >
> > > > > > My understanding is that this comment is pointing out a possible race between
> > > > > > __slab_free() and get_from_partial_node(). Since __slab_free() takes
> > > > > > n->list_lock when it is about to make the slab fully free, and
> > > > > > get_from_partial_node() also takes the same lock, the two paths should be
> > > > > > mutually excluded by the lock and thus safe.
> > > > > >
> > > > > > However, I'm wondering if there could be another race window. Suppose CPU0's
> > > > > > get_from_partial_node() has already finished __slab_update_freelist(), but has
> > > > > > not yet reached remove_partial(). In that gap, another CPU1 could free an object
> > > > > > to the same slab via __slab_free(). CPU1 would observe was_full == 1 (due to the
> > > > > > previous get_from_partial_node()->__slab_update_freelist() on CPU0), and then
> > > > > >
> > > > > > __slab_free() will call put_cpu_partial(s, slab, 1) without holding
> > > > > > n->list_lock, trying to add this slab to the CPU partial list.
> > > > >
> > > > > If CPU1 observes was_full == 1, it should spin on n->list_lock and wait
> > > > > for CPU0 to release the lock. And CPU0 will remove the slab from the
> > > > > partial list before releasing the lock. Or am I missing something?
> > > > >
> > > > > > In that case,
> > > > > > both paths would operate on the same union field in struct slab, which might
> > > > > > lead to list corruption.
> > > > >
> > > > > Not sure how the scenario you describe could happen:
> > > > >
> > > > > CPU 0 CPU1
> > > > > - get_from_partial_node()
> > > > > -> spin_lock(&n->list_lock)
> > > > > - __slab_free()
> > > > > -> __slab_update_freelist(),
> > > > > slab becomes full
> > > > > -> was_full == 1
> > > > > -> spin_lock(&n->list_lock)
> > > >
> > > > In __slab_free, if was_full == 1, then the condition
> > > > !(IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) && was_full) becomes false, so it won't
> > > > enter the "if" block and therefore n->list_lock is not acquired.
> > > > Does that sound right.
> > >
> > > Nah, you're right. Just slipped my mind. No need to acquire the lock
> > > if it was full, because that means it's not on the partial list.
> >
> > Exactly.
> >
> > >
> > > Hmm... but the logic has been there for very long time.
> >
> > Yes.
> >
> > >
> > > Looks like we broke a premise for the percpu slab caching layer
> > > to work correctly, while transitioning to sheaves.
> > >
> > > I think the new behavior introduced during the sheaves transition is that
> > > SLUB can now allocate objects from slabs without freezing it. Allocating
> > > objects from slab without freezing it seems to confuse the free path...
> >
> > I feel it's not a big issue.
> >
> > I think the root cause of this issue is as follows:
> >
> > Before this commit, get_partial_node would first remove the slab from the node
> > list and then return the slab to the upper layer for freezing and object
> > allocation. Therefore, when __slab_free encounters a slab marked as was_full,
> > that slab would no longer be on the node list, avoiding race conditions with
> > list operations.
>
> Right, that's an important point. Just realized that while elaborating
> the analysis :), there was a race condition between you and I!
Haha, true race condition - we both sent emails within a minute :D
>
> > However, after this commit, get_from_partial_node first allocates an object
> > from the slab and then removes the slab from the node list.
>
> Right.
>
> > During the
> > interval between these two steps, __slab_free might encounter a slab marked as
> > was_full and then it want to add the slab to the CPU partial list,
>
> Right.
>
> > while at the same time, another process is trying to remove the same slab
> > from the node list, leading to a race condition.
>
> Exactly.
>
> > > But not sure if we could "fix" that because the percpu partial slab
> > > caching layer is gone anyway :)
> >
> > Yes, this bug has already disappeared with subsequent patches...
> >
> > By the way, to allow Ming Lei to continue the bisect process, maybe we should
> > come up with a temporary workaround, such as:
> >
> > } else if (IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) && was_full) {
> > spin_lock_irqsave(&n->list_lock, flags);
> > /*
> > * Let this empty critical section push back put_cpu_partial, ensuring
> > * its execution happens after the critical section of
> > * get_from_partial_node running in parallel.
> > */
> > spin_unlock_irqrestore(&n->list_lock, flags);
> > /*
> > * If we started with a full slab then put it onto the
> > * per cpu partial list.
> > */
> > put_cpu_partial(s, slab, 1);
> > stat(s, CPU_PARTIAL_FREE);
> > }
>
> Hmm but if that affects the performance (by always acquiring
> n->list_lock), the result is probably not valid anyway.
>
> I'd rather bet that Vlastimil's analysis is correct :)
Indeed.
Let's look forward to the test results for Vlastimil's patch!
--
Thanks,
Hao
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-25 8:45 ` Vlastimil Babka (SUSE)
@ 2026-02-25 9:31 ` Ming Lei
0 siblings, 0 replies; 18+ messages in thread
From: Ming Lei @ 2026-02-25 9:31 UTC (permalink / raw)
To: Vlastimil Babka (SUSE)
Cc: Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, Harry Yoo, Hao Li, Christoph Hellwig
Hi Vlastimil,
On Wed, Feb 25, 2026 at 09:45:03AM +0100, Vlastimil Babka (SUSE) wrote:
> On 2/24/26 21:27, Vlastimil Babka wrote:
> >
> > It made sense to me not to refill sheaves when we can't reclaim, but I
> > didn't anticipate this interaction with mempools. We could change them
> > but there might be others using a similar pattern. Maybe it would be for
> > the best to just drop that heuristic from __pcs_replace_empty_main()
> > (but carefully as some deadlock avoidance depends on it, we might need
> > to e.g. replace it with gfpflags_allow_spinning()). I'll send a patch
> > tomorrow to test this theory, unless someone beats me to it (feel free to).
> Could you try this then, please? Thanks!
Thanks for working on this issue!
Unfortunately the patch doesn't make a difference on IOPS in the perf test,
follows the collected perf profile on linus tree(basically 7.0-rc1 with your patch):
```
04cb971e2d28 (HEAD -> master) mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
a5a9cf3f020f mm: fix NULL NODE_DATA dereference for memoryless nodes on boot
7dff99b35460 (origin/master) Remove WARN_ALL_UNSEEDED_RANDOM kernel config option
551d44200152 default_gfp(): avoid using the "newfangled" __VA_OPT__ trick
6de23f81a5e0 (tag: v7.0-rc1) Linux 7.0-rc1
```
+ 49.03% 2.00% io_uring [kernel.kallsyms] [k] __blkdev_direct_IO_async
- 38.66% 1.16% io_uring [kernel.kallsyms] [k] bio_alloc_bioset
- 37.51% bio_alloc_bioset
- 34.98% mempool_alloc_noprof
- 34.87% kmem_cache_alloc_noprof
- 33.82% ___slab_alloc
- 30.25% get_from_any_partial
- 29.59% get_from_partial_node
- 28.42% __raw_spin_lock_irqsave
native_queued_spin_lock_slowpath
+ 2.16% allocate_slab
+ 0.60% alloc_from_new_slab
0.51% __pcs_replace_empty_main
1.58% bio_associate_blkg
+ 1.16% submitter_uring_fn
+ 35.16% 0.30% io_uring [kernel.kallsyms] [k] kmem_cache_alloc_noprof
+ 35.13% 0.12% io_uring [kernel.kallsyms] [k] mempool_alloc_noprof
Thanks,
Ming
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2026-02-25 9:31 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-24 2:52 [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation Ming Lei
2026-02-24 5:00 ` Harry Yoo
2026-02-24 9:07 ` Ming Lei
2026-02-25 5:32 ` Hao Li
2026-02-25 6:54 ` Harry Yoo
2026-02-25 7:06 ` Hao Li
2026-02-25 7:19 ` Harry Yoo
2026-02-25 8:19 ` Hao Li
2026-02-25 8:41 ` Harry Yoo
2026-02-25 8:54 ` Hao Li
2026-02-25 8:21 ` Harry Yoo
2026-02-24 6:51 ` Hao Li
2026-02-24 7:10 ` Harry Yoo
2026-02-24 7:41 ` Hao Li
2026-02-24 20:27 ` Vlastimil Babka
2026-02-25 5:24 ` Harry Yoo
2026-02-25 8:45 ` Vlastimil Babka (SUSE)
2026-02-25 9:31 ` Ming Lei
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox