* [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
@ 2026-02-24 2:52 Ming Lei
2026-02-24 5:00 ` Harry Yoo
` (2 more replies)
0 siblings, 3 replies; 25+ messages in thread
From: Ming Lei @ 2026-02-24 2:52 UTC (permalink / raw)
To: Vlastimil Babka, Andrew Morton
Cc: ming.lei, linux-mm, linux-kernel, linux-block
Hello Vlastimil and MM guys,
The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
performance regression for workloads with persistent cross-CPU
alloc/free patterns. ublk null target benchmark IOPS drops
significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
drop).
Bisecting within the sheaves series is blocked by a kernel panic at
17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
paths"), so the exact first bad commit could not be identified.
Reproducer
==========
Hardware: NUMA machine with >= 32 CPUs
Kernel: v7.0-rc (with slab/for-7.0/sheaves merged)
# build kublk selftest
make -C tools/testing/selftests/ublk/
# create ublk null target device with 16 queues
tools/testing/selftests/ublk/kublk add -t null -q 16
# run fio/t/io_uring benchmark: 16 jobs, 20 seconds, non-polled
taskset -c 0-31 fio/t/io_uring -p0 -n 16 -r 20 /dev/ublkb0
# cleanup
tools/testing/selftests/ublk/kublk del -n 0
Good: v6.19 (and 41f1a08645ab, the mainline parent of the slab merge)
Bad: 815c8e35511d (Merge branch 'slab/for-7.0/sheaves' into slab/for-next)
perf profile (bad kernel)
=========================
~47% of CPU time is spent in bio allocation hitting the SLUB slow path,
with massive spinlock contention on the node partial list lock:
+ 47.65% 1.21% io_uring [k] bio_alloc_bioset
- 44.87% 0.45% io_uring [k] kmem_cache_alloc_noprof
- 44.41% kmem_cache_alloc_noprof
- 43.89% ___slab_alloc
+ 41.16% get_from_any_partial
0.91% get_from_partial_node
+ 0.87% alloc_from_new_slab
+ 0.65% allocate_slab
- 44.70% 0.21% io_uring [k] mempool_alloc_noprof
- 44.49% mempool_alloc_noprof
- 44.43% kmem_cache_alloc_noprof
- 43.90% ___slab_alloc
+ 41.18% get_from_any_partial
0.90% get_from_partial_node
+ 0.87% alloc_from_new_slab
+ 0.65% allocate_slab
+ 41.23% 0.10% io_uring [k] get_from_any_partial
+ 40.82% 0.48% io_uring [k] __raw_spin_lock_irqsave
- 40.75% 0.20% io_uring [k] get_from_partial_node
- 40.56% get_from_partial_node
- 38.83% __raw_spin_lock_irqsave
38.65% native_queued_spin_lock_slowpath
Analysis
========
The ublk null target workload exposes a cross-CPU slab allocation
pattern: bios are allocated on the io_uring submitter CPU during block
layer submission, but freed on a different CPU — the ublk daemon thread
that runs the completion via io_uring_cmd_complete_in_task() task work.
And the completion CPU stays in the same LLC or NUMA node as the submission CPU.
This cross-CPU alloc/free pattern is not unique to ublk. The block
layer's default rq_affinity=1 setting completes requests on a CPU
sharing LLC with the submission CPU, which similarly causes bio freeing
on a different CPU than allocation. The ublk null target simply makes
this pattern more pronounced and measurable because all overhead is in
the bio alloc/free path with no actual I/O.
**The following is from AI, just for reference**
The result is that the allocating CPU's per-CPU slab caches are
continuously drained without being replenished by local frees. The bio
layer's own per-CPU cache (bio_alloc_cache) suffers the same mismatch:
freed bios go to the completion CPU's cache via bio_put_percpu_cache(),
leaving the submitter CPUs' caches empty and falling through to
mempool_alloc() -> kmem_cache_alloc() -> SLUB slow path.
In v6.19, SLUB handled this with a 3-tier allocation hierarchy:
Tier 1: CPU slab freelist lock-free (cmpxchg)
Tier 2: CPU partial slab list lock-free (per-CPU local_lock)
Tier 3: Node partial list kmem_cache_node->list_lock
The CPU partial slab list (Tier 2) was the critical buffer. It was
populated during __slab_free() -> put_cpu_partial() and provided a
lock-free pool of partial slabs per CPU. Even when the CPU slab was
exhausted, the CPU partial list could supply more slabs without
touching any shared lock.
The sheaves architecture replaces this with a 2-tier hierarchy:
Tier 1: Per-CPU sheaf lock-free (local_lock)
Tier 2: Node partial list kmem_cache_node->list_lock
The intermediate lock-free tier is gone. When the per-CPU sheaf is
empty and the spare sheaf is also empty, every refill must go through
the node partial list, requiring kmem_cache_node->list_lock. With 16
CPUs simultaneously allocating bios and all hitting empty sheaves, this
creates a thundering herd on the node list_lock.
When the local node's partial list is also depleted (objects freed on
remote nodes accumulate there instead), get_from_any_partial() kicks in
to search other NUMA nodes, compounding the contention with cross-NUMA
list_lock acquisition — explaining the 41% in get_from_any_partial ->
native_queued_spin_lock_slowpath seen in the profile.
The mitigation in 40fd0acc45d0 ("slub: avoid list_lock contention from
__refill_objects_any()") uses spin_trylock for cross-NUMA refill, but
does not address the fundamental architectural issue: the missing
lock-free intermediate caching tier that the CPU partial list provided.
Thanks,
Ming
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-24 2:52 [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation Ming Lei
@ 2026-02-24 5:00 ` Harry Yoo
2026-02-24 9:07 ` Ming Lei
2026-02-24 6:51 ` Hao Li
2026-02-24 20:27 ` Vlastimil Babka
2 siblings, 1 reply; 25+ messages in thread
From: Harry Yoo @ 2026-02-24 5:00 UTC (permalink / raw)
To: Ming Lei
Cc: Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, Hao Li, surenb
On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> Hello Vlastimil and MM guys,
Hi Ming, thanks for the report!
> The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
> 'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
> performance regression for workloads with persistent cross-CPU
> alloc/free patterns. ublk null target benchmark IOPS drops
> significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
> drop).
>
> Bisecting within the sheaves series is blocked by a kernel panic at
> 17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
> paths"), so the exact first bad commit could not be identified.
Ouch. Why did it crash?
> Reproducer
> ==========
>
> Hardware: NUMA machine with >= 32 CPUs
> Kernel: v7.0-rc (with slab/for-7.0/sheaves merged)
>
> # build kublk selftest
> make -C tools/testing/selftests/ublk/
>
> # create ublk null target device with 16 queues
> tools/testing/selftests/ublk/kublk add -t null -q 16
>
> # run fio/t/io_uring benchmark: 16 jobs, 20 seconds, non-polled
> taskset -c 0-31 fio/t/io_uring -p0 -n 16 -r 20 /dev/ublkb0
>
> # cleanup
> tools/testing/selftests/ublk/kublk del -n 0
>
> Good: v6.19 (and 41f1a08645ab, the mainline parent of the slab merge)
> Bad: 815c8e35511d (Merge branch 'slab/for-7.0/sheaves' into slab/for-next)
Thanks for such detailed steps to reproduce :)
> perf profile (bad kernel)
> =========================
>
> ~47% of CPU time is spent in bio allocation hitting the SLUB slow path,
> with massive spinlock contention on the node partial list lock:
>
> + 47.65% 1.21% io_uring [k] bio_alloc_bioset
> - 44.87% 0.45% io_uring [k] kmem_cache_alloc_noprof
> - 44.41% kmem_cache_alloc_noprof
> - 43.89% ___slab_alloc
> + 41.16% get_from_any_partial
> 0.91% get_from_partial_node
> + 0.87% alloc_from_new_slab
> + 0.65% allocate_slab
> - 44.70% 0.21% io_uring [k] mempool_alloc_noprof
> - 44.49% mempool_alloc_noprof
> - 44.43% kmem_cache_alloc_noprof
> - 43.90% ___slab_alloc
> + 41.18% get_from_any_partial
> 0.90% get_from_partial_node
> + 0.87% alloc_from_new_slab
> + 0.65% allocate_slab
> + 41.23% 0.10% io_uring [k] get_from_any_partial
> + 40.82% 0.48% io_uring [k] __raw_spin_lock_irqsave
> - 40.75% 0.20% io_uring [k] get_from_partial_node
> - 40.56% get_from_partial_node
> - 38.83% __raw_spin_lock_irqsave
> 38.65% native_queued_spin_lock_slowpath
That's pretty severe contention. Interestingly, the profile shows
severe contention on the alloc path, but I don't see the free path here.
Wondering why only the alloc path is suffering, hmm...
Anyway, I think there may be two pieces contributing to this contention:
Part 1) We probably made the portion of slowpath bigger,
by caching a smaller number of objects per CPU
after transitioning to sheaves.
Part 2) We probably made the slowpath much slower.
We need to investigate those parts separately.
Regarding Part 1:
# Point 1. The CPU slab was not considered in the sheaf capacity calculation
calculate_sheaf_capacity() does not take into account that the CPU slab
was also cached per CPU. Shouldn't we add oo_objects(s->oo) to the existing
calculation to cache a number of objects similar to the CPU slab + percpu
partial slab list layers that SLUB previously had?
# Point 2. SLUB no longer relies on "Slabs are half-full" assumption,
# and that probably means we're caching less objects per CPU.
Because SLUB previously assumed "slabs are half-full" when calculating
the number of slabs to cache per CPU, it could actually cache twice as
many objects as intended when slabs are mostly empty.
Because sheaves track the number of objects precisely, that inaccuracy
is gone. If the workload was previously benefiting from the inaccuracy,
sheaves can make CPUs cache a smaller number of objects per CPU compared
to the percpu slab caching layer.
Anyway, I guess we need to check how many objects are actually
cached per CPU w/ and w/o sheaves, during the benchmark.
After making sure the number of objects cached per CPU is the same as
before, we could further investigate how much Part 2 plays into it.
Slightly off-topic, by the way: slab currently doesn't let system admins
set a custom sheaf_capacity. Instead, calculate_sheaf_capacity() sets
the default capacity. I think we need to allow sysadmins to set a custom
sheaf_capacity in the very near future.
> Analysis
> ========
>
> The ublk null target workload exposes a cross-CPU slab allocation
> pattern: bios are allocated on the io_uring submitter CPU during block
> layer submission, but freed on a different CPU — the ublk daemon thread
> that runs the completion via io_uring_cmd_complete_in_task() task work.
> And the completion CPU stays in the same LLC or NUMA node as the submission CPU.
Ok, so a submitter CPU keeps allocating objects, while a completion CPU
keeps freeing objects.
> This cross-CPU alloc/free pattern is not unique to ublk. The block
> layer's default rq_affinity=1 setting completes requests on a CPU
> sharing LLC with the submission CPU, which similarly causes bio freeing
> on a different CPU than allocation. The ublk null target simply makes
> this pattern more pronounced and measurable because all overhead is in
> the bio alloc/free path with no actual I/O.
>
> **The following is from AI, just for reference**
>
> The result is that the allocating CPU's per-CPU slab caches are
> continuously drained without being replenished by local frees. The bio
> layer's own per-CPU cache (bio_alloc_cache) suffers the same mismatch:
> freed bios go to the completion CPU's cache via bio_put_percpu_cache(),
> leaving the submitter CPUs' caches empty and falling through to
> mempool_alloc() -> kmem_cache_alloc() -> SLUB slow path.
Ok.
> In v6.19, SLUB handled this with a 3-tier allocation hierarchy:
>
> Tier 1: CPU slab freelist lock-free (cmpxchg)
> Tier 2: CPU partial slab list lock-free (per-CPU local_lock)
> Tier 3: Node partial list kmem_cache_node->list_lock
>
> The CPU partial slab list (Tier 2) was the critical buffer. It was
> populated during __slab_free() -> put_cpu_partial() and provided a
> lock-free pool of partial slabs per CPU. Even when the CPU slab was
> exhausted, the CPU partial list could supply more slabs without
> touching any shared lock.
Well, the sheaves layer is supposed to provide a similar lock-free pool
of objects per CPU. The percpu slab layer was supposed to cache a certain
number of objects (from multiple slabs), which is translated to the
sheaf capacity now.
> The sheaves architecture replaces this with a 2-tier hierarchy:
>
> Tier 1: Per-CPU sheaf lock-free (local_lock)
> Tier 2: Node partial list kmem_cache_node->list_lock
>
> The intermediate lock-free tier is gone. When the per-CPU sheaf is
> empty and the spare sheaf is also empty, every refill must go through
> the node partial list, requiring kmem_cache_node->list_lock. With 16
> CPUs simultaneously allocating bios and all hitting empty sheaves, this
> creates a thundering herd on the node list_lock.
>
> When the local node's partial list is also depleted (objects freed on
> remote nodes accumulate there instead), get_from_any_partial() kicks in
> to search other NUMA nodes, compounding the contention with cross-NUMA
> list_lock acquisition — explaining the 41% in get_from_any_partial ->
> native_queued_spin_lock_slowpath seen in the profile.
Again, the sheaves layer is supposed to cache a similar number of
objects previously covered by Tier 1 + Tier 2... oh, wait.
The sheaf capacity calculation logic does not take "Tier 1 CPU slab
freelist" into account.
> The mitigation in 40fd0acc45d0 ("slub: avoid list_lock contention from
> __refill_objects_any()") uses spin_trylock for cross-NUMA refill, but
> does not address the fundamental architectural issue: the missing
> lock-free intermediate caching tier that the CPU partial list provided.
>
> Thanks,
> Ming
--
Cheers,
Harry / Hyeonggon
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-24 2:52 [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation Ming Lei
2026-02-24 5:00 ` Harry Yoo
@ 2026-02-24 6:51 ` Hao Li
2026-02-24 7:10 ` Harry Yoo
2026-02-24 20:27 ` Vlastimil Babka
2 siblings, 1 reply; 25+ messages in thread
From: Hao Li @ 2026-02-24 6:51 UTC (permalink / raw)
To: Ming Lei
Cc: Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, Harry Yoo
On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> Hello Vlastimil and MM guys,
>
> The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
> 'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
> performance regression for workloads with persistent cross-CPU
> alloc/free patterns. ublk null target benchmark IOPS drops
> significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
> drop).
Thanks for testing.
>
> Bisecting within the sheaves series is blocked by a kernel panic at
> 17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
> paths"),
As Harry said, this is odd. Could you post crash logs?
> so the exact first bad commit could not be identified.
Based on my earlier test results, this performance regression (more precisely, I
suspect it is an expected return to the previous baseline - see below) was most
likely introduced by two patches:
slab: add optimized sheaf refill from partial list
slab: remove SLUB_CPU_PARTIAL
https://lore.kernel.org/linux-mm/imzzlzuzjmlkhxc7hszxh5ba7jksvqcieg5rzyryijkkdhai5q@l2t4ye5quozb/
>
> Reproducer
> ==========
>
[...]
>
> The result is that the allocating CPU's per-CPU slab caches are
> continuously drained without being replenished by local frees. The bio
> layer's own per-CPU cache (bio_alloc_cache) suffers the same mismatch:
> freed bios go to the completion CPU's cache via bio_put_percpu_cache(),
> leaving the submitter CPUs' caches empty and falling through to
> mempool_alloc() -> kmem_cache_alloc() -> SLUB slow path.
>
> In v6.19, SLUB handled this with a 3-tier allocation hierarchy:
>
> Tier 1: CPU slab freelist lock-free (cmpxchg)
> Tier 2: CPU partial slab list lock-free (per-CPU local_lock)
> Tier 3: Node partial list kmem_cache_node->list_lock
>
> The CPU partial slab list (Tier 2) was the critical buffer. It was
> populated during __slab_free() -> put_cpu_partial() and provided a
> lock-free pool of partial slabs per CPU. Even when the CPU slab was
> exhausted, the CPU partial list could supply more slabs without
> touching any shared lock.
>
> The sheaves architecture replaces this with a 2-tier hierarchy:
>
> Tier 1: Per-CPU sheaf lock-free (local_lock)
> Tier 2: Node partial list kmem_cache_node->list_lock
>
> The intermediate lock-free tier is gone. When the per-CPU sheaf is
> empty and the spare sheaf is also empty, every refill must go through
> the node partial list, requiring kmem_cache_node->list_lock. With 16
> CPUs simultaneously allocating bios and all hitting empty sheaves, this
> creates a thundering herd on the node list_lock.
>
> When the local node's partial list is also depleted (objects freed on
> remote nodes accumulate there instead), get_from_any_partial() kicks in
> to search other NUMA nodes, compounding the contention with cross-NUMA
> list_lock acquisition — explaining the 41% in get_from_any_partial ->
> native_queued_spin_lock_slowpath seen in the profile.
The purpose of introducing sheaves was to fully replace the percpu partial slabs
mechanism with sheaves. During this process, we first added the sheaves caching
layer and only later removed the percpu partial slabs layer, so it's expected
that performance could first improve and then return to the previous level.
Would you mind also comparing against a baseline with "no sheaves at all" (e.g.
commit `9d4e6ab865c4`) versus "only the sheaves layer exists" (i.e. commit
`815c8e35511d`)? If those two results are close, then the ~64% performance
regression we're currently discussing might be better interpreted as returning
to the previous baseline (i.e. a reversion), rather than a true regression.
The link below contains my previous test results. According to will-it-scale,
the performance of "no sheaves at all" and "only the sheaves layer exists" is
close:
https://lore.kernel.org/linux-mm/pdmjsvpkl5nsntiwfwguplajq27ak3xpboq3ab77zrbu763pq7@la3hyiqigpir/
--
Thanks,
Hao
>
> The mitigation in 40fd0acc45d0 ("slub: avoid list_lock contention from
> __refill_objects_any()") uses spin_trylock for cross-NUMA refill, but
> does not address the fundamental architectural issue: the missing
> lock-free intermediate caching tier that the CPU partial list provided.
>
> Thanks,
> Ming
>
>
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-24 6:51 ` Hao Li
@ 2026-02-24 7:10 ` Harry Yoo
2026-02-24 7:41 ` Hao Li
0 siblings, 1 reply; 25+ messages in thread
From: Harry Yoo @ 2026-02-24 7:10 UTC (permalink / raw)
To: Hao Li
Cc: Ming Lei, Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block
On Tue, Feb 24, 2026 at 02:51:26PM +0800, Hao Li wrote:
> On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> > Reproducer
> > ==========
> >
> [...]
> >
> > The result is that the allocating CPU's per-CPU slab caches are
> > continuously drained without being replenished by local frees. The bio
> > layer's own per-CPU cache (bio_alloc_cache) suffers the same mismatch:
> > freed bios go to the completion CPU's cache via bio_put_percpu_cache(),
> > leaving the submitter CPUs' caches empty and falling through to
> > mempool_alloc() -> kmem_cache_alloc() -> SLUB slow path.
> >
> > In v6.19, SLUB handled this with a 3-tier allocation hierarchy:
> >
> > Tier 1: CPU slab freelist lock-free (cmpxchg)
> > Tier 2: CPU partial slab list lock-free (per-CPU local_lock)
> > Tier 3: Node partial list kmem_cache_node->list_lock
> >
> > The CPU partial slab list (Tier 2) was the critical buffer. It was
> > populated during __slab_free() -> put_cpu_partial() and provided a
> > lock-free pool of partial slabs per CPU. Even when the CPU slab was
> > exhausted, the CPU partial list could supply more slabs without
> > touching any shared lock.
> >
> > The sheaves architecture replaces this with a 2-tier hierarchy:
> >
> > Tier 1: Per-CPU sheaf lock-free (local_lock)
> > Tier 2: Node partial list kmem_cache_node->list_lock
> >
> > The intermediate lock-free tier is gone. When the per-CPU sheaf is
> > empty and the spare sheaf is also empty, every refill must go through
> > the node partial list, requiring kmem_cache_node->list_lock. With 16
> > CPUs simultaneously allocating bios and all hitting empty sheaves, this
> > creates a thundering herd on the node list_lock.
> >
> > When the local node's partial list is also depleted (objects freed on
> > remote nodes accumulate there instead), get_from_any_partial() kicks in
> > to search other NUMA nodes, compounding the contention with cross-NUMA
> > list_lock acquisition — explaining the 41% in get_from_any_partial ->
> > native_queued_spin_lock_slowpath seen in the profile.
>
> The purpose of introducing sheaves was to fully replace the percpu partial slabs
> mechanism with sheaves. During this process, we first added the sheaves caching
> layer and only later removed the percpu partial slabs layer, so it's expected
> that performance could first improve and then return to the previous level.
There's one difference here; you used the will-it-scale mmap2 test case,
which involves the maple tree node and vm_area_struct caches that already
have sheaves enabled in v6.19.
And Ming's benchmark stresses bio-<size> caches.
Since other caches don't have sheaves in v6.19, they're not supposed to
see a performance gain from having an additional sheaves layer on top of
the cpu slab + percpu partial slab list.
> Would you mind also comparing against a baseline with "no sheaves at all" (e.g.
> commit `9d4e6ab865c4`) versus "only the sheaves layer exists" (i.e. commit
> `815c8e35511d`)? If those two results are close, then the ~64% performance
> regression we're currently discussing might be better interpreted as returning
> to the previous baseline (i.e. a reversion), rather than a true regression.
>
> The link below contains my previous test results. According to will-it-scale,
> the performance of "no sheaves at all" and "only the sheaves layer exists" is
> close:
> https://lore.kernel.org/linux-mm/pdmjsvpkl5nsntiwfwguplajq27ak3xpboq3ab77zrbu763pq7@la3hyiqigpir/
--
Cheers,
Harry / Hyeonggon
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-24 7:10 ` Harry Yoo
@ 2026-02-24 7:41 ` Hao Li
0 siblings, 0 replies; 25+ messages in thread
From: Hao Li @ 2026-02-24 7:41 UTC (permalink / raw)
To: Harry Yoo
Cc: Ming Lei, Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block
On Tue, Feb 24, 2026 at 04:10:43PM +0900, Harry Yoo wrote:
> On Tue, Feb 24, 2026 at 02:51:26PM +0800, Hao Li wrote:
> > On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> > > Reproducer
> > > ==========
> > >
> > [...]
> > >
> > > The result is that the allocating CPU's per-CPU slab caches are
> > > continuously drained without being replenished by local frees. The bio
> > > layer's own per-CPU cache (bio_alloc_cache) suffers the same mismatch:
> > > freed bios go to the completion CPU's cache via bio_put_percpu_cache(),
> > > leaving the submitter CPUs' caches empty and falling through to
> > > mempool_alloc() -> kmem_cache_alloc() -> SLUB slow path.
> > >
> > > In v6.19, SLUB handled this with a 3-tier allocation hierarchy:
> > >
> > > Tier 1: CPU slab freelist lock-free (cmpxchg)
> > > Tier 2: CPU partial slab list lock-free (per-CPU local_lock)
> > > Tier 3: Node partial list kmem_cache_node->list_lock
> > >
> > > The CPU partial slab list (Tier 2) was the critical buffer. It was
> > > populated during __slab_free() -> put_cpu_partial() and provided a
> > > lock-free pool of partial slabs per CPU. Even when the CPU slab was
> > > exhausted, the CPU partial list could supply more slabs without
> > > touching any shared lock.
> > >
> > > The sheaves architecture replaces this with a 2-tier hierarchy:
> > >
> > > Tier 1: Per-CPU sheaf lock-free (local_lock)
> > > Tier 2: Node partial list kmem_cache_node->list_lock
> > >
> > > The intermediate lock-free tier is gone. When the per-CPU sheaf is
> > > empty and the spare sheaf is also empty, every refill must go through
> > > the node partial list, requiring kmem_cache_node->list_lock. With 16
> > > CPUs simultaneously allocating bios and all hitting empty sheaves, this
> > > creates a thundering herd on the node list_lock.
> > >
> > > When the local node's partial list is also depleted (objects freed on
> > > remote nodes accumulate there instead), get_from_any_partial() kicks in
> > > to search other NUMA nodes, compounding the contention with cross-NUMA
> > > list_lock acquisition — explaining the 41% in get_from_any_partial ->
> > > native_queued_spin_lock_slowpath seen in the profile.
> >
> > The purpose of introducing sheaves was to fully replace the percpu partial slabs
> > mechanism with sheaves. During this process, we first added the sheaves caching
> > layer and only later removed the percpu partial slabs layer, so it's expected
> > that performance could first improve and then return to the previous level.
>
> There's one difference here; you used will-it-scale mmap2 test case that
> involves maple tree node and vm_area_struct cache that already has
> sheaves enabled in v6.19.
>
> And Ming's benchmark stresses bio-<size> caches.
>
> Since other caches don't have sheaves in v6.19, they're not supposed to
> have performance gain by having additional sheaves layer on top of cpu
> slab + percpu partial slab list.
Oh, yes, you're right. That distinction is important!
I think I've gotten a bit stuck in a fixed way of thinking...
Thanks for pointing it out!
>
> > Would you mind also comparing against a baseline with "no sheaves at all" (e.g.
> > commit `9d4e6ab865c4`) versus "only the sheaves layer exists" (i.e. commit
> > `815c8e35511d`)? If those two results are close, then the ~64% performance
> > regression we're currently discussing might be better interpreted as returning
> > to the previous baseline (i.e. a reversion), rather than a true regression.
> >
> > The link below contains my previous test results. According to will-it-scale,
> > the performance of "no sheaves at all" and "only the sheaves layer exists" is
> > close:
> > https://lore.kernel.org/linux-mm/pdmjsvpkl5nsntiwfwguplajq27ak3xpboq3ab77zrbu763pq7@la3hyiqigpir/
>
> --
> Cheers,
> Harry / Hyeonggon
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-24 5:00 ` Harry Yoo
@ 2026-02-24 9:07 ` Ming Lei
2026-02-25 5:32 ` Hao Li
0 siblings, 1 reply; 25+ messages in thread
From: Ming Lei @ 2026-02-24 9:07 UTC (permalink / raw)
To: Harry Yoo
Cc: Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, Hao Li, surenb
Hi Harry,
On Tue, Feb 24, 2026 at 02:00:15PM +0900, Harry Yoo wrote:
> On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> > Hello Vlastimil and MM guys,
>
> Hi Ming, thanks for the report!
>
> > The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
> > 'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
> > performance regression for workloads with persistent cross-CPU
> > alloc/free patterns. ublk null target benchmark IOPS drops
> > significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
> > drop).
> >
> > Bisecting within the sheaves series is blocked by a kernel panic at
> > 17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
> > paths"), so the exact first bad commit could not be identified.
>
> Ouch. Why did it crash?
[ 16.162422] Oops: general protection fault, probably for non-canonical address 0xdead000000000110: 0000 [#1] SMP NOPTI
[ 16.162426] CPU: 44 UID: 0 PID: 908 Comm: (udev-worker) Not tainted 6.19.0-rc5_master+ #116 PREEMPT(lazy)
[ 16.162429] Hardware name: Giga Computing MZ73-LM2-000/MZ73-LM2-000, BIOS R19_F40 05/12/2025
[ 16.162430] RIP: 0010:__put_partials+0x2f/0x140
[ 16.162437] Code: 41 57 41 56 49 89 f6 41 55 49 89 fd 31 ff 41 54 45 31 e4 55 53 48 83 ec 18 48 c7 44 24 10 00 00 00 00 eb 03 48 89 df 4c9
[ 16.162438] RSP: 0018:ff5117c0ca2dfa60 EFLAGS: 00010086
[ 16.162441] RAX: 0000000000000001 RBX: ff1b266981200d80 RCX: 0000000000000246
[ 16.162442] RDX: ff1b266981200d90 RSI: ff1b266981200d90 RDI: ff1b266981200d80
[ 16.162442] RBP: dead000000000100 R08: 0000000000000000 R09: ffffffffa761bf5e
[ 16.162443] R10: ffb6d4b7841d5400 R11: ff1b2669800575c0 R12: 0000000000000000
[ 16.162444] R13: ff1b2669800575c0 R14: dead000000000100 R15: ffb6d4b7846be410
[ 16.162445] FS: 00007f5fdccc23c0(0000) GS:ff1b267902427000(0000) knlGS:0000000000000000
[ 16.162446] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 16.162446] CR2: 0000559824c6c058 CR3: 000000011fb49001 CR4: 0000000000f71ef0
[ 16.162447] PKRU: 55555554
[ 16.162448] Call Trace:
[ 16.162450] <TASK>
[ 16.162452] kmem_cache_free+0x410/0x490
[ 16.162454] do_readlinkat+0x14e/0x180
[ 16.162459] __x64_sys_readlinkat+0x1c/0x30
[ 16.162461] do_syscall_64+0x7e/0x6b0
[ 16.162465] ? post_alloc_hook+0xb9/0x140
[ 16.162468] ? get_page_from_freelist+0x478/0x720
[ 16.162470] ? path_openat+0xb3/0x2a0
[ 16.162472] ? __alloc_frozen_pages_noprof+0x192/0x350
[ 16.162474] ? count_memcg_events+0xd6/0x210
[ 16.162476] ? memcg1_commit_charge+0x7a/0xa0
[ 16.162479] ? mod_memcg_lruvec_state+0xe7/0x2d0
[ 16.162481] ? charge_memcg+0x48/0x80
[ 16.162482] ? lruvec_stat_mod_folio+0x85/0xd0
[ 16.162484] ? __folio_mod_stat+0x2d/0x90
[ 16.162487] ? set_ptes.isra.0+0x36/0x80
[ 16.162490] ? do_anonymous_page+0x100/0x4a0
[ 16.162492] ? __handle_mm_fault+0x45d/0x6f0
[ 16.162493] ? count_memcg_events+0xd6/0x210
[ 16.162494] ? handle_mm_fault+0x212/0x340
[ 16.162495] ? do_user_addr_fault+0x2b4/0x7b0
[ 16.162500] ? irqentry_exit+0x6d/0x540
[ 16.162502] ? exc_page_fault+0x7e/0x1a0
[ 16.162503] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> > Reproducer
> > ==========
> >
> > Hardware: NUMA machine with >= 32 CPUs
> > Kernel: v7.0-rc (with slab/for-7.0/sheaves merged)
> >
> > # build kublk selftest
> > make -C tools/testing/selftests/ublk/
> >
> > # create ublk null target device with 16 queues
> > tools/testing/selftests/ublk/kublk add -t null -q 16
> >
> > # run fio/t/io_uring benchmark: 16 jobs, 20 seconds, non-polled
> > taskset -c 0-31 fio/t/io_uring -p0 -n 16 -r 20 /dev/ublkb0
> >
> > # cleanup
> > tools/testing/selftests/ublk/kublk del -n 0
> >
> > Good: v6.19 (and 41f1a08645ab, the mainline parent of the slab merge)
> > Bad: 815c8e35511d (Merge branch 'slab/for-7.0/sheaves' into slab/for-next)
>
> Thanks for such detailed steps to reproduce :)
>
> > perf profile (bad kernel)
> > =========================
> >
> > ~47% of CPU time is spent in bio allocation hitting the SLUB slow path,
> > with massive spinlock contention on the node partial list lock:
> >
> > + 47.65% 1.21% io_uring [k] bio_alloc_bioset
> > - 44.87% 0.45% io_uring [k] kmem_cache_alloc_noprof
> > - 44.41% kmem_cache_alloc_noprof
> > - 43.89% ___slab_alloc
> > + 41.16% get_from_any_partial
> > 0.91% get_from_partial_node
> > + 0.87% alloc_from_new_slab
> > + 0.65% allocate_slab
> > - 44.70% 0.21% io_uring [k] mempool_alloc_noprof
> > - 44.49% mempool_alloc_noprof
> > - 44.43% kmem_cache_alloc_noprof
> > - 43.90% ___slab_alloc
> > + 41.18% get_from_any_partial
> > 0.90% get_from_partial_node
> > + 0.87% alloc_from_new_slab
> > + 0.65% allocate_slab
> > + 41.23% 0.10% io_uring [k] get_from_any_partial
> > + 40.82% 0.48% io_uring [k] __raw_spin_lock_irqsave
> > - 40.75% 0.20% io_uring [k] get_from_partial_node
> > - 40.56% get_from_partial_node
> > - 38.83% __raw_spin_lock_irqsave
> > 38.65% native_queued_spin_lock_slowpath
>
> That's pretty severe contention. Interestingly, the profile shows
> severe contention on the alloc path, but I don't see the free path here.
> Wondering why only the alloc path is suffering, hmm...
The free path looks fine.
+ 2.84% 0.16% kublk [kernel.kallsyms] [k] mempool_free
+ 2.66% 0.17% kublk [kernel.kallsyms] [k] security_uring_cmd
+ 2.57% 0.36% kublk [kernel.kallsyms] [k] __slab_free
>
> Anyway, I think there may be two pieces contributing to this contention:
>
> Part 1) We probably made the portion of slowpath bigger,
> by caching a smaller number of objects per CPU
> after transitioning to sheaves.
>
> Part 2) We probably made the slowpath much slower.
>
> We need to investigate those parts separately.
>
> Regarding Part 1:
>
> # Point 1. The CPU slab was not considered in the sheaf capacity calculation
>
> calculate_sheaf_capacity() does not take into account that the CPU slab
> was also cached per CPU. Shouldn't we add oo_objects(s->oo) to the existing
> calculation to cache a number of objects similar to the CPU slab + percpu
> partial slab list layers that SLUB previously had?
>
> # Point 2. SLUB no longer relies on "Slabs are half-full" assumption,
> > # and that probably means we're caching fewer objects per CPU.
>
> Because SLUB previously assumed "slabs are half-full" when calculating
> > the number of slabs to cache per CPU, it could actually cache up to twice
> > as many objects as intended when slabs are mostly empty.
>
> Because sheaves track the number of objects precisely, that inaccuracy
> is gone. If the workload was previously benefiting from the inaccuracy,
> sheaves can make CPUs cache a smaller number of objects per CPU compared
> to the percpu slab caching layer.
>
> Anyway, I guess we need to check how many objects are actually
> cached per CPU w/ and w/o sheaves, during the benchmark.
In the workload `fio/t/io_uring -p0 -n 16 -r 20 /dev/ublkb0`, queue depth
is 128, so there should be 128 inflight bios across these 16 tasks/CPUs.
>
> After making sure the number of objects cached per CPU is the same as
> before, we could further investigate how much Part 2 plays into it.
>
> Slightly off-topic, by the way, slab currently doesn't let system admins
> set a custom sheaf_capacity. Instead, calculate_sheaf_capacity() sets
> the default capacity. I think we need to allow sys admins to set a custom
> sheaf_capacity in the very near future.
>
> > Analysis
> > ========
> >
> > The ublk null target workload exposes a cross-CPU slab allocation
> > pattern: bios are allocated on the io_uring submitter CPU during block
> > layer submission, but freed on a different CPU — the ublk daemon thread
> > that runs the completion via io_uring_cmd_complete_in_task() task work.
> > And the completion CPU stays in the same LLC or NUMA node as the submission CPU.
>
> Ok, so a submitter CPU keeps allocating objects, while a completion CPU
> keeps freeing objects.
Yes.
Thanks,
Ming
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-24 2:52 [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation Ming Lei
2026-02-24 5:00 ` Harry Yoo
2026-02-24 6:51 ` Hao Li
@ 2026-02-24 20:27 ` Vlastimil Babka
2026-02-25 5:24 ` Harry Yoo
2026-02-25 8:45 ` Vlastimil Babka (SUSE)
2 siblings, 2 replies; 25+ messages in thread
From: Vlastimil Babka @ 2026-02-24 20:27 UTC (permalink / raw)
To: Ming Lei, Vlastimil Babka, Andrew Morton
Cc: linux-mm, linux-kernel, linux-block, Harry Yoo, Hao Li,
Christoph Hellwig
On 2/24/26 3:52 AM, Ming Lei wrote:
> Hello Vlastimil and MM guys,
>
> The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
> 'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
> performance regression for workloads with persistent cross-CPU
> alloc/free patterns. ublk null target benchmark IOPS drops
> significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
> drop).
>
> Bisecting within the sheaves series is blocked by a kernel panic at
> 17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
> paths"), so the exact first bad commit could not be identified.
>
> Reproducer
> ==========
>
> Hardware: NUMA machine with >= 32 CPUs
> Kernel: v7.0-rc (with slab/for-7.0/sheaves merged)
>
> # build kublk selftest
> make -C tools/testing/selftests/ublk/
>
> # create ublk null target device with 16 queues
> tools/testing/selftests/ublk/kublk add -t null -q 16
>
> # run fio/t/io_uring benchmark: 16 jobs, 20 seconds, non-polled
> taskset -c 0-31 fio/t/io_uring -p0 -n 16 -r 20 /dev/ublkb0
>
> # cleanup
> tools/testing/selftests/ublk/kublk del -n 0
>
> Good: v6.19 (and 41f1a08645ab, the mainline parent of the slab merge)
> Bad: 815c8e35511d (Merge branch 'slab/for-7.0/sheaves' into slab/for-next)
>
> perf profile (bad kernel)
> =========================
>
> ~47% of CPU time is spent in bio allocation hitting the SLUB slow path,
> with massive spinlock contention on the node partial list lock:
>
> + 47.65% 1.21% io_uring [k] bio_alloc_bioset
> - 44.87% 0.45% io_uring [k] kmem_cache_alloc_noprof
> - 44.41% kmem_cache_alloc_noprof
> - 43.89% ___slab_alloc
> + 41.16% get_from_any_partial
So this function is not used in the sheaf refill path, but in the
fallback slowpath when the alloc_from_pcs() fastpath fails.
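To make that fall-through concrete, here is a rough userspace model of the two levels. The function names echo the profile above, but the structures and counters are invented purely for illustration and look nothing like the real kernel code:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: a per-CPU sheaf is just a small array of cached objects. */
struct pcs {
	void *objects[8];
	int count;
};

static int slowpath_hits; /* each hit would take kmem_cache_node->list_lock */

/* Fastpath: pop from the per-CPU sheaf; no shared lock is taken. */
static void *alloc_from_pcs_model(struct pcs *pcs)
{
	if (pcs->count == 0)
		return NULL;		/* empty sheaf -> caller falls back */
	return pcs->objects[--pcs->count];
}

/* Fallback: models ___slab_alloc() -> get_from_partial_node(), which
 * has to take the shared node list_lock (the contended path in the
 * profile). */
static void *slab_alloc_slowpath_model(void)
{
	slowpath_hits++;
	return (void *)0x1;		/* pretend we got an object */
}

static void *kmem_cache_alloc_model(struct pcs *pcs)
{
	void *obj = alloc_from_pcs_model(pcs);

	return obj ? obj : slab_alloc_slowpath_model();
}
```

If the sheaf is never refilled, every single allocation degenerates into the locked slowpath, which is the shape of the profile above.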
> 0.91% get_from_partial_node
> + 0.87% alloc_from_new_slab
> + 0.65% allocate_slab
> - 44.70% 0.21% io_uring [k] mempool_alloc_noprof
> - 44.49% mempool_alloc_noprof
> - 44.43% kmem_cache_alloc_noprof
And I'd guess alloc_from_pcs() fails because in
__pcs_replace_empty_main() gfpflags_allow_blocking() is false,
because mempool_alloc_noprof() makes its first attempt without
__GFP_DIRECT_RECLAIM. That allocation will still succeed, but we end up
relying on the slowpath all the time and performance will drop.
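To illustrate the suspected interaction, a minimal sketch (the flag values here are simplified stand-ins, not the kernel's actual bit layout):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the real gfp bits. */
#define __GFP_DIRECT_RECLAIM	0x400u
#define GFP_KERNEL		(0x400u | 0x80u | 0x40u)	/* includes direct reclaim */

typedef unsigned int gfp_t;

/* Mirrors the kernel helper: blocking is allowed only when direct
 * reclaim is permitted. */
static bool gfpflags_allow_blocking_model(gfp_t flags)
{
	return flags & __GFP_DIRECT_RECLAIM;
}

/* mempool_alloc()'s first attempt masks out __GFP_DIRECT_RECLAIM so it
 * can fall back to the pool's reserved elements instead of blocking. */
static gfp_t mempool_first_attempt_flags(gfp_t gfp_mask)
{
	return gfp_mask & ~__GFP_DIRECT_RECLAIM;
}
```

So even when the caller passes GFP_KERNEL, the flags that actually reach the slab allocator on the first attempt fail the gfpflags_allow_blocking() check, and the sheaf refill is skipped.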
It made sense to me not to refill sheaves when we can't reclaim, but I
didn't anticipate this interaction with mempools. We could change them
but there might be others using a similar pattern. Maybe it would be for
the best to just drop that heuristic from __pcs_replace_empty_main()
(but carefully as some deadlock avoidance depends on it, we might need
to e.g. replace it with gfpflags_allow_spinning()). I'll send a patch
tomorrow to test this theory, unless someone beats me to it (feel free to).
Until then IMHO we can dismiss the AI explanation and also the
insufficient sheaf capacity theories.
> - 43.90% ___slab_alloc
> + 41.18% get_from_any_partial
> 0.90% get_from_partial_node
> + 0.87% alloc_from_new_slab
> + 0.65% allocate_slab
> + 41.23% 0.10% io_uring [k] get_from_any_partial
> + 40.82% 0.48% io_uring [k] __raw_spin_lock_irqsave
> - 40.75% 0.20% io_uring [k] get_from_partial_node
> - 40.56% get_from_partial_node
> - 38.83% __raw_spin_lock_irqsave
> 38.65% native_queued_spin_lock_slowpath
>
> Analysis
> ========
>
> The ublk null target workload exposes a cross-CPU slab allocation
> pattern: bios are allocated on the io_uring submitter CPU during block
> layer submission, but freed on a different CPU — the ublk daemon thread
> that runs the completion via io_uring_cmd_complete_in_task() task work.
> And the completion CPU stays in the same LLC or NUMA node as the submission CPU.
>
> This cross-CPU alloc/free pattern is not unique to ublk. The block
> layer's default rq_affinity=1 setting completes requests on a CPU
> sharing LLC with the submission CPU, which similarly causes bio freeing
> on a different CPU than allocation. The ublk null target simply makes
> this pattern more pronounced and measurable because all overhead is in
> the bio alloc/free path with no actual I/O.
>
> **The following is from AI, just for reference**
>
> The result is that the allocating CPU's per-CPU slab caches are
> continuously drained without being replenished by local frees. The bio
> layer's own per-CPU cache (bio_alloc_cache) suffers the same mismatch:
> freed bios go to the completion CPU's cache via bio_put_percpu_cache(),
> leaving the submitter CPUs' caches empty and falling through to
> mempool_alloc() -> kmem_cache_alloc() -> SLUB slow path.
>
> In v6.19, SLUB handled this with a 3-tier allocation hierarchy:
>
> Tier 1: CPU slab freelist lock-free (cmpxchg)
> Tier 2: CPU partial slab list lock-free (per-CPU local_lock)
> Tier 3: Node partial list kmem_cache_node->list_lock
>
> The CPU partial slab list (Tier 2) was the critical buffer. It was
> populated during __slab_free() -> put_cpu_partial() and provided a
> lock-free pool of partial slabs per CPU. Even when the CPU slab was
> exhausted, the CPU partial list could supply more slabs without
> touching any shared lock.
>
> The sheaves architecture replaces this with a 2-tier hierarchy:
>
> Tier 1: Per-CPU sheaf lock-free (local_lock)
> Tier 2: Node partial list kmem_cache_node->list_lock
>
> The intermediate lock-free tier is gone. When the per-CPU sheaf is
> empty and the spare sheaf is also empty, every refill must go through
> the node partial list, requiring kmem_cache_node->list_lock. With 16
> CPUs simultaneously allocating bios and all hitting empty sheaves, this
> creates a thundering herd on the node list_lock.
>
> When the local node's partial list is also depleted (objects freed on
> remote nodes accumulate there instead), get_from_any_partial() kicks in
> to search other NUMA nodes, compounding the contention with cross-NUMA
> list_lock acquisition — explaining the 41% in get_from_any_partial ->
> native_queued_spin_lock_slowpath seen in the profile.
>
> The mitigation in 40fd0acc45d0 ("slub: avoid list_lock contention from
> __refill_objects_any()") uses spin_trylock for cross-NUMA refill, but
> does not address the fundamental architectural issue: the missing
> lock-free intermediate caching tier that the CPU partial list provided.
>
> Thanks,
> Ming
>
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-24 20:27 ` Vlastimil Babka
@ 2026-02-25 5:24 ` Harry Yoo
2026-02-25 8:45 ` Vlastimil Babka (SUSE)
1 sibling, 0 replies; 25+ messages in thread
From: Harry Yoo @ 2026-02-25 5:24 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Ming Lei, Andrew Morton, linux-mm, linux-kernel, linux-block,
Hao Li, Christoph Hellwig
On Tue, Feb 24, 2026 at 09:27:40PM +0100, Vlastimil Babka wrote:
> On 2/24/26 3:52 AM, Ming Lei wrote:
> > Hello Vlastimil and MM guys,
> >
> > The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
> > 'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
> > performance regression for workloads with persistent cross-CPU
> > alloc/free patterns. ublk null target benchmark IOPS drops
> > significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
> > drop).
> >
> > Bisecting within the sheaves series is blocked by a kernel panic at
> > 17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
> > paths"), so the exact first bad commit could not be identified.
> >
> > Reproducer
> > ==========
> >
> > Hardware: NUMA machine with >= 32 CPUs
> > Kernel: v7.0-rc (with slab/for-7.0/sheaves merged)
> >
> > # build kublk selftest
> > make -C tools/testing/selftests/ublk/
> >
> > # create ublk null target device with 16 queues
> > tools/testing/selftests/ublk/kublk add -t null -q 16
> >
> > # run fio/t/io_uring benchmark: 16 jobs, 20 seconds, non-polled
> > taskset -c 0-31 fio/t/io_uring -p0 -n 16 -r 20 /dev/ublkb0
> >
> > # cleanup
> > tools/testing/selftests/ublk/kublk del -n 0
> >
> > Good: v6.19 (and 41f1a08645ab, the mainline parent of the slab merge)
> > Bad: 815c8e35511d (Merge branch 'slab/for-7.0/sheaves' into slab/for-next)
> >
> > perf profile (bad kernel)
> > =========================
> >
> > ~47% of CPU time is spent in bio allocation hitting the SLUB slow path,
> > with massive spinlock contention on the node partial list lock:
> >
> > + 47.65% 1.21% io_uring [k] bio_alloc_bioset
> > - 44.87% 0.45% io_uring [k] kmem_cache_alloc_noprof
> > - 44.41% kmem_cache_alloc_noprof
> > - 43.89% ___slab_alloc
> > + 41.16% get_from_any_partial
>
> So this function is not used in the sheaf refill path, but in the
> fallback slowpath when the alloc_from_pcs() fastpath fails.
Good point.
> > 0.91% get_from_partial_node
> > + 0.87% alloc_from_new_slab
> > + 0.65% allocate_slab
> > - 44.70% 0.21% io_uring [k] mempool_alloc_noprof
> > - 44.49% mempool_alloc_noprof
> > - 44.43% kmem_cache_alloc_noprof
>
> And I'd guess alloc_from_pcs() fails because in
> __pcs_replace_empty_main() gfpflags_allow_blocking() is false,
> because mempool_alloc_noprof() makes its first attempt without
> __GFP_DIRECT_RECLAIM. That allocation will still succeed, but we end up
> relying on the slowpath all the time and performance will drop.
That's a very good point. I was missing that aspect.
> It made sense to me not to refill sheaves when we can't reclaim, but I
> didn't anticipate this interaction with mempools.
Me neither :)
> We could change them but there might be others using a similar pattern.
Probably, yes.
> Maybe it would be for the best to just drop that heuristic from
> __pcs_replace_empty_main()
Sounds fair.
> (but carefully as some deadlock avoidance depends on it, we might need
> to e.g. replace it with gfpflags_allow_spinning()). I'll send a patch
> tomorrow to test this theory, unless someone beats me to it (feel free to).
I think your point is valid. Let's give it a try.
> Until then IMHO we can dismiss the AI explanation and also the
> insufficient sheaf capacity theories.
Yeah :) let's first see how it performs after addressing your point.
--
Cheers,
Harry / Hyeonggon
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-24 9:07 ` Ming Lei
@ 2026-02-25 5:32 ` Hao Li
2026-02-25 6:54 ` Harry Yoo
0 siblings, 1 reply; 25+ messages in thread
From: Hao Li @ 2026-02-25 5:32 UTC (permalink / raw)
To: Ming Lei
Cc: Harry Yoo, Vlastimil Babka, Andrew Morton, linux-mm,
linux-kernel, linux-block, surenb
On Tue, Feb 24, 2026 at 05:07:18PM +0800, Ming Lei wrote:
> Hi Harry,
>
> On Tue, Feb 24, 2026 at 02:00:15PM +0900, Harry Yoo wrote:
> > On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> > > Hello Vlastimil and MM guys,
> >
> > Hi Ming, thanks for the report!
> >
> > > The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
> > > 'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
> > > performance regression for workloads with persistent cross-CPU
> > > alloc/free patterns. ublk null target benchmark IOPS drops
> > > significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
> > > drop).
> > >
> > > Bisecting within the sheaves series is blocked by a kernel panic at
> > > 17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
> > > paths"), so the exact first bad commit could not be identified.
> >
> > Ouch. Why did it crash?
>
> [ 16.162422] Oops: general protection fault, probably for non-canonical address 0xdead000000000110: 0000 [#1] SMP NOPTI
> [ 16.162426] CPU: 44 UID: 0 PID: 908 Comm: (udev-worker) Not tainted 6.19.0-rc5_master+ #116 PREEMPT(lazy)
> [ 16.162429] Hardware name: Giga Computing MZ73-LM2-000/MZ73-LM2-000, BIOS R19_F40 05/12/2025
> [ 16.162430] RIP: 0010:__put_partials+0x2f/0x140
> [ 16.162437] Code: 41 57 41 56 49 89 f6 41 55 49 89 fd 31 ff 41 54 45 31 e4 55 53 48 83 ec 18 48 c7 44 24 10 00 00 00 00 eb 03 48 89 df 4c9
> [ 16.162438] RSP: 0018:ff5117c0ca2dfa60 EFLAGS: 00010086
> [ 16.162441] RAX: 0000000000000001 RBX: ff1b266981200d80 RCX: 0000000000000246
> [ 16.162442] RDX: ff1b266981200d90 RSI: ff1b266981200d90 RDI: ff1b266981200d80
> [ 16.162442] RBP: dead000000000100 R08: 0000000000000000 R09: ffffffffa761bf5e
> [ 16.162443] R10: ffb6d4b7841d5400 R11: ff1b2669800575c0 R12: 0000000000000000
> [ 16.162444] R13: ff1b2669800575c0 R14: dead000000000100 R15: ffb6d4b7846be410
> [ 16.162445] FS: 00007f5fdccc23c0(0000) GS:ff1b267902427000(0000) knlGS:0000000000000000
> [ 16.162446] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 16.162446] CR2: 0000559824c6c058 CR3: 000000011fb49001 CR4: 0000000000f71ef0
> [ 16.162447] PKRU: 55555554
> [ 16.162448] Call Trace:
> [ 16.162450] <TASK>
> [ 16.162452] kmem_cache_free+0x410/0x490
> [ 16.162454] do_readlinkat+0x14e/0x180
> [ 16.162459] __x64_sys_readlinkat+0x1c/0x30
> [ 16.162461] do_syscall_64+0x7e/0x6b0
> [ 16.162465] ? post_alloc_hook+0xb9/0x140
> [ 16.162468] ? get_page_from_freelist+0x478/0x720
> [ 16.162470] ? path_openat+0xb3/0x2a0
> [ 16.162472] ? __alloc_frozen_pages_noprof+0x192/0x350
> [ 16.162474] ? count_memcg_events+0xd6/0x210
> [ 16.162476] ? memcg1_commit_charge+0x7a/0xa0
> [ 16.162479] ? mod_memcg_lruvec_state+0xe7/0x2d0
> [ 16.162481] ? charge_memcg+0x48/0x80
> [ 16.162482] ? lruvec_stat_mod_folio+0x85/0xd0
> [ 16.162484] ? __folio_mod_stat+0x2d/0x90
> [ 16.162487] ? set_ptes.isra.0+0x36/0x80
> [ 16.162490] ? do_anonymous_page+0x100/0x4a0
> [ 16.162492] ? __handle_mm_fault+0x45d/0x6f0
> [ 16.162493] ? count_memcg_events+0xd6/0x210
> [ 16.162494] ? handle_mm_fault+0x212/0x340
> [ 16.162495] ? do_user_addr_fault+0x2b4/0x7b0
> [ 16.162500] ? irqentry_exit+0x6d/0x540
> [ 16.162502] ? exc_page_fault+0x7e/0x1a0
> [ 16.162503] entry_SYSCALL_64_after_hwframe+0x76/0x7e
For this problem, I have a hypothesis inspired by a comment in the
patch "slab: remove cpu (partial) slabs usage from allocation paths":
/*
* get a single object from the slab. This might race against __slab_free(),
* which however has to take the list_lock if it's about to make the slab fully
* free.
*/
My understanding is that this comment is pointing out a possible race between
__slab_free() and get_from_partial_node(). Since __slab_free() takes
n->list_lock when it is about to make the slab fully free, and
get_from_partial_node() also takes the same lock, the two paths should be
mutually excluded by the lock and thus safe.
However, I'm wondering if there could be another race window. Suppose CPU0's
get_from_partial_node() has already finished __slab_update_freelist(), but has
not yet reached remove_partial(). In that gap, another CPU (CPU1) could free an
object to the same slab via __slab_free(). CPU1 would observe was_full == 1 (due
to the previous get_from_partial_node()->__slab_update_freelist() on CPU0), and
then __slab_free() would call put_cpu_partial(s, slab, 1) without holding
n->list_lock, trying to add this slab to the CPU partial list. In that case,
both paths would operate on the same union field in struct slab, which might
lead to list corruption.
>
> >
> > > Reproducer
> > > ==========
> > >
> > > Hardware: NUMA machine with >= 32 CPUs
> > > Kernel: v7.0-rc (with slab/for-7.0/sheaves merged)
> > >
> > > # build kublk selftest
> > > make -C tools/testing/selftests/ublk/
> > >
> > > # create ublk null target device with 16 queues
> > > tools/testing/selftests/ublk/kublk add -t null -q 16
> > >
> > > # run fio/t/io_uring benchmark: 16 jobs, 20 seconds, non-polled
> > > taskset -c 0-31 fio/t/io_uring -p0 -n 16 -r 20 /dev/ublkb0
> > >
> > > # cleanup
> > > tools/testing/selftests/ublk/kublk del -n 0
> > >
> > > Good: v6.19 (and 41f1a08645ab, the mainline parent of the slab merge)
> > > Bad: 815c8e35511d (Merge branch 'slab/for-7.0/sheaves' into slab/for-next)
> >
> > Thanks for such detailed steps to reproduce :)
> >
> > > perf profile (bad kernel)
> > > =========================
> > >
> > > ~47% of CPU time is spent in bio allocation hitting the SLUB slow path,
> > > with massive spinlock contention on the node partial list lock:
> > >
> > > + 47.65% 1.21% io_uring [k] bio_alloc_bioset
> > > - 44.87% 0.45% io_uring [k] kmem_cache_alloc_noprof
> > > - 44.41% kmem_cache_alloc_noprof
> > > - 43.89% ___slab_alloc
> > > + 41.16% get_from_any_partial
> > > 0.91% get_from_partial_node
> > > + 0.87% alloc_from_new_slab
> > > + 0.65% allocate_slab
> > > - 44.70% 0.21% io_uring [k] mempool_alloc_noprof
> > > - 44.49% mempool_alloc_noprof
> > > - 44.43% kmem_cache_alloc_noprof
> > > - 43.90% ___slab_alloc
> > > + 41.18% get_from_any_partial
> > > 0.90% get_from_partial_node
> > > + 0.87% alloc_from_new_slab
> > > + 0.65% allocate_slab
> > > + 41.23% 0.10% io_uring [k] get_from_any_partial
> > > + 40.82% 0.48% io_uring [k] __raw_spin_lock_irqsave
> > > - 40.75% 0.20% io_uring [k] get_from_partial_node
> > > - 40.56% get_from_partial_node
> > > - 38.83% __raw_spin_lock_irqsave
> > > 38.65% native_queued_spin_lock_slowpath
> >
> > That's pretty severe contention. Interestingly, the profile shows
> > severe contention on the alloc path, but I don't see the free path here.
> > Wondering why only the alloc path is suffering, hmm...
>
> The free path looks fine.
>
> + 2.84% 0.16% kublk [kernel.kallsyms] [k] mempool_free
> + 2.66% 0.17% kublk [kernel.kallsyms] [k] security_uring_cmd
> + 2.57% 0.36% kublk [kernel.kallsyms] [k] __slab_free
>
> >
> > Anyway, I think there may be two pieces contributing to this contention:
> >
> > Part 1) We probably made the portion of slowpath bigger,
> > by caching a smaller number of objects per CPU
> > after transitioning to sheaves.
> >
> > Part 2) We probably made the slowpath much slower.
> >
> > We need to investigate those parts separately.
> >
> > Regarding Part 1:
> >
> > # Point 1. The CPU slab was not considered in the sheaf capacity calculation
> >
> > calculate_sheaf_capacity() does not take into account that the CPU slab
> > was also cached per CPU. Shouldn't we add oo_objects(s->oo) to the existing
> > calculation to cache a number of objects similar to the CPU slab + percpu
> > partial slab list layers that SLUB previously had?
> >
> > # Point 2. SLUB no longer relies on "Slabs are half-full" assumption,
> > > # and that probably means we're caching fewer objects per CPU.
> >
> > Because SLUB previously assumed "slabs are half-full" when calculating
> > > the number of slabs to cache per CPU, it could actually cache up to twice
> > > as many objects as intended when slabs are mostly empty.
> >
> > Because sheaves track the number of objects precisely, that inaccuracy
> > is gone. If the workload was previously benefiting from the inaccuracy,
> > sheaves can make CPUs cache a smaller number of objects per CPU compared
> > to the percpu slab caching layer.
> >
> > Anyway, I guess we need to check how many objects are actually
> > cached per CPU w/ and w/o sheaves, during the benchmark.
>
> In the workload `fio/t/io_uring -p0 -n 16 -r 20 /dev/ublkb0`, queue depth
> is 128, so there should be 128 inflight bios across these 16 tasks/CPUs.
>
> >
> > After making sure the number of objects cached per CPU is the same as
> > before, we could further investigate how much Part 2 plays into it.
> >
> > Slightly off-topic, by the way, slab currently doesn't let system admins
> > set a custom sheaf_capacity. Instead, calculate_sheaf_capacity() sets
> > the default capacity. I think we need to allow sys admins to set a custom
> > sheaf_capacity in the very near future.
> >
> > > Analysis
> > > ========
> > >
> > > The ublk null target workload exposes a cross-CPU slab allocation
> > > pattern: bios are allocated on the io_uring submitter CPU during block
> > > layer submission, but freed on a different CPU — the ublk daemon thread
> > > that runs the completion via io_uring_cmd_complete_in_task() task work.
> > > And the completion CPU stays in the same LLC or NUMA node as the submission CPU.
> >
> > Ok, so a submitter CPU keeps allocating objects, while a completion CPU
> > keeps freeing objects.
>
> Yes.
>
>
> Thanks,
> Ming
>
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-25 5:32 ` Hao Li
@ 2026-02-25 6:54 ` Harry Yoo
2026-02-25 7:06 ` Hao Li
0 siblings, 1 reply; 25+ messages in thread
From: Harry Yoo @ 2026-02-25 6:54 UTC (permalink / raw)
To: Hao Li
Cc: Ming Lei, Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, surenb
On Wed, Feb 25, 2026 at 01:32:36PM +0800, Hao Li wrote:
> On Tue, Feb 24, 2026 at 05:07:18PM +0800, Ming Lei wrote:
> > Hi Harry,
> >
> > On Tue, Feb 24, 2026 at 02:00:15PM +0900, Harry Yoo wrote:
> > > On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> > > > Hello Vlastimil and MM guys,
> > >
> > > Hi Ming, thanks for the report!
> > >
> > > > The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
> > > > 'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
> > > > performance regression for workloads with persistent cross-CPU
> > > > alloc/free patterns. ublk null target benchmark IOPS drops
> > > > significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
> > > > drop).
> > > >
> > > > Bisecting within the sheaves series is blocked by a kernel panic at
> > > > 17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
> > > > paths"), so the exact first bad commit could not be identified.
> > >
> > > Ouch. Why did it crash?
> >
> > [ 16.162422] Oops: general protection fault, probably for non-canonical address 0xdead000000000110: 0000 [#1] SMP NOPTI
> > [ 16.162426] CPU: 44 UID: 0 PID: 908 Comm: (udev-worker) Not tainted 6.19.0-rc5_master+ #116 PREEMPT(lazy)
> > [ 16.162429] Hardware name: Giga Computing MZ73-LM2-000/MZ73-LM2-000, BIOS R19_F40 05/12/2025
> > [ 16.162430] RIP: 0010:__put_partials+0x2f/0x140
> > [ 16.162437] Code: 41 57 41 56 49 89 f6 41 55 49 89 fd 31 ff 41 54 45 31 e4 55 53 48 83 ec 18 48 c7 44 24 10 00 00 00 00 eb 03 48 89 df 4c9
> > [ 16.162438] RSP: 0018:ff5117c0ca2dfa60 EFLAGS: 00010086
> > [ 16.162441] RAX: 0000000000000001 RBX: ff1b266981200d80 RCX: 0000000000000246
> > [ 16.162442] RDX: ff1b266981200d90 RSI: ff1b266981200d90 RDI: ff1b266981200d80
> > [ 16.162442] RBP: dead000000000100 R08: 0000000000000000 R09: ffffffffa761bf5e
> > [ 16.162443] R10: ffb6d4b7841d5400 R11: ff1b2669800575c0 R12: 0000000000000000
> > [ 16.162444] R13: ff1b2669800575c0 R14: dead000000000100 R15: ffb6d4b7846be410
> > [ 16.162445] FS: 00007f5fdccc23c0(0000) GS:ff1b267902427000(0000) knlGS:0000000000000000
> > [ 16.162446] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 16.162446] CR2: 0000559824c6c058 CR3: 000000011fb49001 CR4: 0000000000f71ef0
> > [ 16.162447] PKRU: 55555554
> > [ 16.162448] Call Trace:
> > [ 16.162450] <TASK>
> > [ 16.162452] kmem_cache_free+0x410/0x490
> > [ 16.162454] do_readlinkat+0x14e/0x180
> > [ 16.162459] __x64_sys_readlinkat+0x1c/0x30
> > [ 16.162461] do_syscall_64+0x7e/0x6b0
> > [ 16.162465] ? post_alloc_hook+0xb9/0x140
> > [ 16.162468] ? get_page_from_freelist+0x478/0x720
> > [ 16.162470] ? path_openat+0xb3/0x2a0
> > [ 16.162472] ? __alloc_frozen_pages_noprof+0x192/0x350
> > [ 16.162474] ? count_memcg_events+0xd6/0x210
> > [ 16.162476] ? memcg1_commit_charge+0x7a/0xa0
> > [ 16.162479] ? mod_memcg_lruvec_state+0xe7/0x2d0
> > [ 16.162481] ? charge_memcg+0x48/0x80
> > [ 16.162482] ? lruvec_stat_mod_folio+0x85/0xd0
> > [ 16.162484] ? __folio_mod_stat+0x2d/0x90
> > [ 16.162487] ? set_ptes.isra.0+0x36/0x80
> > [ 16.162490] ? do_anonymous_page+0x100/0x4a0
> > [ 16.162492] ? __handle_mm_fault+0x45d/0x6f0
> > [ 16.162493] ? count_memcg_events+0xd6/0x210
> > [ 16.162494] ? handle_mm_fault+0x212/0x340
> > [ 16.162495] ? do_user_addr_fault+0x2b4/0x7b0
> > [ 16.162500] ? irqentry_exit+0x6d/0x540
> > [ 16.162502] ? exc_page_fault+0x7e/0x1a0
> > [ 16.162503] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> For this problem, I have a hypothesis which is inspired by a comment in the
> patch "slab: remove cpu (partial) slabs usage from allocation paths":
>
> /*
> * get a single object from the slab. This might race against __slab_free(),
> * which however has to take the list_lock if it's about to make the slab fully
> * free.
> */
>
> My understanding is that this comment is pointing out a possible race between
> __slab_free() and get_from_partial_node(). Since __slab_free() takes
> n->list_lock when it is about to make the slab fully free, and
> get_from_partial_node() also takes the same lock, the two paths should be
> mutually excluded by the lock and thus safe.
>
> However, I'm wondering if there could be another race window. Suppose CPU0's
> get_from_partial_node() has already finished __slab_update_freelist(), but has
> not yet reached remove_partial(). In that gap, another CPU1 could free an object
> to the same slab via __slab_free(). CPU1 would observe was_full == 1 (due to the
> previous get_from_partial_node()->__slab_update_freelist() on CPU0), and then
>
> __slab_free() will call put_cpu_partial(s, slab, 1) without holding
> n->list_lock, trying to add this slab to the CPU partial list.
If CPU1 observes was_full == 1, it should spin on n->list_lock and wait
for CPU0 to release the lock. And CPU0 will remove the slab from the
partial list before releasing the lock. Or am I missing something?
> In that case,
> both paths would operate on the same union field in struct slab, which might
> lead to list corruption.
Not sure how the scenario you describe could happen:
CPU 0 CPU1
- get_from_partial_node()
-> spin_lock(&n->list_lock)
- __slab_free()
-> __slab_update_freelist(),
slab becomes full
-> was_full == 1
-> spin_lock(&n->list_lock)
// starts spining
-> if (!new.freelist)
-> remove_partial()
-> spin_unlock()
-> spin_lock(&n->list_lock)
// acquired!
-> slab_update_freelist()
-> spin_unlock(&n->list_lock)
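FWIW, the ordering in the diagram can also be written down as a toy sequential sketch. This is single-threaded, with booleans standing in for the real lock and list state; it only models the argument, not the kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy state for one slab and one node. */
static bool list_lock_held;		/* n->list_lock */
static bool slab_on_partial_list = true;
static bool slab_full;			/* freelist became NULL on CPU0 */

/* CPU0: get_from_partial_node() runs entirely under n->list_lock. */
static void cpu0_get_from_partial_node(void)
{
	list_lock_held = true;		/* spin_lock(&n->list_lock) */
	slab_full = true;		/* __slab_update_freelist() takes last object */
	slab_on_partial_list = false;	/* remove_partial() before unlock */
	list_lock_held = false;		/* spin_unlock(&n->list_lock) */
}

/* CPU1: __slab_free() observing was_full == 1 must take n->list_lock
 * first, so it can only proceed after CPU0's critical section is done. */
static bool cpu1_slab_free_after_lock(void)
{
	assert(!list_lock_held);	/* lock acquisition orders us after CPU0 */
	assert(slab_full);		/* the was_full == 1 observation */
	return slab_on_partial_list;	/* already false by now */
}
```

By the time CPU1 holds the lock, remove_partial() has already run, so CPU1 cannot find the slab on the node partial list and collide with it there.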
--
Cheers,
Harry / Hyeonggon
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-25 6:54 ` Harry Yoo
@ 2026-02-25 7:06 ` Hao Li
2026-02-25 7:19 ` Harry Yoo
0 siblings, 1 reply; 25+ messages in thread
From: Hao Li @ 2026-02-25 7:06 UTC (permalink / raw)
To: Harry Yoo
Cc: Ming Lei, Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, surenb
On Wed, Feb 25, 2026 at 03:54:06PM +0900, Harry Yoo wrote:
> On Wed, Feb 25, 2026 at 01:32:36PM +0800, Hao Li wrote:
> > On Tue, Feb 24, 2026 at 05:07:18PM +0800, Ming Lei wrote:
> > > Hi Harry,
> > >
> > > On Tue, Feb 24, 2026 at 02:00:15PM +0900, Harry Yoo wrote:
> > > > On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> > > > > Hello Vlastimil and MM guys,
> > > >
> > > > Hi Ming, thanks for the report!
> > > >
> > > > > The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
> > > > > 'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
> > > > > performance regression for workloads with persistent cross-CPU
> > > > > alloc/free patterns. ublk null target benchmark IOPS drops
> > > > > significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
> > > > > drop).
> > > > >
> > > > > Bisecting within the sheaves series is blocked by a kernel panic at
> > > > > 17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
> > > > > paths"), so the exact first bad commit could not be identified.
> > > >
> > > > Ouch. Why did it crash?
> > >
> > > [ 16.162422] Oops: general protection fault, probably for non-canonical address 0xdead000000000110: 0000 [#1] SMP NOPTI
> > > [ 16.162426] CPU: 44 UID: 0 PID: 908 Comm: (udev-worker) Not tainted 6.19.0-rc5_master+ #116 PREEMPT(lazy)
> > > [ 16.162429] Hardware name: Giga Computing MZ73-LM2-000/MZ73-LM2-000, BIOS R19_F40 05/12/2025
> > > [ 16.162430] RIP: 0010:__put_partials+0x2f/0x140
> > > [ 16.162437] Code: 41 57 41 56 49 89 f6 41 55 49 89 fd 31 ff 41 54 45 31 e4 55 53 48 83 ec 18 48 c7 44 24 10 00 00 00 00 eb 03 48 89 df 4c9
> > > [ 16.162438] RSP: 0018:ff5117c0ca2dfa60 EFLAGS: 00010086
> > > [ 16.162441] RAX: 0000000000000001 RBX: ff1b266981200d80 RCX: 0000000000000246
> > > [ 16.162442] RDX: ff1b266981200d90 RSI: ff1b266981200d90 RDI: ff1b266981200d80
> > > [ 16.162442] RBP: dead000000000100 R08: 0000000000000000 R09: ffffffffa761bf5e
> > > [ 16.162443] R10: ffb6d4b7841d5400 R11: ff1b2669800575c0 R12: 0000000000000000
> > > [ 16.162444] R13: ff1b2669800575c0 R14: dead000000000100 R15: ffb6d4b7846be410
> > > [ 16.162445] FS: 00007f5fdccc23c0(0000) GS:ff1b267902427000(0000) knlGS:0000000000000000
> > > [ 16.162446] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [ 16.162446] CR2: 0000559824c6c058 CR3: 000000011fb49001 CR4: 0000000000f71ef0
> > > [ 16.162447] PKRU: 55555554
> > > [ 16.162448] Call Trace:
> > > [ 16.162450] <TASK>
> > > [ 16.162452] kmem_cache_free+0x410/0x490
> > > [ 16.162454] do_readlinkat+0x14e/0x180
> > > [ 16.162459] __x64_sys_readlinkat+0x1c/0x30
> > > [ 16.162461] do_syscall_64+0x7e/0x6b0
> > > [ 16.162465] ? post_alloc_hook+0xb9/0x140
> > > [ 16.162468] ? get_page_from_freelist+0x478/0x720
> > > [ 16.162470] ? path_openat+0xb3/0x2a0
> > > [ 16.162472] ? __alloc_frozen_pages_noprof+0x192/0x350
> > > [ 16.162474] ? count_memcg_events+0xd6/0x210
> > > [ 16.162476] ? memcg1_commit_charge+0x7a/0xa0
> > > [ 16.162479] ? mod_memcg_lruvec_state+0xe7/0x2d0
> > > [ 16.162481] ? charge_memcg+0x48/0x80
> > > [ 16.162482] ? lruvec_stat_mod_folio+0x85/0xd0
> > > [ 16.162484] ? __folio_mod_stat+0x2d/0x90
> > > [ 16.162487] ? set_ptes.isra.0+0x36/0x80
> > > [ 16.162490] ? do_anonymous_page+0x100/0x4a0
> > > [ 16.162492] ? __handle_mm_fault+0x45d/0x6f0
> > > [ 16.162493] ? count_memcg_events+0xd6/0x210
> > > [ 16.162494] ? handle_mm_fault+0x212/0x340
> > > [ 16.162495] ? do_user_addr_fault+0x2b4/0x7b0
> > > [ 16.162500] ? irqentry_exit+0x6d/0x540
> > > [ 16.162502] ? exc_page_fault+0x7e/0x1a0
> > > [ 16.162503] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >
> > For this problem, I have a hypothesis which is inspired by a comment in the
> > patch "slab: remove cpu (partial) slabs usage from allocation paths":
> >
> > /*
> > * get a single object from the slab. This might race against __slab_free(),
> > * which however has to take the list_lock if it's about to make the slab fully
> > * free.
> > */
> >
> > My understanding is that this comment is pointing out a possible race between
> > __slab_free() and get_from_partial_node(). Since __slab_free() takes
> > n->list_lock when it is about to make the slab fully free, and
> > get_from_partial_node() also takes the same lock, the two paths should be
> > mutually excluded by the lock and thus safe.
> >
> > However, I'm wondering if there could be another race window. Suppose CPU0's
> > get_from_partial_node() has already finished __slab_update_freelist(), but has
> > not yet reached remove_partial(). In that gap, another CPU1 could free an object
> > to the same slab via __slab_free(). CPU1 would observe was_full == 1 (due to the
> > previous get_from_partial_node()->__slab_update_freelist() on CPU0), and then
> >
> > __slab_free() will call put_cpu_partial(s, slab, 1) without holding
> > n->list_lock, trying to add this slab to the CPU partial list.
>
> If CPU1 observes was_full == 1, it should spin on n->list_lock and wait
> for CPU0 to release the lock. And CPU0 will remove the slab from the
> partial list before releasing the lock. Or am I missing something?
>
> > In that case,
> > both paths would operate on the same union field in struct slab, which might
> > lead to list corruption.
>
> Not sure how the scenario you describe could happen:
>
> CPU 0                                   CPU1
> - get_from_partial_node()
>   -> spin_lock(&n->list_lock)
>                                         - __slab_free()
>   -> __slab_update_freelist(),
>      slab becomes full
>                                         -> was_full == 1
>                                         -> spin_lock(&n->list_lock)
In __slab_free, if was_full == 1, then the condition
!(IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) && was_full) becomes false, so it won't
enter the "if" block and therefore n->list_lock is not acquired.
Does that sound right?
--
Thanks,
Hao
>                                            // starts spinning
>   -> if (!new.freelist)
>      -> remove_partial()
>   -> spin_unlock()
>                                         -> spin_lock(&n->list_lock)
>                                            // acquired!
>                                         -> slab_update_freelist()
>                                         -> spin_unlock(&n->list_lock)
>
> --
> Cheers,
> Harry / Hyeonggon
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-25 7:06 ` Hao Li
@ 2026-02-25 7:19 ` Harry Yoo
2026-02-25 8:19 ` Hao Li
2026-02-25 8:21 ` Harry Yoo
0 siblings, 2 replies; 25+ messages in thread
From: Harry Yoo @ 2026-02-25 7:19 UTC (permalink / raw)
To: Hao Li
Cc: Ming Lei, Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, surenb
On Wed, Feb 25, 2026 at 03:06:46PM +0800, Hao Li wrote:
> On Wed, Feb 25, 2026 at 03:54:06PM +0900, Harry Yoo wrote:
> > On Wed, Feb 25, 2026 at 01:32:36PM +0800, Hao Li wrote:
> > > On Tue, Feb 24, 2026 at 05:07:18PM +0800, Ming Lei wrote:
> > > > Hi Harry,
> > > >
> > > > On Tue, Feb 24, 2026 at 02:00:15PM +0900, Harry Yoo wrote:
> > > > > On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> > > > > > Hello Vlastimil and MM guys,
> > > > >
> > > > > Hi Ming, thanks for the report!
> > > > >
> > > > > > The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
> > > > > > 'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
> > > > > > performance regression for workloads with persistent cross-CPU
> > > > > > alloc/free patterns. ublk null target benchmark IOPS drops
> > > > > > significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
> > > > > > drop).
> > > > > >
> > > > > > Bisecting within the sheaves series is blocked by a kernel panic at
> > > > > > 17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
> > > > > > paths"), so the exact first bad commit could not be identified.
> > > > >
> > > > > Ouch. Why did it crash?
> > > >
> > > > [ 16.162422] Oops: general protection fault, probably for non-canonical address 0xdead000000000110: 0000 [#1] SMP NOPTI
> > > > [ 16.162426] CPU: 44 UID: 0 PID: 908 Comm: (udev-worker) Not tainted 6.19.0-rc5_master+ #116 PREEMPT(lazy)
> > > > [ 16.162429] Hardware name: Giga Computing MZ73-LM2-000/MZ73-LM2-000, BIOS R19_F40 05/12/2025
> > > > [ 16.162430] RIP: 0010:__put_partials+0x2f/0x140
> > > > [ 16.162437] Code: 41 57 41 56 49 89 f6 41 55 49 89 fd 31 ff 41 54 45 31 e4 55 53 48 83 ec 18 48 c7 44 24 10 00 00 00 00 eb 03 48 89 df 4c9
> > > > [ 16.162438] RSP: 0018:ff5117c0ca2dfa60 EFLAGS: 00010086
> > > > [ 16.162441] RAX: 0000000000000001 RBX: ff1b266981200d80 RCX: 0000000000000246
> > > > [ 16.162442] RDX: ff1b266981200d90 RSI: ff1b266981200d90 RDI: ff1b266981200d80
> > > > [ 16.162442] RBP: dead000000000100 R08: 0000000000000000 R09: ffffffffa761bf5e
> > > > [ 16.162443] R10: ffb6d4b7841d5400 R11: ff1b2669800575c0 R12: 0000000000000000
> > > > [ 16.162444] R13: ff1b2669800575c0 R14: dead000000000100 R15: ffb6d4b7846be410
> > > > [ 16.162445] FS: 00007f5fdccc23c0(0000) GS:ff1b267902427000(0000) knlGS:0000000000000000
> > > > [ 16.162446] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [ 16.162446] CR2: 0000559824c6c058 CR3: 000000011fb49001 CR4: 0000000000f71ef0
> > > > [ 16.162447] PKRU: 55555554
> > > > [ 16.162448] Call Trace:
> > > > [ 16.162450] <TASK>
> > > > [ 16.162452] kmem_cache_free+0x410/0x490
> > > > [ 16.162454] do_readlinkat+0x14e/0x180
> > > > [ 16.162459] __x64_sys_readlinkat+0x1c/0x30
> > > > [ 16.162461] do_syscall_64+0x7e/0x6b0
> > > > [ 16.162465] ? post_alloc_hook+0xb9/0x140
> > > > [ 16.162468] ? get_page_from_freelist+0x478/0x720
> > > > [ 16.162470] ? path_openat+0xb3/0x2a0
> > > > [ 16.162472] ? __alloc_frozen_pages_noprof+0x192/0x350
> > > > [ 16.162474] ? count_memcg_events+0xd6/0x210
> > > > [ 16.162476] ? memcg1_commit_charge+0x7a/0xa0
> > > > [ 16.162479] ? mod_memcg_lruvec_state+0xe7/0x2d0
> > > > [ 16.162481] ? charge_memcg+0x48/0x80
> > > > [ 16.162482] ? lruvec_stat_mod_folio+0x85/0xd0
> > > > [ 16.162484] ? __folio_mod_stat+0x2d/0x90
> > > > [ 16.162487] ? set_ptes.isra.0+0x36/0x80
> > > > [ 16.162490] ? do_anonymous_page+0x100/0x4a0
> > > > [ 16.162492] ? __handle_mm_fault+0x45d/0x6f0
> > > > [ 16.162493] ? count_memcg_events+0xd6/0x210
> > > > [ 16.162494] ? handle_mm_fault+0x212/0x340
> > > > [ 16.162495] ? do_user_addr_fault+0x2b4/0x7b0
> > > > [ 16.162500] ? irqentry_exit+0x6d/0x540
> > > > [ 16.162502] ? exc_page_fault+0x7e/0x1a0
> > > > [ 16.162503] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > >
> > > For this problem, I have a hypothesis which is inspired by a comment in the
> > > patch "slab: remove cpu (partial) slabs usage from allocation paths":
> > >
> > > /*
> > > * get a single object from the slab. This might race against __slab_free(),
> > > * which however has to take the list_lock if it's about to make the slab fully
> > > * free.
> > > */
> > >
> > > My understanding is that this comment is pointing out a possible race between
> > > __slab_free() and get_from_partial_node(). Since __slab_free() takes
> > > n->list_lock when it is about to make the slab fully free, and
> > > get_from_partial_node() also takes the same lock, the two paths should be
> > > mutually excluded by the lock and thus safe.
> > >
> > > However, I'm wondering if there could be another race window. Suppose CPU0's
> > > get_from_partial_node() has already finished __slab_update_freelist(), but has
> > > not yet reached remove_partial(). In that gap, another CPU1 could free an object
> > > to the same slab via __slab_free(). CPU1 would observe was_full == 1 (due to the
> > > previous get_from_partial_node()->__slab_update_freelist() on CPU0), and then
> > >
> > > __slab_free() will call put_cpu_partial(s, slab, 1) without holding
> > > n->list_lock, trying to add this slab to the CPU partial list.
> >
> > If CPU1 observes was_full == 1, it should spin on n->list_lock and wait
> > for CPU0 to release the lock. And CPU0 will remove the slab from the
> > partial list before releasing the lock. Or am I missing something?
> >
> > > In that case,
> > > both paths would operate on the same union field in struct slab, which might
> > > lead to list corruption.
> >
> > Not sure how the scenario you describe could happen:
> >
> > CPU 0 CPU1
> > - get_from_partial_node()
> > -> spin_lock(&n->list_lock)
> > - __slab_free()
> > -> __slab_update_freelist(),
> > slab becomes full
> > -> was_full == 1
> > -> spin_lock(&n->list_lock)
>
> In __slab_free, if was_full == 1, then the condition
> !(IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) && was_full) becomes false, so it won't
> enter the "if" block and therefore n->list_lock is not acquired.
> Does that sound right.
Nah, you're right. Just slipped my mind. No need to acquire the lock
if it was full, because that means it's not on the partial list.
Hmm... but the logic has been there for a very long time.
Looks like we broke a premise the percpu slab caching layer needs to
work correctly while transitioning to sheaves.
I think the new behavior introduced during the sheaves transition is that
SLUB can now allocate objects from slabs without freezing them. Allocating
objects from a slab without freezing it seems to confuse the free path...
But not sure if we could "fix" that because the percpu partial slab
caching layer is gone anyway :)
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-25 7:19 ` Harry Yoo
@ 2026-02-25 8:19 ` Hao Li
2026-02-25 8:41 ` Harry Yoo
2026-02-25 8:21 ` Harry Yoo
1 sibling, 1 reply; 25+ messages in thread
From: Hao Li @ 2026-02-25 8:19 UTC (permalink / raw)
To: Harry Yoo
Cc: Ming Lei, Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, surenb
On Wed, Feb 25, 2026 at 04:19:41PM +0900, Harry Yoo wrote:
> On Wed, Feb 25, 2026 at 03:06:46PM +0800, Hao Li wrote:
> > On Wed, Feb 25, 2026 at 03:54:06PM +0900, Harry Yoo wrote:
> > > On Wed, Feb 25, 2026 at 01:32:36PM +0800, Hao Li wrote:
> > > > On Tue, Feb 24, 2026 at 05:07:18PM +0800, Ming Lei wrote:
> > > > > Hi Harry,
> > > > >
> > > > > On Tue, Feb 24, 2026 at 02:00:15PM +0900, Harry Yoo wrote:
> > > > > > On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> > > > > > > Hello Vlastimil and MM guys,
> > > > > >
> > > > > > Hi Ming, thanks for the report!
> > > > > >
> > > > > > > The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
> > > > > > > 'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
> > > > > > > performance regression for workloads with persistent cross-CPU
> > > > > > > alloc/free patterns. ublk null target benchmark IOPS drops
> > > > > > > significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
> > > > > > > drop).
> > > > > > >
> > > > > > > Bisecting within the sheaves series is blocked by a kernel panic at
> > > > > > > 17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
> > > > > > > paths"), so the exact first bad commit could not be identified.
> > > > > >
> > > > > > Ouch. Why did it crash?
> > > > >
> > > > > [ 16.162422] Oops: general protection fault, probably for non-canonical address 0xdead000000000110: 0000 [#1] SMP NOPTI
> > > > > [ 16.162426] CPU: 44 UID: 0 PID: 908 Comm: (udev-worker) Not tainted 6.19.0-rc5_master+ #116 PREEMPT(lazy)
> > > > > [ 16.162429] Hardware name: Giga Computing MZ73-LM2-000/MZ73-LM2-000, BIOS R19_F40 05/12/2025
> > > > > [ 16.162430] RIP: 0010:__put_partials+0x2f/0x140
> > > > > [ 16.162437] Code: 41 57 41 56 49 89 f6 41 55 49 89 fd 31 ff 41 54 45 31 e4 55 53 48 83 ec 18 48 c7 44 24 10 00 00 00 00 eb 03 48 89 df 4c9
> > > > > [ 16.162438] RSP: 0018:ff5117c0ca2dfa60 EFLAGS: 00010086
> > > > > [ 16.162441] RAX: 0000000000000001 RBX: ff1b266981200d80 RCX: 0000000000000246
> > > > > [ 16.162442] RDX: ff1b266981200d90 RSI: ff1b266981200d90 RDI: ff1b266981200d80
> > > > > [ 16.162442] RBP: dead000000000100 R08: 0000000000000000 R09: ffffffffa761bf5e
> > > > > [ 16.162443] R10: ffb6d4b7841d5400 R11: ff1b2669800575c0 R12: 0000000000000000
> > > > > [ 16.162444] R13: ff1b2669800575c0 R14: dead000000000100 R15: ffb6d4b7846be410
> > > > > [ 16.162445] FS: 00007f5fdccc23c0(0000) GS:ff1b267902427000(0000) knlGS:0000000000000000
> > > > > [ 16.162446] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > [ 16.162446] CR2: 0000559824c6c058 CR3: 000000011fb49001 CR4: 0000000000f71ef0
> > > > > [ 16.162447] PKRU: 55555554
> > > > > [ 16.162448] Call Trace:
> > > > > [ 16.162450] <TASK>
> > > > > [ 16.162452] kmem_cache_free+0x410/0x490
> > > > > [ 16.162454] do_readlinkat+0x14e/0x180
> > > > > [ 16.162459] __x64_sys_readlinkat+0x1c/0x30
> > > > > [ 16.162461] do_syscall_64+0x7e/0x6b0
> > > > > [ 16.162465] ? post_alloc_hook+0xb9/0x140
> > > > > [ 16.162468] ? get_page_from_freelist+0x478/0x720
> > > > > [ 16.162470] ? path_openat+0xb3/0x2a0
> > > > > [ 16.162472] ? __alloc_frozen_pages_noprof+0x192/0x350
> > > > > [ 16.162474] ? count_memcg_events+0xd6/0x210
> > > > > [ 16.162476] ? memcg1_commit_charge+0x7a/0xa0
> > > > > [ 16.162479] ? mod_memcg_lruvec_state+0xe7/0x2d0
> > > > > [ 16.162481] ? charge_memcg+0x48/0x80
> > > > > [ 16.162482] ? lruvec_stat_mod_folio+0x85/0xd0
> > > > > [ 16.162484] ? __folio_mod_stat+0x2d/0x90
> > > > > [ 16.162487] ? set_ptes.isra.0+0x36/0x80
> > > > > [ 16.162490] ? do_anonymous_page+0x100/0x4a0
> > > > > [ 16.162492] ? __handle_mm_fault+0x45d/0x6f0
> > > > > [ 16.162493] ? count_memcg_events+0xd6/0x210
> > > > > [ 16.162494] ? handle_mm_fault+0x212/0x340
> > > > > [ 16.162495] ? do_user_addr_fault+0x2b4/0x7b0
> > > > > [ 16.162500] ? irqentry_exit+0x6d/0x540
> > > > > [ 16.162502] ? exc_page_fault+0x7e/0x1a0
> > > > > [ 16.162503] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > >
> > > > For this problem, I have a hypothesis which is inspired by a comment in the
> > > > patch "slab: remove cpu (partial) slabs usage from allocation paths":
> > > >
> > > > /*
> > > > * get a single object from the slab. This might race against __slab_free(),
> > > > * which however has to take the list_lock if it's about to make the slab fully
> > > > * free.
> > > > */
> > > >
> > > > My understanding is that this comment is pointing out a possible race between
> > > > __slab_free() and get_from_partial_node(). Since __slab_free() takes
> > > > n->list_lock when it is about to make the slab fully free, and
> > > > get_from_partial_node() also takes the same lock, the two paths should be
> > > > mutually excluded by the lock and thus safe.
> > > >
> > > > However, I'm wondering if there could be another race window. Suppose CPU0's
> > > > get_from_partial_node() has already finished __slab_update_freelist(), but has
> > > > not yet reached remove_partial(). In that gap, another CPU1 could free an object
> > > > to the same slab via __slab_free(). CPU1 would observe was_full == 1 (due to the
> > > > previous get_from_partial_node()->__slab_update_freelist() on CPU0), and then
> > > >
> > > > __slab_free() will call put_cpu_partial(s, slab, 1) without holding
> > > > n->list_lock, trying to add this slab to the CPU partial list.
> > >
> > > If CPU1 observes was_full == 1, it should spin on n->list_lock and wait
> > > for CPU0 to release the lock. And CPU0 will remove the slab from the
> > > partial list before releasing the lock. Or am I missing something?
> > >
> > > > In that case,
> > > > both paths would operate on the same union field in struct slab, which might
> > > > lead to list corruption.
> > >
> > > Not sure how the scenario you describe could happen:
> > >
> > > CPU 0 CPU1
> > > - get_from_partial_node()
> > > -> spin_lock(&n->list_lock)
> > > - __slab_free()
> > > -> __slab_update_freelist(),
> > > slab becomes full
> > > -> was_full == 1
> > > -> spin_lock(&n->list_lock)
> >
> > In __slab_free, if was_full == 1, then the condition
> > !(IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) && was_full) becomes false, so it won't
> > enter the "if" block and therefore n->list_lock is not acquired.
> > Does that sound right.
>
> Nah, you're right. Just slipped my mind. No need to acquire the lock
> if it was full, because that means it's not on the partial list.
Exactly.
>
> Hmm... but the logic has been there for very long time.
Yes.
>
> Looks like we broke a premise for the percpu slab caching layer
> to work correctly, while transitioning to sheaves.
>
> I think the new behavior introduced during the sheaves transition is that
> SLUB can now allocate objects from slabs without freezing it. Allocating
> objects from slab without freezing it seems to confuse the free path...
I feel it's not a big issue.
I think the root cause of this issue is as follows:
Before this commit, get_partial_node would first remove the slab from the node
list and then return the slab to the upper layer for freezing and object
allocation. Therefore, when __slab_free encounters a slab marked as was_full,
that slab would no longer be on the node list, avoiding race conditions with
list operations.
However, after this commit, get_from_partial_node first allocates an object
from the slab and then removes the slab from the node list. In the window
between these two steps, __slab_free might encounter a slab marked as
was_full and then try to add the slab to the CPU partial list, while at the
same time another task is trying to remove the same slab from the node
list, leading to a race condition.
>
> But not sure if we could "fix" that because the percpu partial slab
> caching layer is gone anyway :)
Yes, this bug has already disappeared with subsequent patches...
By the way, to allow Ming Lei to continue the bisect process, maybe we should
come up with a temporary workaround, such as:
	} else if (IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) && was_full) {
		spin_lock_irqsave(&n->list_lock, flags);
		/*
		 * Let this empty critical section push back put_cpu_partial,
		 * ensuring its execution happens after the critical section
		 * of get_from_partial_node running in parallel.
		 */
		spin_unlock_irqrestore(&n->list_lock, flags);
		/*
		 * If we started with a full slab then put it onto the
		 * per cpu partial list.
		 */
		put_cpu_partial(s, slab, 1);
		stat(s, CPU_PARTIAL_FREE);
	}
--
Thanks,
Hao
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-25 7:19 ` Harry Yoo
2026-02-25 8:19 ` Hao Li
@ 2026-02-25 8:21 ` Harry Yoo
1 sibling, 0 replies; 25+ messages in thread
From: Harry Yoo @ 2026-02-25 8:21 UTC (permalink / raw)
To: Hao Li
Cc: Ming Lei, Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, surenb
On Wed, Feb 25, 2026 at 04:19:41PM +0900, Harry Yoo wrote:
> On Wed, Feb 25, 2026 at 03:06:46PM +0800, Hao Li wrote:
> > On Wed, Feb 25, 2026 at 03:54:06PM +0900, Harry Yoo wrote:
> > > On Wed, Feb 25, 2026 at 01:32:36PM +0800, Hao Li wrote:
> > > > On Tue, Feb 24, 2026 at 05:07:18PM +0800, Ming Lei wrote:
> > > > > [ 16.162422] Oops: general protection fault, probably for non-canonical address 0xdead000000000110: 0000 [#1] SMP NOPTI
> > > > > [ 16.162426] CPU: 44 UID: 0 PID: 908 Comm: (udev-worker) Not tainted 6.19.0-rc5_master+ #116 PREEMPT(lazy)
> > > > > [ 16.162429] Hardware name: Giga Computing MZ73-LM2-000/MZ73-LM2-000, BIOS R19_F40 05/12/2025
> > > > > [ 16.162430] RIP: 0010:__put_partials+0x2f/0x140
> > > > > [ 16.162437] Code: 41 57 41 56 49 89 f6 41 55 49 89 fd 31 ff 41 54 45 31 e4 55 53 48 83 ec 18 48 c7 44 24 10 00 00 00 00 eb 03 48 89 df 4c9
> > > > > [ 16.162438] RSP: 0018:ff5117c0ca2dfa60 EFLAGS: 00010086
> > > > > [ 16.162441] RAX: 0000000000000001 RBX: ff1b266981200d80 RCX: 0000000000000246
> > > > > [ 16.162442] RDX: ff1b266981200d90 RSI: ff1b266981200d90 RDI: ff1b266981200d80
> > > > > [ 16.162442] RBP: dead000000000100 R08: 0000000000000000 R09: ffffffffa761bf5e
> > > > > [ 16.162443] R10: ffb6d4b7841d5400 R11: ff1b2669800575c0 R12: 0000000000000000
> > > > > [ 16.162444] R13: ff1b2669800575c0 R14: dead000000000100 R15: ffb6d4b7846be410
> > > > > [ 16.162445] FS: 00007f5fdccc23c0(0000) GS:ff1b267902427000(0000) knlGS:0000000000000000
> > > > > [ 16.162446] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > [ 16.162446] CR2: 0000559824c6c058 CR3: 000000011fb49001 CR4: 0000000000f71ef0
> > > > > [ 16.162447] PKRU: 55555554
> > > > > [ 16.162448] Call Trace:
> > > > > [ 16.162450] <TASK>
> > > > > [ 16.162452] kmem_cache_free+0x410/0x490
> > > > > [ 16.162454] do_readlinkat+0x14e/0x180
> > > > > [ 16.162459] __x64_sys_readlinkat+0x1c/0x30
> > > > > [ 16.162461] do_syscall_64+0x7e/0x6b0
> > > > > [ 16.162465] ? post_alloc_hook+0xb9/0x140
> > > > > [ 16.162468] ? get_page_from_freelist+0x478/0x720
> > > > > [ 16.162470] ? path_openat+0xb3/0x2a0
> > > > > [ 16.162472] ? __alloc_frozen_pages_noprof+0x192/0x350
> > > > > [ 16.162474] ? count_memcg_events+0xd6/0x210
> > > > > [ 16.162476] ? memcg1_commit_charge+0x7a/0xa0
> > > > > [ 16.162479] ? mod_memcg_lruvec_state+0xe7/0x2d0
> > > > > [ 16.162481] ? charge_memcg+0x48/0x80
> > > > > [ 16.162482] ? lruvec_stat_mod_folio+0x85/0xd0
> > > > > [ 16.162484] ? __folio_mod_stat+0x2d/0x90
> > > > > [ 16.162487] ? set_ptes.isra.0+0x36/0x80
> > > > > [ 16.162490] ? do_anonymous_page+0x100/0x4a0
> > > > > [ 16.162492] ? __handle_mm_fault+0x45d/0x6f0
> > > > > [ 16.162493] ? count_memcg_events+0xd6/0x210
> > > > > [ 16.162494] ? handle_mm_fault+0x212/0x340
> > > > > [ 16.162495] ? do_user_addr_fault+0x2b4/0x7b0
> > > > > [ 16.162500] ? irqentry_exit+0x6d/0x540
> > > > > [ 16.162502] ? exc_page_fault+0x7e/0x1a0
> > > > > [ 16.162503] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > >
> > > > For this problem, I have a hypothesis which is inspired by a comment in the
> > > > patch "slab: remove cpu (partial) slabs usage from allocation paths":
> > > >
> > > > /*
> > > > * get a single object from the slab. This might race against __slab_free(),
> > > > * which however has to take the list_lock if it's about to make the slab fully
> > > > * free.
> > > > */
> > > >
> > > > My understanding is that this comment is pointing out a possible race between
> > > > __slab_free() and get_from_partial_node(). Since __slab_free() takes
> > > > n->list_lock when it is about to make the slab fully free, and
> > > > get_from_partial_node() also takes the same lock, the two paths should be
> > > > mutually excluded by the lock and thus safe.
> > > >
> > > > However, I'm wondering if there could be another race window. Suppose CPU0's
> > > > get_from_partial_node() has already finished __slab_update_freelist(), but has
> > > > not yet reached remove_partial(). In that gap, another CPU1 could free an object
> > > > to the same slab via __slab_free(). CPU1 would observe was_full == 1 (due to the
> > > > previous get_from_partial_node()->__slab_update_freelist() on CPU0), and then
> > > >
> > > > __slab_free() will call put_cpu_partial(s, slab, 1) without holding
> > > > n->list_lock, trying to add this slab to the CPU partial list.
> > >
> > > If CPU1 observes was_full == 1, it should spin on n->list_lock and wait
> > > for CPU0 to release the lock. And CPU0 will remove the slab from the
> > > partial list before releasing the lock. Or am I missing something?
> > >
> > > > In that case,
> > > > both paths would operate on the same union field in struct slab, which might
> > > > lead to list corruption.
> > >
> > > Not sure how the scenario you describe could happen:
> > >
> > > CPU 0 CPU1
> > > - get_from_partial_node()
> > > -> spin_lock(&n->list_lock)
> > > - __slab_free()
> > > -> __slab_update_freelist(),
> > > slab becomes full
> > > -> was_full == 1
> > > -> spin_lock(&n->list_lock)
> >
> > In __slab_free, if was_full == 1, then the condition
> > !(IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) && was_full) becomes false, so it won't
> > enter the "if" block and therefore n->list_lock is not acquired.
> > Does that sound right.
>
> Nah, you're right. Just slipped my mind. No need to acquire the lock
> if it was full, because that means it's not on the partial list.
"because it's not on the partial list, and SLUB is going to add it
to the percpu partial slab list (to avoid acquiring the lock)"
> Hmm... but the logic has been there for very long time.
>
> Looks like we broke a premise for the percpu slab caching layer
> to work correctly, while transitioning to sheaves.
>
> I think the new behavior introduced during the sheaves transition is that
> SLUB can now allocate objects from slabs without freezing it. Allocating
> objects from slab without freezing it seems to confuse the free path...
Just elaborating the analysis a bit:
Hao Li (thankfully!) analyzed that there's a race condition between
1) the alloc path removing a slab from the partial list when it transitions
from partial to full, and 2) the free path adding the slab to the percpu
partial slab list when it transitions from full to partial.
The following race could occur:
CPU 0                                   CPU1
- get_from_partial_node()
  -> spin_lock(&n->list_lock)
                                        - __slab_free()
  -> __slab_update_freelist()
     // slab becomes full
                                        -> was_full == 1,
                                           no lock acquired
                                        -> slab_update_freelist()
                                        -> if (was_frozen) // not frozen!
                                        -> else if (was_full)
                                           -> put_cpu_partial(slab)
                                              // add the slab to percpu
                                              // partial slabs
  -> if (!new.freelist)
     -> remove_partial(slab)
                                        // CPU1's percpu partial slab list
                                        // is now corrupted
And later when CPU1 calls __put_partials(), it crashes while
iterating over the percpu partial slab list.
The race condition did not exist before sheaves, because
1) slabs were not on the partial list when the alloc path allocated
objects, and 2) the alloc path froze them before allocating objects.
When slabs are frozen, the free path doesn't call put_cpu_partial().
Commit 17c38c88294d ("slab: remove cpu (partial) slabs usage from
allocation paths") changed both 1) and 2) and introduced the race
described above. Now, 1) slabs are on partial list when the alloc path
allocates objects, and 2) it does not freeze slabs.
Because the alloc path does not freeze slabs, the free path thinks
it can always safely add slabs to the percpu partial slab list, but
this is now racy because there's a window between the slab becoming
full and its removal from the partial list.
This should have been fixed after removing the cpu partial slab layer
from the free path, though.
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-25 8:19 ` Hao Li
@ 2026-02-25 8:41 ` Harry Yoo
2026-02-25 8:54 ` Hao Li
0 siblings, 1 reply; 25+ messages in thread
From: Harry Yoo @ 2026-02-25 8:41 UTC (permalink / raw)
To: Hao Li
Cc: Ming Lei, Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, surenb
On Wed, Feb 25, 2026 at 04:19:49PM +0800, Hao Li wrote:
> On Wed, Feb 25, 2026 at 04:19:41PM +0900, Harry Yoo wrote:
> > On Wed, Feb 25, 2026 at 03:06:46PM +0800, Hao Li wrote:
> > > On Wed, Feb 25, 2026 at 03:54:06PM +0900, Harry Yoo wrote:
> > > > On Wed, Feb 25, 2026 at 01:32:36PM +0800, Hao Li wrote:
> > > > > On Tue, Feb 24, 2026 at 05:07:18PM +0800, Ming Lei wrote:
> > > > > > Hi Harry,
> > > > > >
> > > > > > On Tue, Feb 24, 2026 at 02:00:15PM +0900, Harry Yoo wrote:
> > > > > > > On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> > > > > > > > Hello Vlastimil and MM guys,
> > > > > > >
> > > > > > > Hi Ming, thanks for the report!
> > > > > > >
> > > > > > > > The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
> > > > > > > > 'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
> > > > > > > > performance regression for workloads with persistent cross-CPU
> > > > > > > > alloc/free patterns. ublk null target benchmark IOPS drops
> > > > > > > > significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
> > > > > > > > drop).
> > > > > > > >
> > > > > > > > Bisecting within the sheaves series is blocked by a kernel panic at
> > > > > > > > 17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
> > > > > > > > paths"), so the exact first bad commit could not be identified.
> > > > > > >
> > > > > > > Ouch. Why did it crash?
> > > > > >
> > > > > > [ 16.162422] Oops: general protection fault, probably for non-canonical address 0xdead000000000110: 0000 [#1] SMP NOPTI
> > > > > > [ 16.162426] CPU: 44 UID: 0 PID: 908 Comm: (udev-worker) Not tainted 6.19.0-rc5_master+ #116 PREEMPT(lazy)
> > > > > > [ 16.162429] Hardware name: Giga Computing MZ73-LM2-000/MZ73-LM2-000, BIOS R19_F40 05/12/2025
> > > > > > [ 16.162430] RIP: 0010:__put_partials+0x2f/0x140
> > > > > > [ 16.162437] Code: 41 57 41 56 49 89 f6 41 55 49 89 fd 31 ff 41 54 45 31 e4 55 53 48 83 ec 18 48 c7 44 24 10 00 00 00 00 eb 03 48 89 df 4c9
> > > > > > [ 16.162438] RSP: 0018:ff5117c0ca2dfa60 EFLAGS: 00010086
> > > > > > [ 16.162441] RAX: 0000000000000001 RBX: ff1b266981200d80 RCX: 0000000000000246
> > > > > > [ 16.162442] RDX: ff1b266981200d90 RSI: ff1b266981200d90 RDI: ff1b266981200d80
> > > > > > [ 16.162442] RBP: dead000000000100 R08: 0000000000000000 R09: ffffffffa761bf5e
> > > > > > [ 16.162443] R10: ffb6d4b7841d5400 R11: ff1b2669800575c0 R12: 0000000000000000
> > > > > > [ 16.162444] R13: ff1b2669800575c0 R14: dead000000000100 R15: ffb6d4b7846be410
> > > > > > [ 16.162445] FS: 00007f5fdccc23c0(0000) GS:ff1b267902427000(0000) knlGS:0000000000000000
> > > > > > [ 16.162446] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > > [ 16.162446] CR2: 0000559824c6c058 CR3: 000000011fb49001 CR4: 0000000000f71ef0
> > > > > > [ 16.162447] PKRU: 55555554
> > > > > > [ 16.162448] Call Trace:
> > > > > > [ 16.162450] <TASK>
> > > > > > [ 16.162452] kmem_cache_free+0x410/0x490
> > > > > > [ 16.162454] do_readlinkat+0x14e/0x180
> > > > > > [ 16.162459] __x64_sys_readlinkat+0x1c/0x30
> > > > > > [ 16.162461] do_syscall_64+0x7e/0x6b0
> > > > > > [ 16.162465] ? post_alloc_hook+0xb9/0x140
> > > > > > [ 16.162468] ? get_page_from_freelist+0x478/0x720
> > > > > > [ 16.162470] ? path_openat+0xb3/0x2a0
> > > > > > [ 16.162472] ? __alloc_frozen_pages_noprof+0x192/0x350
> > > > > > [ 16.162474] ? count_memcg_events+0xd6/0x210
> > > > > > [ 16.162476] ? memcg1_commit_charge+0x7a/0xa0
> > > > > > [ 16.162479] ? mod_memcg_lruvec_state+0xe7/0x2d0
> > > > > > [ 16.162481] ? charge_memcg+0x48/0x80
> > > > > > [ 16.162482] ? lruvec_stat_mod_folio+0x85/0xd0
> > > > > > [ 16.162484] ? __folio_mod_stat+0x2d/0x90
> > > > > > [ 16.162487] ? set_ptes.isra.0+0x36/0x80
> > > > > > [ 16.162490] ? do_anonymous_page+0x100/0x4a0
> > > > > > [ 16.162492] ? __handle_mm_fault+0x45d/0x6f0
> > > > > > [ 16.162493] ? count_memcg_events+0xd6/0x210
> > > > > > [ 16.162494] ? handle_mm_fault+0x212/0x340
> > > > > > [ 16.162495] ? do_user_addr_fault+0x2b4/0x7b0
> > > > > > [ 16.162500] ? irqentry_exit+0x6d/0x540
> > > > > > [ 16.162502] ? exc_page_fault+0x7e/0x1a0
> > > > > > [ 16.162503] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > > >
> > > > > For this problem, I have a hypothesis which is inspired by a comment in the
> > > > > patch "slab: remove cpu (partial) slabs usage from allocation paths":
> > > > >
> > > > > /*
> > > > > * get a single object from the slab. This might race against __slab_free(),
> > > > > * which however has to take the list_lock if it's about to make the slab fully
> > > > > * free.
> > > > > */
> > > > >
> > > > > My understanding is that this comment is pointing out a possible race between
> > > > > __slab_free() and get_from_partial_node(). Since __slab_free() takes
> > > > > n->list_lock when it is about to make the slab fully free, and
> > > > > get_from_partial_node() also takes the same lock, the two paths should be
> > > > > mutually excluded by the lock and thus safe.
> > > > >
> > > > > However, I'm wondering if there could be another race window. Suppose CPU0's
> > > > > get_from_partial_node() has already finished __slab_update_freelist(), but has
> > > > > not yet reached remove_partial(). In that gap, another CPU1 could free an object
> > > > > to the same slab via __slab_free(). CPU1 would observe was_full == 1 (due to the
> > > > > previous get_from_partial_node()->__slab_update_freelist() on CPU0), and then
> > > > >
> > > > > __slab_free() will call put_cpu_partial(s, slab, 1) without holding
> > > > > n->list_lock, trying to add this slab to the CPU partial list.
> > > >
> > > > If CPU1 observes was_full == 1, it should spin on n->list_lock and wait
> > > > for CPU0 to release the lock. And CPU0 will remove the slab from the
> > > > partial list before releasing the lock. Or am I missing something?
> > > >
> > > > > In that case,
> > > > > both paths would operate on the same union field in struct slab, which might
> > > > > lead to list corruption.
> > > >
> > > > Not sure how the scenario you describe could happen:
> > > >
> > > > CPU 0 CPU1
> > > > - get_from_partial_node()
> > > > -> spin_lock(&n->list_lock)
> > > > - __slab_free()
> > > > -> __slab_update_freelist(),
> > > > slab becomes full
> > > > -> was_full == 1
> > > > -> spin_lock(&n->list_lock)
> > >
> > > In __slab_free, if was_full == 1, then the condition
> > > !(IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) && was_full) becomes false, so it won't
> > > enter the "if" block and therefore n->list_lock is not acquired.
> > > Does that sound right?
> >
> > Nah, you're right. Just slipped my mind. No need to acquire the lock
> > if it was full, because that means it's not on the partial list.
>
> Exactly.
>
> >
> > Hmm... but the logic has been there for a very long time.
>
> Yes.
>
> >
> > Looks like we broke a premise for the percpu slab caching layer
> > to work correctly, while transitioning to sheaves.
> >
> > I think the new behavior introduced during the sheaves transition is that
> > SLUB can now allocate objects from a slab without freezing it. Allocating
> > objects from a slab without freezing it seems to confuse the free path...
>
> I feel it's not a big issue.
>
> I think the root cause of this issue is as follows:
>
> Before this commit, get_partial_node would first remove the slab from the node
> list and then return the slab to the upper layer for freezing and object
> allocation. Therefore, when __slab_free encounters a slab marked as was_full,
> that slab would no longer be on the node list, avoiding race conditions with
> list operations.
Right, that's an important point. Just realized that while elaborating
the analysis :), there was a race condition between you and me!
> However, after this commit, get_from_partial_node first allocates an object
> from the slab and then removes the slab from the node list.
Right.
> During the
> interval between these two steps, __slab_free might encounter a slab marked as
> was_full and then it wants to add the slab to the CPU partial list,
Right.
> while at the same time, another process is trying to remove the same slab
> from the node list, leading to a race condition.
Exactly.
> > But not sure if we could "fix" that because the percpu partial slab
> > caching layer is gone anyway :)
>
> Yes, this bug has already disappeared with subsequent patches...
>
> By the way, to allow Ming Lei to continue the bisect process, maybe we should
> come up with a temporary workaround, such as:
>
> } else if (IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) && was_full) {
> spin_lock_irqsave(&n->list_lock, flags);
> /*
> * Let this empty critical section push back put_cpu_partial, ensuring
> * its execution happens after the critical section of
> * get_from_partial_node running in parallel.
> */
> spin_unlock_irqrestore(&n->list_lock, flags);
> /*
> * If we started with a full slab then put it onto the
> * per cpu partial list.
> */
> put_cpu_partial(s, slab, 1);
> stat(s, CPU_PARTIAL_FREE);
> }
Hmm but if that affects the performance (by always acquiring
n->list_lock), the result is probably not valid anyway.
I'd rather bet that Vlastimil's analysis is correct :)
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-24 20:27 ` Vlastimil Babka
2026-02-25 5:24 ` Harry Yoo
@ 2026-02-25 8:45 ` Vlastimil Babka (SUSE)
2026-02-25 9:31 ` Ming Lei
1 sibling, 1 reply; 25+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-02-25 8:45 UTC (permalink / raw)
To: Vlastimil Babka, Ming Lei, Andrew Morton
Cc: linux-mm, linux-kernel, linux-block, Harry Yoo, Hao Li,
Christoph Hellwig
On 2/24/26 21:27, Vlastimil Babka wrote:
>
> It made sense to me not to refill sheaves when we can't reclaim, but I
> didn't anticipate this interaction with mempools. We could change them
> but there might be others using a similar pattern. Maybe it would be for
> the best to just drop that heuristic from __pcs_replace_empty_main()
> (but carefully as some deadlock avoidance depends on it, we might need
> to e.g. replace it with gfpflags_allow_spinning()). I'll send a patch
> tomorrow to test this theory, unless someone beats me to it (feel free to).
Could you try this then, please? Thanks!
----8<----
From b04dad02eb72feb1736241518dd4d3dd64aadc0e Mon Sep 17 00:00:00 2001
From: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
Date: Wed, 25 Feb 2026 09:40:22 +0100
Subject: [PATCH] mm/slab: allow sheaf refill if blocking is not allowed
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
mm/slub.c | 21 +++++++++------------
1 file changed, 9 insertions(+), 12 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 862642c165ed..258307270442 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4526,7 +4526,7 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
struct slab_sheaf *empty = NULL;
struct slab_sheaf *full;
struct node_barn *barn;
- bool can_alloc;
+ bool allow_spin;
lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
@@ -4547,8 +4547,9 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
return NULL;
}
- full = barn_replace_empty_sheaf(barn, pcs->main,
- gfpflags_allow_spinning(gfp));
+ allow_spin = gfpflags_allow_spinning(gfp);
+
+ full = barn_replace_empty_sheaf(barn, pcs->main, allow_spin);
if (full) {
stat(s, BARN_GET);
@@ -4558,9 +4559,7 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
stat(s, BARN_GET_FAIL);
- can_alloc = gfpflags_allow_blocking(gfp);
-
- if (can_alloc) {
+ if (allow_spin) {
if (pcs->spare) {
empty = pcs->spare;
pcs->spare = NULL;
@@ -4571,7 +4570,7 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
local_unlock(&s->cpu_sheaves->lock);
- if (!can_alloc)
+ if (!allow_spin)
return NULL;
if (empty) {
@@ -4591,11 +4590,8 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
if (!full)
return NULL;
- /*
- * we can reach here only when gfpflags_allow_blocking
- * so this must not be an irq
- */
- local_lock(&s->cpu_sheaves->lock);
+ if (!local_trylock(&s->cpu_sheaves->lock))
+ goto barn_put;
pcs = this_cpu_ptr(s->cpu_sheaves);
/*
@@ -4626,6 +4622,7 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
return pcs;
}
+barn_put:
barn_put_full_sheaf(barn, full);
stat(s, BARN_PUT);
--
2.53.0
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-25 8:41 ` Harry Yoo
@ 2026-02-25 8:54 ` Hao Li
0 siblings, 0 replies; 25+ messages in thread
From: Hao Li @ 2026-02-25 8:54 UTC (permalink / raw)
To: Harry Yoo
Cc: Ming Lei, Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, surenb
On Wed, Feb 25, 2026 at 05:41:15PM +0900, Harry Yoo wrote:
> On Wed, Feb 25, 2026 at 04:19:49PM +0800, Hao Li wrote:
> > On Wed, Feb 25, 2026 at 04:19:41PM +0900, Harry Yoo wrote:
> > > On Wed, Feb 25, 2026 at 03:06:46PM +0800, Hao Li wrote:
> > > > On Wed, Feb 25, 2026 at 03:54:06PM +0900, Harry Yoo wrote:
> > > > > On Wed, Feb 25, 2026 at 01:32:36PM +0800, Hao Li wrote:
> > > > > > On Tue, Feb 24, 2026 at 05:07:18PM +0800, Ming Lei wrote:
> > > > > > > Hi Harry,
> > > > > > >
> > > > > > > On Tue, Feb 24, 2026 at 02:00:15PM +0900, Harry Yoo wrote:
> > > > > > > > On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> > > > > > > > > Hello Vlastimil and MM guys,
> > > > > > > >
> > > > > > > > Hi Ming, thanks for the report!
> > > > > > > >
> > > > > > > > > The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
> > > > > > > > > 'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
> > > > > > > > > performance regression for workloads with persistent cross-CPU
> > > > > > > > > alloc/free patterns. ublk null target benchmark IOPS drops
> > > > > > > > > significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
> > > > > > > > > drop).
> > > > > > > > >
> > > > > > > > > Bisecting within the sheaves series is blocked by a kernel panic at
> > > > > > > > > 17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
> > > > > > > > > paths"), so the exact first bad commit could not be identified.
> > > > > > > >
> > > > > > > > Ouch. Why did it crash?
> > > > > > >
> > > > > > > [ 16.162422] Oops: general protection fault, probably for non-canonical address 0xdead000000000110: 0000 [#1] SMP NOPTI
> > > > > > > [ 16.162426] CPU: 44 UID: 0 PID: 908 Comm: (udev-worker) Not tainted 6.19.0-rc5_master+ #116 PREEMPT(lazy)
> > > > > > > [ 16.162429] Hardware name: Giga Computing MZ73-LM2-000/MZ73-LM2-000, BIOS R19_F40 05/12/2025
> > > > > > > [ 16.162430] RIP: 0010:__put_partials+0x2f/0x140
> > > > > > > [ 16.162437] Code: 41 57 41 56 49 89 f6 41 55 49 89 fd 31 ff 41 54 45 31 e4 55 53 48 83 ec 18 48 c7 44 24 10 00 00 00 00 eb 03 48 89 df 4c9
> > > > > > > [ 16.162438] RSP: 0018:ff5117c0ca2dfa60 EFLAGS: 00010086
> > > > > > > [ 16.162441] RAX: 0000000000000001 RBX: ff1b266981200d80 RCX: 0000000000000246
> > > > > > > [ 16.162442] RDX: ff1b266981200d90 RSI: ff1b266981200d90 RDI: ff1b266981200d80
> > > > > > > [ 16.162442] RBP: dead000000000100 R08: 0000000000000000 R09: ffffffffa761bf5e
> > > > > > > [ 16.162443] R10: ffb6d4b7841d5400 R11: ff1b2669800575c0 R12: 0000000000000000
> > > > > > > [ 16.162444] R13: ff1b2669800575c0 R14: dead000000000100 R15: ffb6d4b7846be410
> > > > > > > [ 16.162445] FS: 00007f5fdccc23c0(0000) GS:ff1b267902427000(0000) knlGS:0000000000000000
> > > > > > > [ 16.162446] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > > > [ 16.162446] CR2: 0000559824c6c058 CR3: 000000011fb49001 CR4: 0000000000f71ef0
> > > > > > > [ 16.162447] PKRU: 55555554
> > > > > > > [ 16.162448] Call Trace:
> > > > > > > [ 16.162450] <TASK>
> > > > > > > [ 16.162452] kmem_cache_free+0x410/0x490
> > > > > > > [ 16.162454] do_readlinkat+0x14e/0x180
> > > > > > > [ 16.162459] __x64_sys_readlinkat+0x1c/0x30
> > > > > > > [ 16.162461] do_syscall_64+0x7e/0x6b0
> > > > > > > [ 16.162465] ? post_alloc_hook+0xb9/0x140
> > > > > > > [ 16.162468] ? get_page_from_freelist+0x478/0x720
> > > > > > > [ 16.162470] ? path_openat+0xb3/0x2a0
> > > > > > > [ 16.162472] ? __alloc_frozen_pages_noprof+0x192/0x350
> > > > > > > [ 16.162474] ? count_memcg_events+0xd6/0x210
> > > > > > > [ 16.162476] ? memcg1_commit_charge+0x7a/0xa0
> > > > > > > [ 16.162479] ? mod_memcg_lruvec_state+0xe7/0x2d0
> > > > > > > [ 16.162481] ? charge_memcg+0x48/0x80
> > > > > > > [ 16.162482] ? lruvec_stat_mod_folio+0x85/0xd0
> > > > > > > [ 16.162484] ? __folio_mod_stat+0x2d/0x90
> > > > > > > [ 16.162487] ? set_ptes.isra.0+0x36/0x80
> > > > > > > [ 16.162490] ? do_anonymous_page+0x100/0x4a0
> > > > > > > [ 16.162492] ? __handle_mm_fault+0x45d/0x6f0
> > > > > > > [ 16.162493] ? count_memcg_events+0xd6/0x210
> > > > > > > [ 16.162494] ? handle_mm_fault+0x212/0x340
> > > > > > > [ 16.162495] ? do_user_addr_fault+0x2b4/0x7b0
> > > > > > > [ 16.162500] ? irqentry_exit+0x6d/0x540
> > > > > > > [ 16.162502] ? exc_page_fault+0x7e/0x1a0
> > > > > > > [ 16.162503] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > > > >
> > > > > > For this problem, I have a hypothesis which is inspired by a comment in the
> > > > > > patch "slab: remove cpu (partial) slabs usage from allocation paths":
> > > > > >
> > > > > > /*
> > > > > > * get a single object from the slab. This might race against __slab_free(),
> > > > > > * which however has to take the list_lock if it's about to make the slab fully
> > > > > > * free.
> > > > > > */
> > > > > >
> > > > > > My understanding is that this comment is pointing out a possible race between
> > > > > > __slab_free() and get_from_partial_node(). Since __slab_free() takes
> > > > > > n->list_lock when it is about to make the slab fully free, and
> > > > > > get_from_partial_node() also takes the same lock, the two paths should be
> > > > > > mutually excluded by the lock and thus safe.
> > > > > >
> > > > > > However, I'm wondering if there could be another race window. Suppose CPU0's
> > > > > > get_from_partial_node() has already finished __slab_update_freelist(), but has
> > > > > > not yet reached remove_partial(). In that gap, another CPU1 could free an object
> > > > > > to the same slab via __slab_free(). CPU1 would observe was_full == 1 (due to the
> > > > > > previous get_from_partial_node()->__slab_update_freelist() on CPU0), and then
> > > > > >
> > > > > > __slab_free() will call put_cpu_partial(s, slab, 1) without holding
> > > > > > n->list_lock, trying to add this slab to the CPU partial list.
> > > > >
> > > > > If CPU1 observes was_full == 1, it should spin on n->list_lock and wait
> > > > > for CPU0 to release the lock. And CPU0 will remove the slab from the
> > > > > partial list before releasing the lock. Or am I missing something?
> > > > >
> > > > > > In that case,
> > > > > > both paths would operate on the same union field in struct slab, which might
> > > > > > lead to list corruption.
> > > > >
> > > > > Not sure how the scenario you describe could happen:
> > > > >
> > > > > CPU 0 CPU1
> > > > > - get_from_partial_node()
> > > > > -> spin_lock(&n->list_lock)
> > > > > - __slab_free()
> > > > > -> __slab_update_freelist(),
> > > > > slab becomes full
> > > > > -> was_full == 1
> > > > > -> spin_lock(&n->list_lock)
> > > >
> > > > In __slab_free, if was_full == 1, then the condition
> > > > !(IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) && was_full) becomes false, so it won't
> > > > enter the "if" block and therefore n->list_lock is not acquired.
> > > > Does that sound right?
> > >
> > > Nah, you're right. Just slipped my mind. No need to acquire the lock
> > > if it was full, because that means it's not on the partial list.
> >
> > Exactly.
> >
> > >
> > > Hmm... but the logic has been there for a very long time.
> >
> > Yes.
> >
> > >
> > > Looks like we broke a premise for the percpu slab caching layer
> > > to work correctly, while transitioning to sheaves.
> > >
> > > I think the new behavior introduced during the sheaves transition is that
> > > SLUB can now allocate objects from a slab without freezing it. Allocating
> > > objects from a slab without freezing it seems to confuse the free path...
> >
> > I feel it's not a big issue.
> >
> > I think the root cause of this issue is as follows:
> >
> > Before this commit, get_partial_node would first remove the slab from the node
> > list and then return the slab to the upper layer for freezing and object
> > allocation. Therefore, when __slab_free encounters a slab marked as was_full,
> > that slab would no longer be on the node list, avoiding race conditions with
> > list operations.
>
> Right, that's an important point. Just realized that while elaborating
> the analysis :), there was a race condition between you and me!
Haha, true race condition - we both sent emails within a minute :D
>
> > However, after this commit, get_from_partial_node first allocates an object
> > from the slab and then removes the slab from the node list.
>
> Right.
>
> > During the
> > interval between these two steps, __slab_free might encounter a slab marked as
> > was_full and then it wants to add the slab to the CPU partial list,
>
> Right.
>
> > while at the same time, another process is trying to remove the same slab
> > from the node list, leading to a race condition.
>
> Exactly.
>
> > > But not sure if we could "fix" that because the percpu partial slab
> > > caching layer is gone anyway :)
> >
> > Yes, this bug has already disappeared with subsequent patches...
> >
> > By the way, to allow Ming Lei to continue the bisect process, maybe we should
> > come up with a temporary workaround, such as:
> >
> > } else if (IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) && was_full) {
> > spin_lock_irqsave(&n->list_lock, flags);
> > /*
> > * Let this empty critical section push back put_cpu_partial, ensuring
> > * its execution happens after the critical section of
> > * get_from_partial_node running in parallel.
> > */
> > spin_unlock_irqrestore(&n->list_lock, flags);
> > /*
> > * If we started with a full slab then put it onto the
> > * per cpu partial list.
> > */
> > put_cpu_partial(s, slab, 1);
> > stat(s, CPU_PARTIAL_FREE);
> > }
>
> Hmm but if that affects the performance (by always acquiring
> n->list_lock), the result is probably not valid anyway.
>
> I'd rather bet that Vlastimil's analysis is correct :)
Indeed.
Let's look forward to the test results for Vlastimil's patch!
--
Thanks,
Hao
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-25 8:45 ` Vlastimil Babka (SUSE)
@ 2026-02-25 9:31 ` Ming Lei
2026-02-25 11:29 ` Vlastimil Babka (SUSE)
2026-02-26 18:02 ` Vlastimil Babka (SUSE)
0 siblings, 2 replies; 25+ messages in thread
From: Ming Lei @ 2026-02-25 9:31 UTC (permalink / raw)
To: Vlastimil Babka (SUSE)
Cc: Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, Harry Yoo, Hao Li, Christoph Hellwig
Hi Vlastimil,
On Wed, Feb 25, 2026 at 09:45:03AM +0100, Vlastimil Babka (SUSE) wrote:
> On 2/24/26 21:27, Vlastimil Babka wrote:
> >
> > It made sense to me not to refill sheaves when we can't reclaim, but I
> > didn't anticipate this interaction with mempools. We could change them
> > but there might be others using a similar pattern. Maybe it would be for
> > the best to just drop that heuristic from __pcs_replace_empty_main()
> > (but carefully as some deadlock avoidance depends on it, we might need
> > to e.g. replace it with gfpflags_allow_spinning()). I'll send a patch
> > tomorrow to test this theory, unless someone beats me to it (feel free to).
> Could you try this then, please? Thanks!
Thanks for working on this issue!
Unfortunately the patch doesn't make a difference in IOPS in the perf test;
the collected perf profile on the linus tree (basically 7.0-rc1 with your patch) follows:
```
04cb971e2d28 (HEAD -> master) mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
a5a9cf3f020f mm: fix NULL NODE_DATA dereference for memoryless nodes on boot
7dff99b35460 (origin/master) Remove WARN_ALL_UNSEEDED_RANDOM kernel config option
551d44200152 default_gfp(): avoid using the "newfangled" __VA_OPT__ trick
6de23f81a5e0 (tag: v7.0-rc1) Linux 7.0-rc1
```
+ 49.03% 2.00% io_uring [kernel.kallsyms] [k] __blkdev_direct_IO_async
- 38.66% 1.16% io_uring [kernel.kallsyms] [k] bio_alloc_bioset
- 37.51% bio_alloc_bioset
- 34.98% mempool_alloc_noprof
- 34.87% kmem_cache_alloc_noprof
- 33.82% ___slab_alloc
- 30.25% get_from_any_partial
- 29.59% get_from_partial_node
- 28.42% __raw_spin_lock_irqsave
native_queued_spin_lock_slowpath
+ 2.16% allocate_slab
+ 0.60% alloc_from_new_slab
0.51% __pcs_replace_empty_main
1.58% bio_associate_blkg
+ 1.16% submitter_uring_fn
+ 35.16% 0.30% io_uring [kernel.kallsyms] [k] kmem_cache_alloc_noprof
+ 35.13% 0.12% io_uring [kernel.kallsyms] [k] mempool_alloc_noprof
Thanks,
Ming
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-25 9:31 ` Ming Lei
@ 2026-02-25 11:29 ` Vlastimil Babka (SUSE)
2026-02-25 12:24 ` Ming Lei
2026-02-26 18:02 ` Vlastimil Babka (SUSE)
1 sibling, 1 reply; 25+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-02-25 11:29 UTC (permalink / raw)
To: Ming Lei
Cc: Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, Harry Yoo, Hao Li, Christoph Hellwig
On 2/25/26 10:31, Ming Lei wrote:
> Hi Vlastimil,
>
> On Wed, Feb 25, 2026 at 09:45:03AM +0100, Vlastimil Babka (SUSE) wrote:
>> On 2/24/26 21:27, Vlastimil Babka wrote:
>> >
>> > It made sense to me not to refill sheaves when we can't reclaim, but I
>> > didn't anticipate this interaction with mempools. We could change them
>> > but there might be others using a similar pattern. Maybe it would be for
>> > the best to just drop that heuristic from __pcs_replace_empty_main()
>> > (but carefully as some deadlock avoidance depends on it, we might need
>> > to e.g. replace it with gfpflags_allow_spinning()). I'll send a patch
>> > tomorrow to test this theory, unless someone beats me to it (feel free to).
>> Could you try this then, please? Thanks!
>
> Thanks for working on this issue!
>
> Unfortunately the patch doesn't make a difference in IOPS in the perf test;
> the collected perf profile on the linus tree (basically 7.0-rc1 with your patch) follows:
Hm, that's weird; the slowpath is still prominent in your profile.
I followed your reproducer instructions, although only with a small
virtme-ng based setup. What's the output of "numactl -H" on yours, btw?
Anyway what I saw is my patch raised the IOPS substantially, and with
CONFIG_SLUB_STATS=y enabled I could see that
/sys/kernel/slab/bio-248/alloc_slowpath had substantial values before the
patch and zero afterwards.
Maybe if you could also enable CONFIG_SLUB_STATS=y and see in which cache(s)
there's significant alloc_slowpath even after the patch, it could help.
Thanks!
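A small POSIX shell sketch of that survey. The sysfs directory is taken as
a parameter so the function can also be exercised against a mock tree; the
real path, available with CONFIG_SLUB_STATS=y, is /sys/kernel/slab:

```shell
# List SLUB caches with a nonzero alloc_slowpath count, highest first.
# The first whitespace-separated field of each alloc_slowpath file is
# the total across all CPUs; the per-CPU breakdown follows it.
slab_slowpath_report() {
	for f in "$1"/*/alloc_slowpath; do
		[ -r "$f" ] || continue
		count=$(cut -d' ' -f1 "$f")
		[ "$count" -gt 0 ] && \
			printf '%s %s\n' "$count" "$(basename "${f%/alloc_slowpath}")"
	done | sort -rn
}
```

e.g. `slab_slowpath_report /sys/kernel/slab | head` after a benchmark run.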
> ```
> 04cb971e2d28 (HEAD -> master) mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
> a5a9cf3f020f mm: fix NULL NODE_DATA dereference for memoryless nodes on boot
> 7dff99b35460 (origin/master) Remove WARN_ALL_UNSEEDED_RANDOM kernel config option
> 551d44200152 default_gfp(): avoid using the "newfangled" __VA_OPT__ trick
> 6de23f81a5e0 (tag: v7.0-rc1) Linux 7.0-rc1
> ```
>
> + 49.03% 2.00% io_uring [kernel.kallsyms] [k] __blkdev_direct_IO_async
> - 38.66% 1.16% io_uring [kernel.kallsyms] [k] bio_alloc_bioset
> - 37.51% bio_alloc_bioset
> - 34.98% mempool_alloc_noprof
> - 34.87% kmem_cache_alloc_noprof
> - 33.82% ___slab_alloc
> - 30.25% get_from_any_partial
> - 29.59% get_from_partial_node
> - 28.42% __raw_spin_lock_irqsave
> native_queued_spin_lock_slowpath
> + 2.16% allocate_slab
> + 0.60% alloc_from_new_slab
> 0.51% __pcs_replace_empty_main
> 1.58% bio_associate_blkg
> + 1.16% submitter_uring_fn
> + 35.16% 0.30% io_uring [kernel.kallsyms] [k] kmem_cache_alloc_noprof
> + 35.13% 0.12% io_uring [kernel.kallsyms] [k] mempool_alloc_noprof
>
>
> Thanks,
> Ming
>
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-25 11:29 ` Vlastimil Babka (SUSE)
@ 2026-02-25 12:24 ` Ming Lei
2026-02-25 13:22 ` Vlastimil Babka (SUSE)
0 siblings, 1 reply; 25+ messages in thread
From: Ming Lei @ 2026-02-25 12:24 UTC (permalink / raw)
To: Vlastimil Babka (SUSE)
Cc: Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, Harry Yoo, Hao Li, Christoph Hellwig
[-- Attachment #1: Type: text/plain, Size: 3057 bytes --]
On Wed, Feb 25, 2026 at 12:29:26PM +0100, Vlastimil Babka (SUSE) wrote:
> On 2/25/26 10:31, Ming Lei wrote:
> > Hi Vlastimil,
> >
> > On Wed, Feb 25, 2026 at 09:45:03AM +0100, Vlastimil Babka (SUSE) wrote:
> >> On 2/24/26 21:27, Vlastimil Babka wrote:
> >> >
> >> > It made sense to me not to refill sheaves when we can't reclaim, but I
> >> > didn't anticipate this interaction with mempools. We could change them
> >> > but there might be others using a similar pattern. Maybe it would be for
> >> > the best to just drop that heuristic from __pcs_replace_empty_main()
> >> > (but carefully as some deadlock avoidance depends on it, we might need
> >> > to e.g. replace it with gfpflags_allow_spinning()). I'll send a patch
> >> > tomorrow to test this theory, unless someone beats me to it (feel free to).
> >> Could you try this then, please? Thanks!
> >
> > Thanks for working on this issue!
> >
> > Unfortunately the patch doesn't make a difference in IOPS in the perf test;
> > the collected perf profile on the linus tree (basically 7.0-rc1 with your patch) follows:
>
>> Hm, that's weird; the slowpath is still prominent in your profile.
>
> I followed your reproducer instructions, although only with a small
> virtme-ng based setup. What's the output of "numactl -H" on yours, btw?
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 32 33 34 35
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 4 5 6 7 36 37 38 39
node 1 size: 31906 MB
node 1 free: 30572 MB
node 2 cpus: 8 9 10 11 40 41 42 43
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 12 13 14 15 44 45 46 47
node 3 size: 0 MB
node 3 free: 0 MB
node 4 cpus: 16 17 18 19 48 49 50 51
node 4 size: 0 MB
node 4 free: 0 MB
node 5 cpus: 20 21 22 23 52 53 54 55
node 5 size: 32135 MB
node 5 free: 31086 MB
node 6 cpus: 24 25 26 27 56 57 58 59
node 6 size: 0 MB
node 6 free: 0 MB
node 7 cpus: 28 29 30 31 60 61 62 63
node 7 size: 0 MB
node 7 free: 0 MB
node distances:
node 0 1 2 3 4 5 6 7
0: 10 12 12 12 32 32 32 32
1: 12 10 12 12 32 32 32 32
2: 12 12 10 12 32 32 32 32
3: 12 12 12 10 32 32 32 32
4: 32 32 32 32 10 12 12 12
5: 32 32 32 32 12 10 12 12
6: 32 32 32 32 12 12 10 12
7: 32 32 32 32 12 12 12 10
>
> Anyway what I saw is my patch raised the IOPS substantially, and with
> CONFIG_SLUB_STATS=y enabled I could see that
> /sys/kernel/slab/bio-248/alloc_slowpath had substantial values before the
> patch and zero afterwards.
>
> Maybe if you could also enable CONFIG_SLUB_STATS=y and see in which cache(s)
> there's significant alloc_slowpath even after the patch, it could help.
Patched:
/sys/kernel/slab/bio-264
./alloc_slowpath:83555260 C0=33 C1=6717992 C2=9 C3=6611030 C8=128 C9=6802316 C11=6934363 C13=6721479 C14=66 C15=6694472 C16=96 C17=7286868 C18=128 C19=7369091 C24=128 C25=7288673 C26=51 C27=6800502 C28=129 C29=7095073 C31=7232628 C43=4 C56=1
Also config.tar.gz is attached.
Thanks,
Ming
[-- Attachment #2: config.tar.gz --]
[-- Type: application/gzip, Size: 42945 bytes --]
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-25 12:24 ` Ming Lei
@ 2026-02-25 13:22 ` Vlastimil Babka (SUSE)
0 siblings, 0 replies; 25+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-02-25 13:22 UTC (permalink / raw)
To: Ming Lei
Cc: Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, Harry Yoo, Hao Li, Christoph Hellwig
On 2/25/26 13:24, Ming Lei wrote:
> On Wed, Feb 25, 2026 at 12:29:26PM +0100, Vlastimil Babka (SUSE) wrote:
>> On 2/25/26 10:31, Ming Lei wrote:
>> > Hi Vlastimil,
>> >
>> > On Wed, Feb 25, 2026 at 09:45:03AM +0100, Vlastimil Babka (SUSE) wrote:
>> >> On 2/24/26 21:27, Vlastimil Babka wrote:
>> >> >
>> >> > It made sense to me not to refill sheaves when we can't reclaim, but I
>> >> > didn't anticipate this interaction with mempools. We could change them
>> >> > but there might be others using a similar pattern. Maybe it would be for
>> >> > the best to just drop that heuristic from __pcs_replace_empty_main()
>> >> > (but carefully as some deadlock avoidance depends on it, we might need
>> >> > to e.g. replace it with gfpflags_allow_spinning()). I'll send a patch
>> >> > tomorrow to test this theory, unless someone beats me to it (feel free to).
>> >> Could you try this then, please? Thanks!
>> >
>> > Thanks for working on this issue!
>> >
>> > Unfortunately the patch doesn't make a difference in IOPS in the perf test;
>> > the collected perf profile on the linus tree (basically 7.0-rc1 with your patch) follows:
>>
>> Hm, that's weird; the slowpath is still prominent in your profile.
>>
>> I followed your reproducer instructions, although only with a small
>> virtme-ng based setup. What's the output of "numactl -H" on yours, btw?
>
> available: 8 nodes (0-7)
> node 0 cpus: 0 1 2 3 32 33 34 35
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 1 cpus: 4 5 6 7 36 37 38 39
> node 1 size: 31906 MB
> node 1 free: 30572 MB
> node 2 cpus: 8 9 10 11 40 41 42 43
> node 2 size: 0 MB
> node 2 free: 0 MB
> node 3 cpus: 12 13 14 15 44 45 46 47
> node 3 size: 0 MB
> node 3 free: 0 MB
> node 4 cpus: 16 17 18 19 48 49 50 51
> node 4 size: 0 MB
> node 4 free: 0 MB
> node 5 cpus: 20 21 22 23 52 53 54 55
> node 5 size: 32135 MB
> node 5 free: 31086 MB
> node 6 cpus: 24 25 26 27 56 57 58 59
> node 6 size: 0 MB
> node 6 free: 0 MB
> node 7 cpus: 28 29 30 31 60 61 62 63
> node 7 size: 0 MB
> node 7 free: 0 MB
> node distances:
> node 0 1 2 3 4 5 6 7
> 0: 10 12 12 12 32 32 32 32
> 1: 12 10 12 12 32 32 32 32
> 2: 12 12 10 12 32 32 32 32
> 3: 12 12 12 10 32 32 32 32
> 4: 32 32 32 32 10 12 12 12
> 5: 32 32 32 32 12 10 12 12
> 6: 32 32 32 32 12 12 10 12
> 7: 32 32 32 32 12 12 12 10
Oh right, memory-less nodes, of course. Always so much fun.
>>
>> Anyway what I saw is my patch raised the IOPS substantially, and with
>> CONFIG_SLUB_STATS=y enabled I could see that
>> /sys/kernel/slab/bio-248/alloc_slowpath had substantial values before the
>> patch and zero afterwards.
>>
>> Maybe if you could also enable CONFIG_SLUB_STATS=y and see in which cache(s)
>> there's significant alloc_slowpath even after the patch, it could help.
>
> Patched:
>
> /sys/kernel/slab/bio-264
> ./alloc_slowpath:83555260 C0=33 C1=6717992 C2=9 C3=6611030 C8=128 C9=6802316 C11=6934363 C13=6721479 C14=66 C15=6694472 C16=96 C17=7286868 C18=128 C19=7369091 C24=128 C25=7288673 C26=51 C27=6800502 C28=129 C29=7095073 C31=7232628 C43=4 C56=1
Yeah, no slowpath allocations from CPUs that are *not* on a memoryless node.
Thanks, that will help focus what to look at.
>
> Also config.tar.gz is attached.
>
> Thanks,
> Ming
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-25 9:31 ` Ming Lei
2026-02-25 11:29 ` Vlastimil Babka (SUSE)
@ 2026-02-26 18:02 ` Vlastimil Babka (SUSE)
2026-02-27 9:23 ` Ming Lei
1 sibling, 1 reply; 25+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-02-26 18:02 UTC (permalink / raw)
To: Ming Lei
Cc: Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, Harry Yoo, Hao Li, Christoph Hellwig
On 2/25/26 10:31, Ming Lei wrote:
> Hi Vlastimil,
>
> On Wed, Feb 25, 2026 at 09:45:03AM +0100, Vlastimil Babka (SUSE) wrote:
>> On 2/24/26 21:27, Vlastimil Babka wrote:
>> >
>> > It made sense to me not to refill sheaves when we can't reclaim, but I
>> > didn't anticipate this interaction with mempools. We could change them
>> > but there might be others using a similar pattern. Maybe it would be for
>> > the best to just drop that heuristic from __pcs_replace_empty_main()
>> > (but carefully as some deadlock avoidance depends on it, we might need
>> > to e.g. replace it with gfpflags_allow_spinning()). I'll send a patch
>> > tomorrow to test this theory, unless someone beats me to it (feel free to).
>> Could you try this then, please? Thanks!
>
> Thanks for working on this issue!
>
> Unfortunately the patch doesn't make a difference on IOPS in the perf test,
> follows the collected perf profile on linus tree(basically 7.0-rc1 with your patch):
what about this patch in addition to the previous one? Thanks.
----8<----
From d3e8118c078996d1372a9f89285179d93971fdb2 Mon Sep 17 00:00:00 2001
From: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
Date: Thu, 26 Feb 2026 18:59:56 +0100
Subject: [PATCH] mm/slab: put barn on every online node
Including memoryless nodes.
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
mm/slab.h | 7 ++-
mm/slub.c | 146 ++++++++++++++++++++++++++++++++----------------------
2 files changed, 94 insertions(+), 59 deletions(-)
diff --git a/mm/slab.h b/mm/slab.h
index 71c7261bf822..5b5e3ed6adae 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -191,6 +191,11 @@ struct kmem_cache_order_objects {
unsigned int x;
};
+struct kmem_cache_per_node_ptrs {
+ struct node_barn *barn;
+ struct kmem_cache_node *node;
+};
+
/*
* Slab cache management.
*/
@@ -247,7 +252,7 @@ struct kmem_cache {
struct kmem_cache_stats __percpu *cpu_stats;
#endif
- struct kmem_cache_node *node[MAX_NUMNODES];
+ struct kmem_cache_per_node_ptrs per_node[MAX_NUMNODES];
};
/*
diff --git a/mm/slub.c b/mm/slub.c
index 258307270442..24f1f12d6a37 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -59,7 +59,7 @@
* 0. cpu_hotplug_lock
* 1. slab_mutex (Global Mutex)
* 2a. kmem_cache->cpu_sheaves->lock (Local trylock)
- * 2b. node->barn->lock (Spinlock)
+ * 2b. barn->lock (Spinlock)
* 2c. node->list_lock (Spinlock)
* 3. slab_lock(slab) (Only on some arches)
* 4. object_map_lock (Only for debugging)
@@ -136,7 +136,7 @@
* or spare sheaf can handle the allocation or free, there is no other
* overhead.
*
- * node->barn->lock (spinlock)
+ * barn->lock (spinlock)
*
* This lock protects the operations on per-NUMA-node barn. It can quickly
* serve an empty or full sheaf if available, and avoid more expensive refill
@@ -436,26 +436,24 @@ struct kmem_cache_node {
atomic_long_t total_objects;
struct list_head full;
#endif
- struct node_barn *barn;
};
static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
{
- return s->node[node];
+ return s->per_node[node].node;
+}
+
+static inline struct node_barn *get_barn_node(struct kmem_cache *s, int node)
+{
+ return s->per_node[node].barn;
}
/*
- * Get the barn of the current cpu's closest memory node. It may not exist on
- * systems with memoryless nodes but without CONFIG_HAVE_MEMORYLESS_NODES
+ * Get the barn of the current cpu's memory node. It may be a memoryless node.
*/
static inline struct node_barn *get_barn(struct kmem_cache *s)
{
- struct kmem_cache_node *n = get_node(s, numa_mem_id());
-
- if (!n)
- return NULL;
-
- return n->barn;
+ return get_barn_node(s, numa_node_id());
}
/*
@@ -474,6 +472,12 @@ static inline struct node_barn *get_barn(struct kmem_cache *s)
*/
static nodemask_t slab_nodes;
+/*
+ * Similar to slab_nodes but for where we have node_barn allocated.
+ * Corresponds to N_ONLINE nodes.
+ */
+static nodemask_t slab_barn_nodes;
+
/*
* Workqueue used for flushing cpu and kfree_rcu sheaves.
*/
@@ -5744,7 +5748,6 @@ bool free_to_pcs(struct kmem_cache *s, void *object, bool allow_spin)
static void rcu_free_sheaf(struct rcu_head *head)
{
- struct kmem_cache_node *n;
struct slab_sheaf *sheaf;
struct node_barn *barn = NULL;
struct kmem_cache *s;
@@ -5767,12 +5770,10 @@ static void rcu_free_sheaf(struct rcu_head *head)
if (__rcu_free_sheaf_prepare(s, sheaf))
goto flush;
- n = get_node(s, sheaf->node);
- if (!n)
+ barn = get_barn_node(s, sheaf->node);
+ if (!barn)
goto flush;
- barn = n->barn;
-
/* due to slab_free_hook() */
if (unlikely(sheaf->size == 0))
goto empty;
@@ -5894,7 +5895,7 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
rcu_sheaf = NULL;
} else {
pcs->rcu_free = NULL;
- rcu_sheaf->node = numa_mem_id();
+ rcu_sheaf->node = numa_node_id();
}
/*
@@ -6121,7 +6122,8 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
return;
- if (likely(!IS_ENABLED(CONFIG_NUMA) || slab_nid(slab) == numa_mem_id())
+ if (likely(!IS_ENABLED(CONFIG_NUMA) || (slab_nid(slab) == numa_mem_id())
+ || !node_isset(slab_nid(slab), slab_nodes))
&& likely(!slab_test_pfmemalloc(slab))) {
if (likely(free_to_pcs(s, object, true)))
return;
@@ -7383,7 +7385,7 @@ static inline int calculate_order(unsigned int size)
}
static void
-init_kmem_cache_node(struct kmem_cache_node *n, struct node_barn *barn)
+init_kmem_cache_node(struct kmem_cache_node *n)
{
n->nr_partial = 0;
spin_lock_init(&n->list_lock);
@@ -7393,9 +7395,6 @@ init_kmem_cache_node(struct kmem_cache_node *n, struct node_barn *barn)
atomic_long_set(&n->total_objects, 0);
INIT_LIST_HEAD(&n->full);
#endif
- n->barn = barn;
- if (barn)
- barn_init(barn);
}
#ifdef CONFIG_SLUB_STATS
@@ -7490,8 +7489,8 @@ static void early_kmem_cache_node_alloc(int node)
n = kasan_slab_alloc(kmem_cache_node, n, GFP_KERNEL, false);
slab->freelist = get_freepointer(kmem_cache_node, n);
slab->inuse = 1;
- kmem_cache_node->node[node] = n;
- init_kmem_cache_node(n, NULL);
+ kmem_cache_node->per_node[node].node = n;
+ init_kmem_cache_node(n);
inc_slabs_node(kmem_cache_node, node, slab->objects);
/*
@@ -7506,15 +7505,20 @@ static void free_kmem_cache_nodes(struct kmem_cache *s)
int node;
struct kmem_cache_node *n;
- for_each_kmem_cache_node(s, node, n) {
- if (n->barn) {
- WARN_ON(n->barn->nr_full);
- WARN_ON(n->barn->nr_empty);
- kfree(n->barn);
- n->barn = NULL;
- }
+ for_each_node(node) {
+ struct node_barn *barn = get_barn_node(s, node);
+
+ if (!barn)
+ continue;
- s->node[node] = NULL;
+ WARN_ON(barn->nr_full);
+ WARN_ON(barn->nr_empty);
+ kfree(barn);
+ s->per_node[node].barn = NULL;
+ }
+
+ for_each_kmem_cache_node(s, node, n) {
+ s->per_node[node].node = NULL;
kmem_cache_free(kmem_cache_node, n);
}
}
@@ -7535,31 +7539,36 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
for_each_node_mask(node, slab_nodes) {
struct kmem_cache_node *n;
- struct node_barn *barn = NULL;
if (slab_state == DOWN) {
early_kmem_cache_node_alloc(node);
continue;
}
- if (cache_has_sheaves(s)) {
- barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, node);
-
- if (!barn)
- return 0;
- }
-
n = kmem_cache_alloc_node(kmem_cache_node,
GFP_KERNEL, node);
- if (!n) {
- kfree(barn);
+ if (!n)
return 0;
- }
- init_kmem_cache_node(n, barn);
+ init_kmem_cache_node(n);
+ s->per_node[node].node = n;
+ }
+
+ if (slab_state == DOWN || !cache_has_sheaves(s))
+ return 1;
+
+ for_each_node_mask(node, slab_barn_nodes) {
+ struct node_barn *barn;
+
+ barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, node);
+
+ if (!barn)
+ return 0;
- s->node[node] = n;
+ barn_init(barn);
+ s->per_node[node].barn = barn;
}
+
return 1;
}
@@ -7848,10 +7857,15 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
if (cache_has_sheaves(s))
rcu_barrier();
+ for_each_node(node) {
+ struct node_barn *barn = get_barn_node(s, node);
+
+ if (barn)
+ barn_shrink(s, barn);
+ }
+
/* Attempt to free all objects */
for_each_kmem_cache_node(s, node, n) {
- if (n->barn)
- barn_shrink(s, n->barn);
free_partial(s, n);
if (n->nr_partial || node_nr_slabs(n))
return 1;
@@ -8061,14 +8075,18 @@ static int __kmem_cache_do_shrink(struct kmem_cache *s)
unsigned long flags;
int ret = 0;
+ for_each_node(node) {
+ struct node_barn *barn = get_barn_node(s, node);
+
+ if (barn)
+ barn_shrink(s, barn);
+ }
+
for_each_kmem_cache_node(s, node, n) {
INIT_LIST_HEAD(&discard);
for (i = 0; i < SHRINK_PROMOTE_MAX; i++)
INIT_LIST_HEAD(promote + i);
- if (n->barn)
- barn_shrink(s, n->barn);
-
spin_lock_irqsave(&n->list_lock, flags);
/*
@@ -8157,7 +8175,11 @@ static int slab_mem_going_online_callback(int nid)
if (get_node(s, nid))
continue;
- if (cache_has_sheaves(s)) {
+ /*
+ * barn might already exist if the node was online but
+ * memoryless
+ */
+ if (cache_has_sheaves(s) && !node_isset(nid, slab_barn_nodes)) {
barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, nid);
if (!barn) {
@@ -8178,15 +8200,20 @@ static int slab_mem_going_online_callback(int nid)
goto out;
}
- init_kmem_cache_node(n, barn);
+ init_kmem_cache_node(n);
+ s->per_node[nid].node = n;
- s->node[nid] = n;
+ if (barn) {
+ barn_init(barn);
+ s->per_node[nid].barn = barn;
+ }
}
/*
* Any cache created after this point will also have kmem_cache_node
* initialized for the new node.
*/
node_set(nid, slab_nodes);
+ node_set(nid, slab_barn_nodes);
out:
mutex_unlock(&slab_mutex);
return ret;
@@ -8265,7 +8292,7 @@ static void __init bootstrap_cache_sheaves(struct kmem_cache *s)
if (!capacity)
return;
- for_each_node_mask(node, slab_nodes) {
+ for_each_node_mask(node, slab_barn_nodes) {
struct node_barn *barn;
barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, node);
@@ -8276,7 +8303,7 @@ static void __init bootstrap_cache_sheaves(struct kmem_cache *s)
}
barn_init(barn);
- get_node(s, node)->barn = barn;
+ s->per_node[node].barn = barn;
}
for_each_possible_cpu(cpu) {
@@ -8337,6 +8364,9 @@ void __init kmem_cache_init(void)
for_each_node_state(node, N_MEMORY)
node_set(node, slab_nodes);
+ for_each_online_node(node)
+ node_set(node, slab_barn_nodes);
+
create_boot_cache(kmem_cache_node, "kmem_cache_node",
sizeof(struct kmem_cache_node),
SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);
@@ -8347,8 +8377,8 @@ void __init kmem_cache_init(void)
slab_state = PARTIAL;
create_boot_cache(kmem_cache, "kmem_cache",
- offsetof(struct kmem_cache, node) +
- nr_node_ids * sizeof(struct kmem_cache_node *),
+ offsetof(struct kmem_cache, per_node) +
+ nr_node_ids * sizeof(struct kmem_cache_per_node_ptrs),
SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);
kmem_cache = bootstrap(&boot_kmem_cache);
--
2.53.0
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-26 18:02 ` Vlastimil Babka (SUSE)
@ 2026-02-27 9:23 ` Ming Lei
2026-03-05 13:05 ` Vlastimil Babka (SUSE)
0 siblings, 1 reply; 25+ messages in thread
From: Ming Lei @ 2026-02-27 9:23 UTC (permalink / raw)
To: Vlastimil Babka (SUSE)
Cc: Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, Harry Yoo, Hao Li, Christoph Hellwig
On Thu, Feb 26, 2026 at 07:02:11PM +0100, Vlastimil Babka (SUSE) wrote:
> On 2/25/26 10:31, Ming Lei wrote:
> > Hi Vlastimil,
> >
> > On Wed, Feb 25, 2026 at 09:45:03AM +0100, Vlastimil Babka (SUSE) wrote:
> >> On 2/24/26 21:27, Vlastimil Babka wrote:
> >> >
> >> > It made sense to me not to refill sheaves when we can't reclaim, but I
> >> > didn't anticipate this interaction with mempools. We could change them
> >> > but there might be others using a similar pattern. Maybe it would be for
> >> > the best to just drop that heuristic from __pcs_replace_empty_main()
> >> > (but carefully as some deadlock avoidance depends on it, we might need
> >> > to e.g. replace it with gfpflags_allow_spinning()). I'll send a patch
> >> > tomorrow to test this theory, unless someone beats me to it (feel free to).
> >> Could you try this then, please? Thanks!
> >
> > Thanks for working on this issue!
> >
> > Unfortunately the patch doesn't make a difference on IOPS in the perf test,
> > follows the collected perf profile on linus tree(basically 7.0-rc1 with your patch):
>
> what about this patch in addition to the previous one? Thanks.
With the two patches, IOPS increases from 13M to 22M, but that is still much
less than the 36M obtained on v6.19-rc5, which the slab-sheaves PR is based on.
Also alloc_slowpath can't be observed any more.
Follows perf profile with the two patches:
- 8.30% 0.19% io_uring [kernel.kallsyms] [k] mempool_alloc_noprof
- 8.11% mempool_alloc_noprof
- 7.64% kmem_cache_alloc_noprof
- 6.15% __pcs_replace_empty_main
- 5.96% refill_sheaf
+ 5.95% refill_objects
+ 8.06% 0.44% io_uring [kernel.kallsyms] [k] kmem_cache_alloc_noprof
+ 7.44% 0.00% kublk [ublk_drv] [k] 0xffffffffc140c71b
+ 6.63% 0.03% kublk [kernel.kallsyms] [k] __io_run_local_work
+ 6.19% 0.05% io_uring [kernel.kallsyms] [k] __pcs_replace_empty_main
- 5.97% 0.01% io_uring [kernel.kallsyms] [k] refill_sheaf
- 5.96% refill_sheaf
- 5.95% refill_objects
- 4.87% __refill_objects_any
- 4.76% __refill_objects_node
0.72% __slab_free
- 1.00% allocate_slab
- 0.80% __alloc_frozen_pages_noprof
- 0.79% get_page_from_freelist
+ 0.72% post_alloc_hook
+ 5.96% 0.02% io_uring [kernel.kallsyms] [k] refill_objects
thanks,
Ming
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-02-27 9:23 ` Ming Lei
@ 2026-03-05 13:05 ` Vlastimil Babka (SUSE)
2026-03-05 15:48 ` Ming Lei
0 siblings, 1 reply; 25+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-03-05 13:05 UTC (permalink / raw)
To: Ming Lei
Cc: Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, Harry Yoo, Hao Li, Christoph Hellwig
On 2/27/26 10:23, Ming Lei wrote:
> On Thu, Feb 26, 2026 at 07:02:11PM +0100, Vlastimil Babka (SUSE) wrote:
>> On 2/25/26 10:31, Ming Lei wrote:
>> > Hi Vlastimil,
>> >
>> > On Wed, Feb 25, 2026 at 09:45:03AM +0100, Vlastimil Babka (SUSE) wrote:
>> >> On 2/24/26 21:27, Vlastimil Babka wrote:
>> >> >
>> >> > It made sense to me not to refill sheaves when we can't reclaim, but I
>> >> > didn't anticipate this interaction with mempools. We could change them
>> >> > but there might be others using a similar pattern. Maybe it would be for
>> >> > the best to just drop that heuristic from __pcs_replace_empty_main()
>> >> > (but carefully as some deadlock avoidance depends on it, we might need
>> >> > to e.g. replace it with gfpflags_allow_spinning()). I'll send a patch
>> >> > tomorrow to test this theory, unless someone beats me to it (feel free to).
>> >> Could you try this then, please? Thanks!
>> >
>> > Thanks for working on this issue!
>> >
>> > Unfortunately the patch doesn't make a difference on IOPS in the perf test,
>> > follows the collected perf profile on linus tree(basically 7.0-rc1 with your patch):
>>
>> what about this patch in addition to the previous one? Thanks.
>
> With the two patches, IOPS increases from 13M to 22M, but that is still much
> less than the 36M obtained on v6.19-rc5, which the slab-sheaves PR is based on.
OK thanks! Maybe now we're approaching the original theories about effective
caching capacity etc...
> Also alloc_slowpath can't be observed any more.
>
> Follows perf profile with the two patches:
What's the full perf profile of v6.19-rc5 and full profile of the patched
7.0-rc2 then? Thanks.
Also contents of all the files under /sys/kernel/slab/$cache (forgot which
particular one it was) with CONFIG_SLUB_STATS=y would be great, thanks.
>
>
> - 8.30% 0.19% io_uring [kernel.kallsyms] [k] mempool_alloc_noprof
> - 8.11% mempool_alloc_noprof
> - 7.64% kmem_cache_alloc_noprof
> - 6.15% __pcs_replace_empty_main
> - 5.96% refill_sheaf
> + 5.95% refill_objects
> + 8.06% 0.44% io_uring [kernel.kallsyms] [k] kmem_cache_alloc_noprof
> + 7.44% 0.00% kublk [ublk_drv] [k] 0xffffffffc140c71b
> + 6.63% 0.03% kublk [kernel.kallsyms] [k] __io_run_local_work
> + 6.19% 0.05% io_uring [kernel.kallsyms] [k] __pcs_replace_empty_main
> - 5.97% 0.01% io_uring [kernel.kallsyms] [k] refill_sheaf
> - 5.96% refill_sheaf
> - 5.95% refill_objects
> - 4.87% __refill_objects_any
> - 4.76% __refill_objects_node
> 0.72% __slab_free
> - 1.00% allocate_slab
> - 0.80% __alloc_frozen_pages_noprof
> - 0.79% get_page_from_freelist
> + 0.72% post_alloc_hook
> + 5.96% 0.02% io_uring [kernel.kallsyms] [k] refill_objects
>
>
> thanks,
> Ming
>
* Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
2026-03-05 13:05 ` Vlastimil Babka (SUSE)
@ 2026-03-05 15:48 ` Ming Lei
0 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2026-03-05 15:48 UTC (permalink / raw)
To: Vlastimil Babka (SUSE)
Cc: Vlastimil Babka, Andrew Morton, linux-mm, linux-kernel,
linux-block, Harry Yoo, Hao Li, Christoph Hellwig
On Thu, Mar 05, 2026 at 02:05:20PM +0100, Vlastimil Babka (SUSE) wrote:
> On 2/27/26 10:23, Ming Lei wrote:
> > On Thu, Feb 26, 2026 at 07:02:11PM +0100, Vlastimil Babka (SUSE) wrote:
> >> On 2/25/26 10:31, Ming Lei wrote:
> >> > Hi Vlastimil,
> >> >
> >> > On Wed, Feb 25, 2026 at 09:45:03AM +0100, Vlastimil Babka (SUSE) wrote:
> >> >> On 2/24/26 21:27, Vlastimil Babka wrote:
> >> >> >
> >> >> > It made sense to me not to refill sheaves when we can't reclaim, but I
> >> >> > didn't anticipate this interaction with mempools. We could change them
> >> >> > but there might be others using a similar pattern. Maybe it would be for
> >> >> > the best to just drop that heuristic from __pcs_replace_empty_main()
> >> >> > (but carefully as some deadlock avoidance depends on it, we might need
> >> >> > to e.g. replace it with gfpflags_allow_spinning()). I'll send a patch
> >> >> > tomorrow to test this theory, unless someone beats me to it (feel free to).
> >> >> Could you try this then, please? Thanks!
> >> >
> >> > Thanks for working on this issue!
> >> >
> >> > Unfortunately the patch doesn't make a difference on IOPS in the perf test,
> >> > follows the collected perf profile on linus tree(basically 7.0-rc1 with your patch):
> >>
> >> what about this patch in addition to the previous one? Thanks.
> >
> > With the two patches, IOPS increases from 13M to 22M, but that is still much
> > less than the 36M obtained on v6.19-rc5, which the slab-sheaves PR is based on.
>
> OK thanks! Maybe now we're approaching the original theories about effective
> caching capacity etc...
>
> > Also alloc_slowpath can't be observed any more.
> >
> > Follows perf profile with the two patches:
>
> What's the full perf profile of v6.19-rc5 and full profile of the patched
> 7.0-rc2 then? Thanks.
>
> Also contents of all the files under /sys/kernel/slab/$cache (forgot which
> particular one it was) with CONFIG_SLUB_STATS=y would be great, thanks.
Please see the following log, and let me know if any other info is needed.
1) v6.19-rc5
- IOPS: 34M
- perf profile
+ perf report --vmlinux=/root/git/linux/vmlinux --kallsyms=/proc/kallsyms --stdio --max-stack 0
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 1M of event 'cycles:P'
# Event count (approx.): 1045386603400
#
# Children Self Command Shared Object Symbol
# ........ ........ ............... ........................ ..............................................
#
14.41% 14.41% kublk [kernel.kallsyms] [k] _copy_from_iter
11.25% 11.25% io_uring [kernel.kallsyms] [k] blk_mq_sched_bio_merge
3.73% 3.73% kublk [kernel.kallsyms] [k] slab_update_freelist.isra.0
3.53% 3.53% kublk [kernel.kallsyms] [k] ublk_dispatch_req
3.33% 3.33% io_uring [kernel.kallsyms] [k] blk_mq_rq_ctx_init.isra.0
2.65% 2.65% kublk [kernel.kallsyms] [k] blk_mq_free_request
2.01% 2.01% io_uring [kernel.kallsyms] [k] blkdev_read_iter
1.92% 1.92% io_uring [kernel.kallsyms] [k] __io_read
1.67% 1.67% io_uring [kernel.kallsyms] [k] blk_mq_submit_bio
1.54% 1.54% kublk [kernel.kallsyms] [k] ublk_ch_uring_cmd_local
1.36% 1.36% io_uring [kernel.kallsyms] [k] __fsnotify_parent
1.30% 1.30% io_uring [kernel.kallsyms] [k] clear_page_erms
1.19% 1.19% io_uring [kernel.kallsyms] [k] llist_reverse_order
1.11% 1.11% io_uring [kernel.kallsyms] [k] blk_cgroup_bio_start
0.98% 0.98% kublk [kernel.kallsyms] [k] __check_object_size
0.98% 0.98% kublk kublk [.] ublk_queue_io_cmd
0.97% 0.97% io_uring [kernel.kallsyms] [k] __submit_bio
0.97% 0.97% kublk [kernel.kallsyms] [k] __slab_free
0.96% 0.96% io_uring [kernel.kallsyms] [k] submit_bio_noacct_nocheck
0.92% 0.92% kublk [kernel.kallsyms] [k] io_issue_sqe
0.91% 0.91% io_uring io_uring [.] submitter_uring_fn
0.88% 0.88% io_uring io_uring [.] get_offset.part.0
0.86% 0.86% io_uring [kernel.kallsyms] [k] kmem_cache_alloc_noprof
0.85% 0.85% kublk [kernel.kallsyms] [k] ublk_copy_user_pages.isra.0
0.77% 0.77% io_uring [kernel.kallsyms] [k] blk_mq_start_request
0.74% 0.74% kublk kublk [.] ublk_null_queue_io
0.74% 0.74% io_uring [kernel.kallsyms] [k] io_import_reg_buf
0.67% 0.67% io_uring [kernel.kallsyms] [k] io_issue_sqe
0.66% 0.66% io_uring [kernel.kallsyms] [k] bio_alloc_bioset
0.66% 0.66% kublk [kernel.kallsyms] [k] kmem_cache_free
0.66% 0.66% io_uring [kernel.kallsyms] [k] __blkdev_direct_IO_async
0.64% 0.64% kublk [kernel.kallsyms] [k] __io_issue_sqe
0.61% 0.61% io_uring [kernel.kallsyms] [k] submit_bio
0.59% 0.59% kublk [kernel.kallsyms] [k] __io_uring_cmd_done
0.58% 0.58% io_uring [kernel.kallsyms] [k] blk_rq_merge_ok
0.56% 0.56% kublk [kernel.kallsyms] [k] __io_submit_flush_completions
0.54% 0.54% kublk kublk [.] __ublk_io_handler_fn.isra.0
0.53% 0.53% kublk [kernel.kallsyms] [k] io_uring_cmd
0.52% 0.52% io_uring [kernel.kallsyms] [k] __io_prep_rw
0.52% 0.52% io_uring [kernel.kallsyms] [k] io_free_batch_list
0.50% 0.50% kublk [kernel.kallsyms] [k] io_uring_cmd_prep
0.49% 0.49% kublk [kernel.kallsyms] [k] blk_account_io_done.part.0
0.49% 0.49% io_uring [kernel.kallsyms] [k] __io_submit_flush_completions
- slab stat
# (cd /sys/kernel/slab/bio-256/ && find . -type f -exec grep -aH . {} \;)
./remote_node_defrag_ratio:100
./free_frozen:203789653 C0=13137513 C2=16103904 C4=5312681 C6=9805649 C8=14262027 C10=13676236 C12=8700700 C14=13041782 C16=11558292 C18=13258018 C19=2 C20=2813290 C22=7752577 C24=19173693 C26=16631916 C28=21707419 C29=2 C30=16853951 C31=1
./total_objects:6732 N1=3315 N5=3417
./cpuslab_flush:0
./alloc_fastpath:1284958471 C1=80252197 C3=80197810 C4=125 C5=82882536 C6=125 C7=83898247 C8=125 C9=81412735 C11=80400026 C12=125 C13=78664565 C14=44 C15=80954403 C17=80070327 C19=75310035 C20=125 C21=83788507 C22=81 C23=84943484 C25=78466239 C26=125 C27=78389061 C29=76890573 C31=78436849 C50=1 C60=1
./cpu_partial_free:37988123 C0=2275928 C2=2190868 C4=2789178 C6=2685497 C8=2282195 C10=2266792 C12=2340158 C14=2302589 C16=2359282 C18=2154683 C20=3028332 C22=2921916 C24=2103757 C26=2157902 C28=1972836 C30=2156210
./cpu_slabs:58 N1=28 N5=30
./objects:6167 N1=3092 N5=3075
./deactivate_full:0
./sheaf_return_slow:0
./objects_partial:608 N1=287 N5=321
./sheaf_return_fast:0
./cpu_partial:52
./cmpxchg_double_cpu_fail:1 C7=1
./free_slowpath:1361594822 C0=85109840 C2=85495921 C4=86775189 C6=88474098 C8=86495486 C10=85287670 C12=82701232 C14=85802194 C16=84711284 C18=79945983 C19=2 C20=87399505 C22=89361232 C24=84116440 C26=83560456 C28=82780090 C29=2 C30=83578197 C31=1
./barn_get_fail:0
./sheaf_prefill_oversize:0
./deactivate_to_tail:0
./skip_kfence:0
./min_partial:5
./order_fallback:0
./sheaf_capacity:0
./deactivate_empty:3616332 C0=269533 C2=262401 C4=116355 C6=112383 C8=271620 C10=266348 C12=278359 C14=271083 C16=264315 C18=242601 C20=170557 C22=159604 C24=231322 C26=240708 C28=220103 C30=239040
./sheaf_flush:0
./free_rcu_sheaf:0
./alloc_from_partial:11612237 C1=660211 C3=634301 C5=949155 C6=1 C7=914355 C9=661811 C11=658753 C13=679880 C15=669226 C17=684745 C19=624788 C20=1 C21=1037955 C22=1 C23=1002678 C25=611243 C27=625403 C29=571631 C31=626099
./sheaf_alloc:0
./sheaf_free:0
./sheaf_prefill_slow:0
./sheaf_prefill_fast:0
./poison:0
./red_zone:0
./free_cpu_sheaf:0
./free_slab:3616434 C0=269535 C2=262407 C4=116368 C6=112391 C8=271622 C10=266351 C12=278359 C14=271084 C16=264354 C18=242601 C20=170559 C22=159611 C24=231322 C26=240711 C28=220114 C30=239045
./slabs:132 N1=65 N5=67
./barn_get:0
./cpu_partial_node:22759400 C1=1312100 C3=1260562 C5=1821488 C6=2 C7=1752623 C9=1315094 C11=1309216 C13=1351244 C15=1329937 C17=1360857 C19=1241554 C20=2 C21=1968791 C22=2 C23=1898000 C25=1214784 C27=1242922 C29=1136091 C31=1244131
./alloc_slowpath:76640471 C1=4857913 C3=5298367 C4=3 C5=3892806 C6=3 C7=4575965 C8=3 C9=5082878 C11=4887906 C12=3 C13=4036796 C14=1 C15=4848003 C17=4641269 C19=4636149 C20=3 C21=3611116 C22=2 C23=4417922 C25=5650460 C26=3 C27=5171520 C29=5889792 C31=5141585 C50=1 C60=1 C62=1
./destroy_by_rcu:1
./free_rcu_sheaf_fail:0
./barn_put:0
./usersize:0
./sanity_checks:0
./barn_put_fail:0
./align:64
./alloc_node_mismatch:0
./deactivate_remote_frees:0
./alloc_slab:3616566 C1=303677 C3=296031 C4=3 C5=18366 C7=18301 C8=3 C9=305344 C11=298932 C12=3 C13=309156 C14=1 C15=303522 C17=313382 C19=288344 C21=21685 C23=21353 C25=277789 C26=3 C27=289631 C29=265057 C31=285980 C50=1 C60=1 C62=1
./free_remove_partial:102 C0=2 C2=6 C4=13 C6=8 C8=2 C10=3 C14=1 C16=39 C20=2 C22=7 C26=3 C28=11 C30=5
./aliases:0
./store_user:0
./trace:0
./reclaim_account:0
./order:2
./sheaf_refill:0
./object_size:256
./alloc_refill:38652283 C1=2581925 C3=3107474 C5=1103799 C7=1890686 C9=2800630 C11=2621006 C13=1696518 C15=2545318 C17=2282285 C19=2481464 C21=582686 C23=1495892 C25=3546646 C27=3013564 C29=3917013 C31=2985377
./alloc_cpu_sheaf:0
./cpu_partial_drain:12662698 C0=758642 C2=730289 C4=929725 C6=895165 C8=760731 C10=755597 C12=780052 C14=767529 C16=786427 C18=718227 C20=1009443 C22=973972 C24=701252 C26=719300 C28=657611 C30=718736
./free_fastpath:4 C1=2 C11=2
./hwcache_align:1
./cpu_partial_alloc:22759385 C1=1312100 C3=1260561 C5=1821486 C6=2 C7=1752623 C9=1315093 C11=1309215 C13=1351242 C15=1329937 C17=1360857 C19=1241553 C20=2 C21=1968790 C22=1 C23=1897999 C25=1214782 C27=1242922 C29=1136091 C31=1244129
./cmpxchg_double_fail:6247305 C0=396268 C1=16193 C2=484201 C3=11558 C4=198887 C5=7233 C6=336779 C7=7332 C8=444665 C9=11539 C10=403230 C11=10130 C12=258163 C13=6666 C14=389004 C15=9620 C16=357182 C17=9184 C18=378255 C19=9012 C20=103655 C21=2375 C22=260015 C23=6160 C24=552885 C25=22738 C26=464990 C27=11172 C28=592307 C29=23777 C30=451529 C31=10601
./deactivate_bypass:37988161 C1=2275987 C3=2190892 C4=2 C5=2789006 C6=2 C7=2685278 C8=2 C9=2282247 C11=2266899 C12=2 C13=2340277 C15=2302684 C17=2358983 C19=2154684 C20=2 C21=3028429 C22=1 C23=2922029 C25=2103813 C26=2 C27=2157955 C29=1972778 C31=2156207
./objs_per_slab:51
./partial:23 N1=10 N5=13
./slabs_cpu_partial:1122(44) C0=51(2) C2=25(1) C3=25(1) C4=76(3) C5=51(2) C6=51(2) C8=51(2) C9=25(1) C10=25(1) C11=25(1) C12=51(2) C13=51(2) C14=51(2) C16=25(1) C18=51(2) C19=25(1) C20=76(3) C21=25(1) C22=25(1) C23=25(1) C24=25(1) C25=51(2) C26=51(2) C28=76(3) C30=51(2) C31=51(2)
./free_add_partial:34371762 C0=2006393 C2=1928466 C4=2672820 C6=2573112 C8=2010573 C10=2000443 C12=2061797 C14=2031504 C16=2094966 C18=1912080 C20=2857772 C22=2762312 C24=1872434 C26=1917192 C28=1752730 C30=1917168
./slab_size:320
./cache_dma:0
./deactivate_to_head:0
2) v7.0-rc2 (commit c107785c7e8d) + two patches
- IOPS: 23M
- perf profile
+ perf report --vmlinux=/root/git/linux/vmlinux --kallsyms=/proc/kallsyms --stdio --max-stack 0
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 858K of event 'cycles:P'
# Event count (approx.): 667558170118
#
# Children Self Command Shared Object Symbol
# ........ ........ ............... .................................. ..............................................
#
10.81% 10.81% kublk [kernel.kallsyms] [k] _copy_from_iter
5.23% 5.23% io_uring [kernel.kallsyms] [k] blk_mq_submit_bio
3.97% 3.97% io_uring [kernel.kallsyms] [k] __refill_objects_node
2.69% 2.69% io_uring [kernel.kallsyms] [k] io_rw_init_file
2.61% 2.61% io_uring [kernel.kallsyms] [k] blk_cgroup_bio_start
2.55% 2.55% io_uring [kernel.kallsyms] [k] blk_mq_rq_ctx_init.isra.0
2.52% 2.52% kublk [kernel.kallsyms] [k] blk_mq_free_request
2.45% 2.45% kublk [kernel.kallsyms] [k] ublk_dispatch_req
2.18% 2.18% io_uring [kernel.kallsyms] [k] __fsnotify_parent
1.87% 1.87% kublk [kernel.kallsyms] [k] __slab_free
1.82% 1.82% io_uring [kernel.kallsyms] [k] __io_read
1.77% 1.77% kublk [kernel.kallsyms] [k] slab_update_freelist.isra.0
1.72% 1.72% kublk [kernel.kallsyms] [k] __io_uring_cmd_done
1.70% 1.70% io_uring [kernel.kallsyms] [k] security_file_permission
1.68% 1.68% io_uring [kernel.kallsyms] [k] io_req_task_complete
1.51% 1.51% kublk [kernel.kallsyms] [k] ublk_start_io
1.32% 1.32% io_uring [kernel.kallsyms] [k] llist_reverse_order
1.30% 1.30% io_uring [kernel.kallsyms] [k] submit_bio_noacct_nocheck
1.22% 1.22% kublk [kernel.kallsyms] [k] blk_account_io_done.part.0
1.15% 1.15% io_uring [kernel.kallsyms] [k] kernel_init_pages
1.11% 1.11% kublk [kernel.kallsyms] [k] __local_bh_enable_ip
1.03% 1.03% io_uring [kernel.kallsyms] [k] io_import_reg_buf
1.03% 1.03% kublk [kernel.kallsyms] [k] ublk_ch_uring_cmd_local
1.01% 1.01% io_uring [kernel.kallsyms] [k] wbt_issue
0.97% 0.97% io_uring [kernel.kallsyms] [k] __submit_bio
0.81% 0.81% kublk [kernel.kallsyms] [k] avc_has_perm
0.80% 0.80% io_uring [kernel.kallsyms] [k] __rq_qos_issue
0.76% 0.76% kublk [kernel.kallsyms] [k] __blk_mq_free_request
0.73% 0.73% kublk kublk [.] ublk_queue_io_cmd
0.73% 0.73% io_uring io_uring [.] submitter_uring_fn
0.67% 0.67% io_uring [kernel.kallsyms] [k] kmem_cache_alloc_noprof
0.65% 0.65% kublk [kernel.kallsyms] [k] __io_submit_flush_completions
0.62% 0.62% kublk [kernel.kallsyms] [k] blk_stat_add
0.62% 0.62% kublk [kernel.kallsyms] [k] __ublk_complete_rq
0.61% 0.61% kublk [kernel.kallsyms] [k] blk_update_request
0.60% 0.60% kublk [kernel.kallsyms] [k] __blk_mq_end_request
0.58% 0.58% io_uring [kernel.kallsyms] [k] bio_alloc_bioset
0.56% 0.56% kublk [kernel.kallsyms] [k] __rcu_read_lock
0.54% 0.54% io_uring [kernel.kallsyms] [k] io_req_rw_complete
0.54% 0.54% io_uring [kernel.kallsyms] [k] io_free_batch_list
0.53% 0.53% io_uring [kernel.kallsyms] [k] __io_submit_flush_completions
0.53% 0.53% io_uring [kernel.kallsyms] [k] io_init_req
0.53% 0.53% io_uring [kernel.kallsyms] [k] __blkdev_direct_IO_async
0.53% 0.53% kublk [kernel.kallsyms] [k] io_issue_sqe
0.51% 0.51% io_uring [kernel.kallsyms] [k] blk_mq_start_request
0.51% 0.51% kublk [kernel.kallsyms] [k] io_req_local_work_add
0.51% 0.51% kublk [kernel.kallsyms] [k] kmem_cache_free
0.49% 0.49% io_uring [kernel.kallsyms] [k] io_import_fixed
- slab stat
# (cd /sys/kernel/slab/bio-256/ && find . -type f -exec grep -aH . {} \;)
./remote_node_defrag_ratio:100
./total_objects:9078 N1=4233 N5=4845
./alloc_fastpath:897715187 C1=45250242 C3=50602079 C5=89955493 C6=128 C7=81923744 C8=128 C9=46275792 C10=128 C11=46037573 C12=128 C13=53037806 C14=128 C15=49291969 C16=128 C17=49716073 C18=4 C19=45475417 C20=130 C21=75693223 C22=128 C23=69595236 C24=128 C25=52992066 C26=1 C27=51082176 C28=66 C29=44931239 C30=2 C31=45853827 C48=2 C59=2 C63=1
./cpu_slabs:0
./objects:5404 N1=2665 N5=2739
./sheaf_return_slow:0
./objects_partial:3772 N1=1849 N5=1923
./sheaf_return_fast:0
./cpu_partial:0
./free_slowpath:580544104 C0=45249992 C2=50601817 C4=2 C6=2 C8=46275666 C10=46037443 C12=53037685 C14=49291858 C16=49715937 C18=45475167 C20=13 C22=21 C24=52991949 C26=51081920 C28=44931147 C30=45853478 C49=2 C59=2 C61=2 C63=1
./barn_get_fail:20733914 C1=1616081 C3=1807218 C5=23 C6=1 C7=10 C8=5 C9=1652707 C10=5 C11=1644200 C12=5 C13=1894208 C14=5 C15=1760428 C16=5 C17=1775575 C18=1 C19=1624123 C20=4 C21=6 C22=5 C23=21 C24=5 C25=1892574 C26=1 C27=1824364 C28=3 C29=1604692 C31=1637636 C48=1 C59=1 C63=1
./sheaf_prefill_oversize:0
./skip_kfence:0
./min_partial:5
./order_fallback:0
./sheaf_capacity:28
./sheaf_flush:84 C4=28 C20=28 C22=28
./free_rcu_sheaf:0
./sheaf_alloc:120 C1=1 C3=1 C4=10 C5=1 C6=3 C8=1 C9=1 C10=1 C11=1 C12=1 C13=1 C14=1 C15=2 C16=1 C17=1 C18=65 C19=1 C20=6 C22=7 C23=1 C24=1 C25=1 C26=1 C27=2 C28=1 C29=3 C31=1 C48=1 C59=1 C63=1
./sheaf_free:0
./sheaf_prefill_slow:0
./sheaf_prefill_fast:0
./poison:0
./red_zone:0
./free_slab:2252626 C0=217768 C2=178064 C8=177352 C10=188763 C12=172593 C14=195156 C16=189975 C18=178757 C24=179036 C26=187498 C28=189290 C30=198374
./slabs:178 N1=83 N5=95
./barn_get:11327366 C5=3212674 C6=4 C7=2925838 C20=1 C21=2703324 C23=2485524 C31=1
./alloc_slowpath:0
./destroy_by_rcu:1
./free_rcu_sheaf_fail:0
./barn_put:11327384 C4=3212682 C6=2925843 C20=2703321 C22=2485537 C28=1
./usersize:0
./sanity_checks:0
./barn_put_fail:3 C4=1 C20=1 C22=1
./align:64
./alloc_node_mismatch:0
./alloc_slab:2252805 C1=175514 C3=194092 C5=9 C8=2 C9=178894 C10=1 C11=177472 C12=5 C13=202580 C14=1 C15=190184 C16=2 C17=194266 C18=1 C19=179219 C20=1 C24=1 C25=205268 C27=197542 C29=177638 C30=1 C31=180109 C48=1 C59=1 C63=1
./free_remove_partial:2252626 C0=217768 C2=178064 C8=177352 C10=188763 C12=172593 C14=195156 C16=189975 C18=178757 C24=179036 C26=187498 C28=189290 C30=198374
./aliases:0
./store_user:0
./trace:0
./reclaim_account:0
./order:2
./sheaf_refill:580549592 C1=45250268 C3=50602104 C5=644 C6=28 C7=280 C8=140 C9=46275796 C10=140 C11=46037600 C12=140 C13=53037824 C14=140 C15=49291984 C16=140 C17=49716100 C18=28 C19=45475444 C20=112 C21=168 C22=140 C23=588 C24=140 C25=52992072 C26=28 C27=51082192 C28=140 C29=44931264 C30=56 C31=45853808 C48=28 C59=28 C63=28
./object_size:256
./free_fastpath:317166967 C4=89955177 C6=81923631 C20=75693051 C22=69595106 C53=2
./hwcache_align:1
./cmpxchg_double_fail:2518173 C0=176664 C1=2695 C2=217757 C3=3198 C5=1 C7=2 C8=201087 C9=2783 C10=199017 C11=2839 C12=222146 C13=3146 C14=215883 C15=3113 C16=208327 C17=3279 C18=199045 C19=3150 C21=1 C24=225357 C25=3445 C26=218118 C27=3183 C28=197089 C29=3083 C30=200710 C31=3055
./objs_per_slab:51
./partial:146 N1=67 N5=79
./slabs_cpu_partial:0(0)
./free_add_partial:29301755 C0=958223 C1=1323669 C2=1094832 C3=1480537 C4=1 C5=11 C7=7 C8=990458 C9=1354237 C10=983110 C11=1348401 C12=1142550 C13=1553516 C14=1063308 C15=1441153 C16=1029143 C17=1463851 C18=949985 C19=1335529 C20=7 C21=3 C22=7 C23=17 C24=1107242 C25=1562991 C26=1056543 C27=1505571 C28=939243 C29=1319586 C30=948581 C31=1349443
./slab_size:320
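To make the dump above easier to read: each stat line is "TOTAL Cx=n ..." with per-CPU counts, and the cross-CPU pattern shows up as `free_slowpath` dominating on one set of CPUs while `alloc_fastpath`/`free_fastpath` dominate on another. A small sketch (hypothetical helper, not part of the report) of computing the slow-path free share from two of the lines above:

```python
def parse_stat(line):
    """Parse a SLUB stat line of the form 'TOTAL C0=n C2=n ...'.

    Returns (total, {cpu: count}). Per-CPU fields may be truncated
    here for brevity; only the leading TOTAL matters for the ratio.
    """
    fields = line.split()
    total = int(fields[0])
    per_cpu = {}
    for f in fields[1:]:
        cpu, _, count = f.partition("=")
        per_cpu[int(cpu[1:])] = int(count)
    return total, per_cpu

# Leading totals taken from ./free_slowpath and ./free_fastpath above.
free_slow, _ = parse_stat("580544104 C0=45249992 C2=50601817")
free_fast, _ = parse_stat("317166967 C4=89955177 C6=81923631")

slow_ratio = free_slow / (free_slow + free_fast)
# ~65% of frees take the slow path, consistent with the persistent
# cross-CPU alloc/free pattern described in the report.
print(f"slow-path free ratio: {slow_ratio:.0%}")
```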
Thanks,
Ming
Thread overview: 25+ messages
2026-02-24 2:52 [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation Ming Lei
2026-02-24 5:00 ` Harry Yoo
2026-02-24 9:07 ` Ming Lei
2026-02-25 5:32 ` Hao Li
2026-02-25 6:54 ` Harry Yoo
2026-02-25 7:06 ` Hao Li
2026-02-25 7:19 ` Harry Yoo
2026-02-25 8:19 ` Hao Li
2026-02-25 8:41 ` Harry Yoo
2026-02-25 8:54 ` Hao Li
2026-02-25 8:21 ` Harry Yoo
2026-02-24 6:51 ` Hao Li
2026-02-24 7:10 ` Harry Yoo
2026-02-24 7:41 ` Hao Li
2026-02-24 20:27 ` Vlastimil Babka
2026-02-25 5:24 ` Harry Yoo
2026-02-25 8:45 ` Vlastimil Babka (SUSE)
2026-02-25 9:31 ` Ming Lei
2026-02-25 11:29 ` Vlastimil Babka (SUSE)
2026-02-25 12:24 ` Ming Lei
2026-02-25 13:22 ` Vlastimil Babka (SUSE)
2026-02-26 18:02 ` Vlastimil Babka (SUSE)
2026-02-27 9:23 ` Ming Lei
2026-03-05 13:05 ` Vlastimil Babka (SUSE)
2026-03-05 15:48 ` Ming Lei