linux-mm.kvack.org archive mirror
From: Harry Yoo <harry.yoo@oracle.com>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: Ming Lei <ming.lei@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-block@vger.kernel.org, Hao Li <hao.li@linux.dev>,
	Christoph Hellwig <hch@infradead.org>
Subject: Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
Date: Wed, 25 Feb 2026 14:24:51 +0900	[thread overview]
Message-ID: <aZ6Ho9YXzLzVOZUz@hyeyoo> (raw)
In-Reply-To: <5cf75a95-4bb9-48e5-af94-ef8ec02dcd4d@suse.cz>

On Tue, Feb 24, 2026 at 09:27:40PM +0100, Vlastimil Babka wrote:
> On 2/24/26 3:52 AM, Ming Lei wrote:
> > Hello Vlastimil and MM guys,
> > 
> > The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
> > 'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
> > performance regression for workloads with persistent cross-CPU
> > alloc/free patterns. ublk null target benchmark IOPS drops
> > significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
> > drop).
> > 
> > Bisecting within the sheaves series is blocked by a kernel panic at
> > 17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
> > paths"), so the exact first bad commit could not be identified.
> > 
> > Reproducer
> > ==========
> > 
> > Hardware: NUMA machine with >= 32 CPUs
> > Kernel:   v7.0-rc (with slab/for-7.0/sheaves merged)
> > 
> >     # build kublk selftest
> >     make -C tools/testing/selftests/ublk/
> > 
> >     # create ublk null target device with 16 queues
> >     tools/testing/selftests/ublk/kublk add -t null -q 16
> > 
> >     # run fio/t/io_uring benchmark: 16 jobs, 20 seconds, non-polled
> >     taskset -c 0-31 fio/t/io_uring -p0 -n 16 -r 20 /dev/ublkb0
> > 
> >     # cleanup
> >     tools/testing/selftests/ublk/kublk del -n 0
> > 
> > Good: v6.19 (and 41f1a08645ab, the mainline parent of the slab merge)
> > Bad:  815c8e35511d (Merge branch 'slab/for-7.0/sheaves' into slab/for-next)
> > 
> > perf profile (bad kernel)
> > =========================
> > 
> > ~47% of CPU time is spent in bio allocation hitting the SLUB slow path,
> > with massive spinlock contention on the node partial list lock:
> > 
> > +   47.65%     1.21%  io_uring  [k] bio_alloc_bioset
> > -   44.87%     0.45%  io_uring  [k] kmem_cache_alloc_noprof
> >    - 44.41% kmem_cache_alloc_noprof
> >       - 43.89% ___slab_alloc
> >          + 41.16% get_from_any_partial
> 
> So this function is not used in the sheaf refill path, but in the
> fallback slowpath when alloc_from_pcs() fastpath fails.

Good point.

> >            0.91% get_from_partial_node
> >          + 0.87% alloc_from_new_slab
> >          + 0.65% allocate_slab
> > -   44.70%     0.21%  io_uring  [k] mempool_alloc_noprof
> >    - 44.49% mempool_alloc_noprof
> >       - 44.43% kmem_cache_alloc_noprof
> 
> And I'd guess alloc_from_pcs() fails because in
> __pcs_replace_empty_main() we have gfpflags_allow_blocking() false,
> because mempool_alloc_noprof() tries the first attempt without
> __GFP_DIRECT_RECLAIM. So that will succeed, but we end up relying on the
> slowpath all the time and performance will drop.

That's a very good point. I was missing that aspect.

> It made sense to me not to refill sheaves when we can't reclaim, but I
> didn't anticipate this interaction with mempools.

Me neither :)

> We could change them but there might be others using a similar pattern.

Probably, yes.

> Maybe it would be for the best to just drop that heuristic from
> __pcs_replace_empty_main()

Sounds fair.

> (but carefully as some deadlock avoidance depends on it, we might need
> to e.g. replace it with gfpflags_allow_spinning()). I'll send a patch
> tomorrow to test this theory, unless someone beats me to it (feel free to).

I think your point is valid. Let's give it a try.

> Until then IMHO we can dismiss the AI explanation and also the
> insufficient sheaf capacity theories.

Yeah :) let's first see how it performs after addressing your point.

-- 
Cheers,
Harry / Hyeonggon



Thread overview: 14+ messages
2026-02-24  2:52 Ming Lei
2026-02-24  5:00 ` Harry Yoo
2026-02-24  9:07   ` Ming Lei
2026-02-25  5:32     ` Hao Li
2026-02-25  6:54       ` Harry Yoo
2026-02-25  7:06         ` Hao Li
2026-02-25  7:19           ` Harry Yoo
2026-02-25  8:19             ` Hao Li
2026-02-25  8:21             ` Harry Yoo
2026-02-24  6:51 ` Hao Li
2026-02-24  7:10   ` Harry Yoo
2026-02-24  7:41     ` Hao Li
2026-02-24 20:27 ` Vlastimil Babka
2026-02-25  5:24   ` Harry Yoo [this message]
