From: Hao Li <hao.li@linux.dev>
To: "Harry Yoo (Oracle)" <harry@kernel.org>
Cc: vbabka@kernel.org, akpm@linux-foundation.org, cl@gentwo.org,
rientjes@google.com, roman.gushchin@linux.dev,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
"Liam R. Howlett" <Liam.Howlett@oracle.com>
Subject: Re: [RFC PATCH] slub: spill refill leftover objects into percpu sheaves
Date: Mon, 20 Apr 2026 19:40:55 +0800 [thread overview]
Message-ID: <ybffdlcvyls3cmen67b3ewno3vwdag6timnqjcoomipd2ei5sg@r3goxchyn3ib> (raw)
In-Reply-To: <aeHMfXldGQ0ANL-K@hyeyoo>
On Fri, Apr 17, 2026 at 03:00:29PM +0900, Harry Yoo (Oracle) wrote:
> On Thu, Apr 16, 2026 at 03:58:46PM +0800, Hao Li wrote:
> > On Wed, Apr 15, 2026 at 07:20:21PM +0900, Harry Yoo (Oracle) wrote:
> > > On Tue, Apr 14, 2026 at 05:59:48PM +0800, Hao Li wrote:
> > > > On Tue, Apr 14, 2026 at 05:39:40PM +0900, Harry Yoo (Oracle) wrote:
> > > > > On Fri, Apr 10, 2026 at 07:16:57PM +0800, Hao Li wrote:
> > > > > Where do you think the improvement comes from? (hopefully w/ some data)
> > > >
> > > > Yes, this is necessary.
> > > >
> > > > > e.g.:
> > > > > 1. the benefit comes largely or partly from
> > > > > reduced contention on n->list_lock.
> > > >
> > > > Before this patch is applied, the mmap benchmark shows the following hot path:
> > > >
> > > >   - 7.85% native_queued_spin_lock_slowpath
> > > >      - 7.85% _raw_spin_lock_irqsave
> > > >         - 3.69% __slab_free
> > > >            + 1.84% __refill_objects_node
> > > >            + 1.77% __kmem_cache_free_bulk
> > > >         + 3.27% __refill_objects_node
> > > >
> > > > With the patch applied, the __refill_objects_node -> __slab_free hotspot goes
> > > > away, and the native_queued_spin_lock_slowpath drops to roughly 3.5%.
> > >
> > > Sounds like returning slabs indeed increases contention on the slowpath.
> >
> > Indeed!
> >
> > > > The
> > > > remaining lock contention is mostly between __refill_objects_node ->
> > > > add_partial and __kmem_cache_free_bulk -> __slab_free.
> > > >
> > > > >
> > > > > 2. this change reduces the # of alloc slowpath hits at the cost of
> > > > > more free slowpath hits, but that's better because the slowpath frees
> > > > > are mostly lockless.
> > > >
> > > > The alloc slowpath remains at 0 both with and without the patch, whereas the
> > >
> > > (assuming you used SLUB_STATS for this)
> >
> > Yes, I enabled it.
> >
> > > That's weird, I think we should check SHEAF_REFILL instead of
> > > ALLOC_SLOWPATH.
> >
> > Yes, I will compare each metric in later testing. Maybe that will give us
> > more clues.
> >
> > > > free slowpath increases by 2x after applying the patch.
> > >
> > > from which cache was this stat collected?
> >
> > It's for /sys/kernel/slab/maple_node/
>
> Ack. And you also mentioned (off-list) that kmem_cache_refill_sheaf()
> is not on the profile. That's good to know.
>
> > > > > 3. the alloc/free pattern of the workload is benefiting from
> > > > > spilling objects to the CPU's sheaves.
> > > > >
> > > > > or something else?
> > > >
> > > > The 2-5% throughput improvement does seem to come with some trade-offs.
> > > > The main one is that leftover objects get hidden in the percpu sheaves now,
> > > > which reduces the number of objects on the node partial list and thus indirectly
> > > > increases slab alloc/free frequency to about 4x of the baseline.
> > > >
> > > > This is a drawback of the current approach. :/
> > >
> > > Sounds like s->min_partial is too small now that we cache more objects
> > > per CPU.
> >
> > Exactly. For the mmap test case, the slab partial list keeps thrashing.
> > It makes me wonder whether SLUB might handle transient pressure better
> > if empty slabs could be regulated with a "dynamic burst threshold".
>
> Haha, we'll be constantly challenged to find balance between "sacrifice
> memory to make every benchmark happy" vs. "provide reasonable
> scalability in general but let users tune it themselves".
>
> If we could implement a reasonably simple yet effective automatic tuning
> method, having one in the kernel would be nice (though of course having
> it in userspace would be the best).
Yes, I'm gradually coming to feel that SLUB's flow is so tight and simple that
a one-size-fits-all optimization is very hard.
It might be better to just export some parameters to userspace and let users
tune them.
After all, introducing an auto-tuning mechanism into a core allocator like SLUB
might make it as unpredictable and hard to control as memory reclaim. :P
>
> > > /me wonders if increasing sheaf capacity would make more sense
> > > rather than optimizing slowpath (if it comes with increased memory
> > > usage anyway),
> >
> > Yes, finding ways to avoid falling into the slowpath is also very worthwhile.
>
> Could you please take a look at how much changing 1) the sheaf capacity and
> 2) the number of full/empty sheaves in the barn affects mmap / ublk
> performance?
>
> I've been trying to reproduce the regression on my machine but haven't
> had much success so far :(
>
> (I'll try to post the RFC patchset to allow changing those parameters
> at runtime in a few weeks, but if you're eager you could try experimenting
> by changing the code :D)
Sure thing. I'll run some tests and organize the data.
>
> > > but then stares at his (yet) unfinished patch series...
> > >
> > > > I experimented with several alternative ideas, and the pattern seems fairly
> > > > consistent: as soon as leftover objects are hidden at the percpu level, slab
> > > > alloc/free churn tends to go up.
> > > >
> > > > > > Signed-off-by: Hao Li <hao.li@linux.dev>
> > > > > > ---
> > > > > >
> > > > > > This patch is an exploratory attempt to address the leftover objects and
> > > > > > partial slab issues in the refill path, and it is marked as RFC to warmly
> > > > > > welcome any feedback, suggestions, and discussion!
> > > > >
> > > > > Yeah, let's discuss!
> > > >
> > > > Sure! Thanks for the discussion!
> > > >
> > > > >
> > > > > By the way, have you also been considering having min-max capacity
> > > > > for sheaves? (that I think Vlastimil suggested somewhere)
> > > >
> > > > Yes, I also tried it.
> > > >
> > > > I experimented with using a manually chosen threshold to allow refill to leave
> > > > the sheaf in a partially filled state. However, since concurrent frees are
> > > > inherently unpredictable, this seems to only reduce the probability of
> > > > generating leftover objects,
> > >
> > > If concurrent frees are a problem we could probably grab slab->freelist
> > > under n->list_lock (e.g. keep them at the end of the sheaf) and fill the
> > > sheaf outside the lock to avoid grabbing too many objects.
> >
> > Do you mean doing an on-list bulk allocation?
>
> Just brainstorming... it's quite messy :)
> something like
>
> __refill_objects_node(s, p, gfp, min, max, n, allow_spin) {
> // in practice we don't know how many slabs we'll grab.
> // so probably keep them somewhere e.g.) the end of `p` array?
> void *freelists[min];
> nr_freelists = 0;
> nr_objs = 0;
>
> spin_lock_irqsave();
> for each slab in n->partial {
> freelist = slab->freelist;
> do {
> [...]
> old.freelist = slab->freelist;
> [...]
> } while (!__slab_update_freelist(...));
>
> freelists[nr_freelists++] = old.freelist;
> nr_objs += (old.objects - old.inuse);
> if (!new.inuse)
> remove_partial();
> if (nr_objs >= min)
> break;
> }
> spin_unlock_irqrestore();
>
> i = 0;
> j = 0;
> while (i < nr_freelists) {
> freelist = freelists[i++];
> while (freelist != NULL) {
> if (j == max) {
> // free remaining objects
> }
> next = get_freepointer(s, freelist);
> p[j++] = freelist;
> freelist = next;
> }
> }
> }
>
> This way, we know how many objects we grabbed but yeah it's tricky.
Thanks for the brainstorming.
If we do an atomic operation like __slab_update_freelist under the lock, I'm
worried it might prolong the critical section. But testing is the best way to
know for sure, and I'm quite curious, so it's definitely worth a try.
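To convince myself the overall shape works, I mocked the "detach whole
freelists under the lock, walk them outside" split as a userspace toy (plain
pointers, no cmpxchg or locking; all names here are made up for illustration,
not kernel code):

```c
#include <assert.h>
#include <stddef.h>

struct toy_obj {
	struct toy_obj *next;
};

/*
 * Models only phase 2 of the sketch: the freelists were already
 * detached (phase 1, which would run under n->list_lock without
 * walking them, keeping the critical section short).  Here, outside
 * the lock, walk each detached freelist and scatter the objects
 * into @p, up to @max entries.  Returns how many were scattered;
 * anything beyond @max would have to be freed back.
 */
static unsigned int toy_drain(struct toy_obj **freelists,
			      unsigned int nr_freelists,
			      struct toy_obj **p, unsigned int max)
{
	unsigned int i, j = 0;

	for (i = 0; i < nr_freelists; i++) {
		struct toy_obj *obj = freelists[i];

		while (obj && j < max) {
			p[j++] = obj;
			obj = obj->next;
		}
	}
	return j;
}
```

The upside of this split is that the per-object pointer chasing happens with
the lock dropped; the per-slab cmpxchg is the only thing left inside.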
>
> > > > while at the same time affecting alloc-side throughput.
> > >
> > > Shouldn't we set sheaf's min capacity as the same as
> > > s->sheaf_capacity and allow higher max capacity to avoid this?
> >
> > I'm not sure I fully understand this. Since the array size is fixed, how would
> > we allow more entries to be filled?
>
> I don't really want to speak on behalf of Vlastimil but I was imagining
> something like:
>
> before: sheaf->capacity (32, min = max);
> after: sheaf->capacity (48 or 64, max), sheaf->threshold (32, min)
>
> so that sheaf refill will succeed if at least ->threshold objects
> are filled, but the threshold better not be smaller than 32 (the
> previous sheaf->capacity)?
I feel that simply increasing the sheaf capacity should generally improve
overall performance, although we'd likely see more slab alloc/free churn
compared to the baseline.
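If I understand the min/max idea correctly, it would look roughly like this
userspace toy (field and function names are made up for illustration, not the
real struct slab_sheaf):

```c
#include <assert.h>

/*
 * Toy model of a sheaf with a soft refill threshold (min) below a
 * hard capacity (max).  Numbers and names are illustrative only.
 */
struct toy_sheaf {
	unsigned int capacity;   /* hard limit, e.g. 48 or 64 */
	unsigned int threshold;  /* soft refill target, e.g. 32 */
	unsigned int size;       /* objects currently cached */
};

/*
 * Pretend refill: @avail is how many objects the slow path managed
 * to grab this time.  Fill up to capacity, but report success as
 * soon as the threshold is reached, so a partially filled sheaf
 * does not count as a failed refill.
 */
static int toy_refill(struct toy_sheaf *s, unsigned int avail)
{
	unsigned int want = s->capacity - s->size;
	unsigned int got = avail < want ? avail : want;

	s->size += got;
	return s->size >= s->threshold ? 0 : -1;
}
```

The extra headroom between threshold and capacity is what would absorb
leftover objects instead of spilling them back to the node.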
One thing I'm wondering about is how we determine if an optimization is truly
worth doing.
Micro-optimizations like this rarely have purely positive effects; they often
come with fluctuations or regressions in other metrics, such as more frequent
slab allocations and frees, which add pressure to the buddy system.
So this kind of ties our hands a bit :/
>
> > > > In my testing, the results were not very encouraging: it seems hard
> > > > to observe improvement, and in most cases it ended up causing a performance
> > > > regression.
> > > >
> > > > My impression is that it could be difficult to prevent leftovers proactively.
> > > > It may be easier to deal with them after they appear.
> > >
> > > Either way doesn't work if the slab order is too high...
> > >
> > > IIRC using higher slab order used to have some benefit
> > > but now that we have sheaves, it probably doesn't make sense anymore
> > > to have oo_objects(s->oo) > s->sheaf_capacity?
> >
> > Do you mean considering making the capacity of each sheaf larger than
> > oo_objects?
>
> I mean the other way around. calculate_order() tends to increase slab
> order with higher number of CPUs (by setting higher `min_objects`),
> but is it still worth having oo_objects higher than the sheaf capacity?
Sorry, I just got confused for a second. Why is it a good thing for oo_objects
to be larger than the sheaf capacity? I didn't quite catch the logic behind
that...
--
Thanks,
Hao
Thread overview: 14+ messages
2026-04-10 11:16 Hao Li
2026-04-14 8:39 ` Harry Yoo (Oracle)
2026-04-14 9:59 ` Hao Li
2026-04-15 10:20 ` Harry Yoo (Oracle)
2026-04-16 7:58 ` Hao Li
2026-04-17 6:00 ` Harry Yoo (Oracle)
2026-04-20 11:40 ` Hao Li [this message]
2026-04-21 3:35 ` Harry Yoo (Oracle)
2026-04-16 8:13 ` Hao Li
2026-04-15 20:55 ` Vinicius Costa Gomes
2026-04-16 5:49 ` Hao Li
2026-04-17 8:18 ` Vlastimil Babka (SUSE)
2026-04-17 9:40 ` Harry Yoo (Oracle)
2026-04-20 3:18 ` Hao Li