linux-mm.kvack.org archive mirror
From: Hao Li <hao.li@linux.dev>
To: "Harry Yoo (Oracle)" <harry@kernel.org>
Cc: vbabka@kernel.org, akpm@linux-foundation.org, cl@gentwo.org,
	 rientjes@google.com, roman.gushchin@linux.dev,
	linux-mm@kvack.org,  linux-kernel@vger.kernel.org,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>
Subject: Re: [RFC PATCH] slub: spill refill leftover objects into percpu sheaves
Date: Mon, 20 Apr 2026 19:40:55 +0800
Message-ID: <ybffdlcvyls3cmen67b3ewno3vwdag6timnqjcoomipd2ei5sg@r3goxchyn3ib>
In-Reply-To: <aeHMfXldGQ0ANL-K@hyeyoo>

On Fri, Apr 17, 2026 at 03:00:29PM +0900, Harry Yoo (Oracle) wrote:
> On Thu, Apr 16, 2026 at 03:58:46PM +0800, Hao Li wrote:
> > On Wed, Apr 15, 2026 at 07:20:21PM +0900, Harry Yoo (Oracle) wrote:
> > > On Tue, Apr 14, 2026 at 05:59:48PM +0800, Hao Li wrote:
> > > > On Tue, Apr 14, 2026 at 05:39:40PM +0900, Harry Yoo (Oracle) wrote:
> > > > > On Fri, Apr 10, 2026 at 07:16:57PM +0800, Hao Li wrote:
> > > > > Where do you think the improvement comes from? (hopefully w/ some data)
> > > > 
> > > > Yes, this is necessary.
> > > > 
> > > > > e.g.:
> > > > >   1. the benefit is from largely or partly from
> > > > >      reduced contention on n->list_lock.
> > > > 
> > > > Before this patch is applied, the mmap benchmark shows the following hot path:
> > > > 
> > > > - 7.85% native_queued_spin_lock_slowpath
> > > >     - 7.85% _raw_spin_lock_irqsave
> > > >         - 3.69% __slab_free
> > > >             + 1.84% __refill_objects_node
> > > >             + 1.77% __kmem_cache_free_bulk
> > > >         + 3.27% __refill_objects_node
> > > > 
> > > > With the patch applied, the __refill_objects_node -> __slab_free hotspot goes
> > > > away, and the native_queued_spin_lock_slowpath drops to roughly 3.5%.
> > > 
> > > Sounds like returning slabs back indeed increases contention on slowpath.
> > 
> > Indeed!
> >
> > > > The
> > > > remaining lock contention is mostly between __refill_objects_node ->
> > > > add_partial and __kmem_cache_free_bulk -> __slab_free.
> > > > 
> > > > > 
> > > > >   2. this change reduces # of alloc slowpath at the cost of increased
> > > > >      of free slowpath hits, but that's better because the slowpath frees
> > > > >      are mostly lockless.
> > > > 
> > > > The alloc slowpath remains at 0 both w/ and w/o the patch, whereas the
> > > 
> > > (assuming you used SLUB_STATS for this)
> > 
> > Yes, I enabled it.
> > 
> > > That's weird, I think we should check SHEAF_REFILL instead of
> > > ALLOC_SLOWPATH.
> > 
> > Yes, I will compare each metric in later testing. Maybe we'll see more
> > clues.
> > 
> > > > free slowpath increases by 2x after applying the patch.
> > > 
> > > from which cache was this stat collected?
> > 
> > It's for /sys/kernel/slab/maple_node/
> 
> Ack. And you also mentioned (off-list) that kmem_cache_refill_sheaf()
> is not on the profile. That's good to know.
> 
> > > > >   3. the alloc/free pattern of the workload is benefiting from
> > > > >      spilling objects to the CPU's sheaves.
> > > > > 
> > > > > or something else?
> > > > 
> > > > The 2-5% throughput improvement does seem to come with some trade-offs.
> > > > The main one is that leftover objects get hidden in the percpu sheaves now,
> > > > which reduces the objects on the node partial list and thus indirectly
> > > > increases slab alloc/free frequency to about 4x of the baseline.
> > > > 
> > > > This is a drawback of the current approach. :/
> > > 
> > > Sounds like s->min_partial is too small now that we cache more objects
> > > per CPU.
> > 
> > Exactly. For the mmap test case, the slab partial list keeps thrashing. It
> > makes me wonder whether SLUB might handle transient pressure better if empty
> > slabs could be regulated with a "dynamic burst threshold".
> 
> Haha, we'll be constantly challenged to find balance between "sacrifice
> memory to make every benchmark happy" vs. "provide reasonable
> scalability in general but let users tune it themselves". 
> 
> If we could implement a reasonably simple yet effective automatic tuning
> method, having one in the kernel would be nice (though of course having
> it userspace would be the best).

Yes, I'm increasingly getting the feeling that SLUB's flow is so tight and
simple that a one-size-fits-all optimization is very hard.
It might be better to just export some parameters to userspace and let users
tune them.
After all, introducing an auto-tuning mechanism into a core allocator like SLUB
might make it as unpredictable and hard to control as memory reclaim. :P

> 
> > > /me wonders if increasing sheaf capacity would make more sense
> > > rather than optimizing slowpath (if it comes with increased memory
> > > usage anyway),
> > 
> > Yes, finding ways to avoid falling onto the slowpath is also very worthwhile.
> 
> Could you please take a look at how much changing 1) sheaf capacity and
> 2) nr of full/empty sheaves at the barn affects the performance of
> mmap / ublk performance?
> 
> I've been trying to reproduce the regression on my machine but haven't
> had much success so far :(
> 
> (I'll try to post the RFC patchset to allow changing those parameters
> at runtime in a few weeks, but if you're eager you could try experimenting
> by changing the code :D)

Sure thing. I'll run some tests and organize the data.

>  
> > > but then stares at his (yet) unfinished patch series...
> > > 
> > > > I experimented with several alternative ideas, and the pattern seems fairly
> > > > consistent: as soon as leftover objects are hidden at the percpu level, slab
> > > > alloc/free churn tends to go up.
> > > > 
> > > > > > Signed-off-by: Hao Li <hao.li@linux.dev>
> > > > > > ---
> > > > > > 
> > > > > > This patch is an exploratory attempt to address the leftover objects and
> > > > > > partial slab issues in the refill path, and it is marked as RFC to warmly
> > > > > > welcome any feedback, suggestions, and discussion!
> > > > > 
> > > > > Yeah, let's discuss!
> > > > 
> > > > Sure! Thanks for the discussion!
> > > > 
> > > > > 
> > > > > By the way, have you also been considering having min-max capacity
> > > > > for sheaves? (that I think Vlastimil suggested somewhere)
> > > > 
> > > > Yes, I also tried it.
> > > > 
> > > > I experimented with using a manually chosen threshold to allow refill to leave
> > > > the sheaf in a partially filled state. However, since concurrent frees are
> > > > inherently unpredictable, this can only reduce the probability of
> > > > generating leftover objects,
> > > 
> > > If concurrent frees are a problem we could probably grab slab->freelist
> > > under n->list_lock (e.g. keep them at the end of the sheaf) and fill the
> > > sheaf outside the lock to avoid grabbing too many objects.
> > 
> > Do you mean doing an on-list bulk allocation?
> 
> Just brainstorming... it's quite messy :)
> something like
> 
> __refill_objects_node(s, p, gfp, min, max, n, allow_spin) {
> 	// in practice we don't know how many slabs we'll grab.
> 	// so probably keep them somewhere e.g.) the end of `p` array?
> 	void *freelists[min];
> 	nr_freelists = 0;
> 	nr_objs = 0;
> 
> 	spin_lock_irqsave();
> 	for each slab in n->partial {
> 		do {
> 			[...]
> 			old.freelist = slab->freelist;
> 			[...]
> 		} while (!__slab_update_freelist(...));
> 
> 		freelists[nr_freelists++] = old.freelist;
> 		nr_objs += (old.objects - old.inuse);
> 		if (!new.inuse)
> 			remove_partial();
> 		if (nr_objs >= min)
> 			break;
> 	}
> 	spin_unlock_irqrestore();
> 
> 	i = 0;
> 	j = 0;
> 	while (i < nr_freelists) {
> 		freelist = freelists[i++];
> 		while (freelist != NULL) {
> 			if (j == max) {
> 				// free remaining objects and stop
> 				break;
> 			}
> 			next = get_freepointer(s, freelist);
> 			p[j++] = freelist;
> 			freelist = next;
> 		}
> 	}
> }
> 
> This way, we know how many objects we grabbed but yeah it's tricky.

Thanks for brainstorming this.
If we do an atomic operation like __slab_update_freelist under the lock, I'm
worried it might prolong the critical section. But testing is the best way to
know for sure. I'm quite curious, so it's definitely worth a try.

> 
> > > > while at the same time affecting alloc-side throughput.
> > > 
> > > Shouldn't we set sheaf's min capacity as the same as
> > > s->sheaf_capacity and allow a higher max capacity to avoid this?
> > 
> > I'm not sure I fully understand this. Since the array size is fixed, how would
> > we allow more entries to be filled?
> 
> I don't really want to speak on behalf of Vlastimil but I was imagining
> something like:
> 
> before: sheaf->capacity (32, min = max); 
> after: sheaf->capacity (48 or 64, max), sheaf->threshold (32, min)
> 
> so that sheaf refill will succeed if at least ->threshold objects
> are filled, but the threshold better not be smaller than 32 (the
> previous sheaf->capacity)?

I feel that simply increasing the sheaf capacity should generally improve
overall performance, although we'd likely see more slab alloc/free churn
compared to the baseline.
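
To check that I understand the min/max semantics, here is a toy userspace
sketch (all names and numbers are illustrative, not from the sheaves series):
refill takes up to the hard capacity, and counts as successful once the soft
threshold is reached, so a short grab from the partial list no longer has to
produce leftovers.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative userspace model of a sheaf with a soft threshold (min)
 * below a hard capacity (max); not the actual sheaves data structure. */
struct toy_sheaf {
	size_t size;		/* objects currently cached */
	size_t capacity;	/* hard max, e.g. 48 or 64 */
	size_t threshold;	/* e.g. 32, the old fixed capacity */
};

/* Take up to (capacity - size) objects from a pretend partial list
 * holding *avail free objects; return how many were taken. */
static size_t toy_refill(struct toy_sheaf *s, size_t *avail)
{
	size_t want = s->capacity - s->size;
	size_t take = *avail < want ? *avail : want;

	s->size += take;
	*avail -= take;
	return take;
}

/* Refill counts as a success once the soft threshold is reached, so a
 * short grab doesn't have to push leftovers back to the node list. */
static int toy_refill_ok(const struct toy_sheaf *s)
{
	return s->size >= s->threshold;
}
```

In this model a grab of 40 objects into a capacity-48 sheaf succeeds with no
leftovers, whereas today's fixed capacity of 32 would strand 8 objects.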

One thing I'm wondering about is how we determine if an optimization is truly
worth doing.

Micro-optimizations like this rarely have purely positive effects; they often
come with fluctuations or regressions in other metrics, such as more frequent
slab allocations and frees, which adds pressure to the buddy system.
So this kind of ties our hands a bit :/

> 
> > > > In my testing, the results were not very encouraging: it seems hard
> > > > to observe improvement, and in most cases it ended up causing a performance
> > > > regression.
> > > > 
> > > > My impression is that it could be difficult to prevent leftovers
> > > > proactively. It may be easier to deal with them after they appear.
> > > 
> > > Either way doesn't work if the slab order is too high...
> > > 
> > > IIRC using higher slab order used to have some benefit
> > > but now that we have sheaves, it probably doesn't make sense anymore
> > > to have oo_objects(s->oo) > s->sheaf_capacity?
> > 
> > Do you mean considering making the capacity of each sheaf larger than
> > oo_objects?
> 
> I mean the other way around. calculate_order() tends to increase slab
> order with higher number of CPUs (by setting higher `min_objects`),
> but is it still worth having oo_objects higher than the sheaf capacity?

Sorry, I just got confused for a second. Why is it a good thing for oo_objects
to be larger than the sheaf capacity? I didn't quite catch the logic behind
that...

-- 
Thanks,
Hao

