From: "Harry Yoo (Oracle)" <harry@kernel.org>
To: Hao Li <hao.li@linux.dev>
Cc: Vinicius Costa Gomes <vinicius.gomes@intel.com>,
vbabka@kernel.org, akpm@linux-foundation.org, cl@gentwo.org,
rientjes@google.com, roman.gushchin@linux.dev,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH] slub: spill refill leftover objects into percpu sheaves
Date: Fri, 17 Apr 2026 18:40:08 +0900 [thread overview]
Message-ID: <aeH_-EYzI3IkeOv4@hyeyoo> (raw)
In-Reply-To: <wamsxmuhcbr6hj5y53bheub4cmkq7pd3psntl66o74k4bmtygi@qhyse67jvr3f>
On Thu, Apr 16, 2026 at 01:49:01PM +0800, Hao Li wrote:
> On Wed, Apr 15, 2026 at 01:55:54PM -0700, Vinicius Costa Gomes wrote:
> > Hao Li <hao.li@linux.dev> writes:
> >
> > > When performing objects refill, we tend to optimistically assume that
> > > there will be more allocation requests coming next; this is the
> > > fundamental assumption behind this optimization.
> > >
> > > When __refill_objects_node() isolates a partial slab and satisfies a
> > > bulk allocation from its freelist, the slab can still have a small tail
> > > of free objects left over. Today those objects are freed back to the
> > > slab immediately.
> > >
> > > If the leftover tail is local and small enough to fit, keep it in the
> > > current CPU's sheaves instead. This avoids pushing those objects back
> > > through the __slab_free slowpath.
> > >
> > > Add a helper to obtain both the freelist and its free-object count, and
> > > then spill the remaining objects into a percpu sheaf when:
> > > - the tail fits in a sheaf
> > > - the slab is local to the current CPU
> > > - the slab is not pfmemalloc
> > > - the target sheaf has enough free space
> > >
> > > Otherwise keep the existing fallback and free the tail back to the slab.
> > >
> > > Also add a SHEAF_SPILL stat so the new path can be observed in SLUB
> > > stats.
> > >
> > > On the mmap2 case in the will-it-scale benchmark suite, this patch can
> > > improve performance by about 2~5%.
> > >
> > > Signed-off-by: Hao Li <hao.li@linux.dev>
> > > ---
> > >
> > > This patch is an exploratory attempt to address the leftover objects and
> > > partial slab issues in the refill path, and it is marked as RFC to warmly
> > > welcome any feedback, suggestions, and discussion!
> > >
> >
> > I was also looking at these regressions, but I went from a different
> > direction, and ended up with 3 patches:
> >
> > 1. the regressions showed a lot of increase in the cache misses,
> > which gave me the idea that a cache would help (and it seemed to help)
I really appreciate you looking into the performance change, but I think
we should first try fixing existing corner cases and/or tuning the
existing parameters (s->sheaf_capacity, MAX_{FULL,EMPTY}_SHEAVES,
s->min_partial, and s->remote_node_defrag_ratio) before making such
design changes. Exploring a design change too soon, without fully
exploring the limitations of the current design, isn't worth the effort.
In the first patch description:
| When the sheaf allocator needs to refill from the node partial list, it
| calls __refill_objects_node() which walks the freelist of a cold slab
| page — one that has not been in any CPU's cache since it was last freed.
| On NUMA systems with many concurrent threads, the majority of these walks
| hit remote DRAM, causing a significant increase in LLC misses.
IIUC you're arguing that iterating over slab->freelist just to return
the slab back to the list unnecessarily results in a higher cache
footprint, right? (and even worse, those slabs are from remote nodes)
(unlike Hao, who argued it's more of an n->list_lock contention thing)
In __refill_objects_node():
| __refill_objects_node(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
| unsigned int max, struct kmem_cache_node *n,
| bool allow_spin)
| {
| struct partial_bulk_context pc;
| struct slab *slab, *slab2;
| unsigned int refilled = 0;
| unsigned long flags;
| void *object;
|
| pc.flags = gfp;
| pc.min_objects = min;
| pc.max_objects = max;
|
| if (!get_partial_node_bulk(s, n, &pc, allow_spin))
| return 0;
|
| list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
|
| list_del(&slab->slab_list);
|
| object = get_freelist_nofreeze(s, slab);
|
| while (object && refilled < max) {
| p[refilled] = object;
| object = get_freepointer(s, object);
| maybe_wipe_obj_freeptr(s, p[refilled]);
|
| refilled++;
| }
|
| /*
| * Freelist had more objects than we can accommodate, we need to
| * free them back. We can treat it like a detached freelist, just
| * need to find the tail object.
| */
| if (unlikely(object)) {
| void *head = object;
| void *tail;
| int cnt = 0;
|
| do {
| tail = object;
| cnt++;
| object = get_freepointer(s, object);
| } while (object);
So here we make the slab "warm" even though we're not going to use it,
just to get the tail object.
As Vlastimil suggested off-list, we could probably assume that nobody
has freed objects to the slab, try __slab_update_freelist()
(a kind of blind compare-and-swap) to avoid iterating over the freelist,
and then fall back if that fails.
| __slab_free(s, slab, head, tail, cnt, _RET_IP_);
| }
Back to the description:
| Add a per-CPU warm slab stash: a single (slab, freelist-head) pair stored
| in struct slub_percpu_sheaves.
Oh, calling it a "warm slab" is very misleading. Warming the slab in
__refill_objects_node() when it has more objects than
(sheaf->capacity - sheaf->size) is the current behavior
| When __refill_objects_node() drains a slab
| from the partial list but has excess objects (more than it needs for the
| current refill), it stashes the remainder instead of returning them to
| the partial list. On the next refill, drain_warm_slab() serves the
| stashed objects first, skipping the cold partial-list walk entirely.
and your patch changes that. It's not warm anymore.
> > 2. Allowing smaller refills (but potentially more frequent);
> >
> > 3. A cute (but with small impact) use of prefetch();
>
> Great!
> Thanks for sharing that info!
>
> > The numbers are here (the commentary from the bot are very hit or miss,
> > so don't pay too much attention to them):
> >
> > https://github.com/vcgomes/linux/commit/c898c39ee8def5252942281353eda6acdd83d4ea
> >
> > I am re-running the tests against a more recent tree, but if you
> > want to take a look:
> >
> > https://github.com/vcgomes/linux/tree/mm-sheaves-regression-timerfd
> >
> > Also, if you feel it's useful, I can send a RFC.
>
> I also tried stashing leftover objects into the PCS before, but at the time I
> observed that this could quickly drain the node partial list, which then led to
> slab alloc/free churn, and the end result was a performance regression. So I
> gave up this direction :/
>
> I took a quick look at the code and performance report in your GitHub repo, and
> the performance gains you showed there are really interesting to me!
> I'm going to try testing it on my own machine as well.
Comparing patch 1 with Hao's patch (spilling objects)... patch 1
doesn't touch those leftover objects at all (reduced cache footprint)
and doesn't need to spill them into sheaves either, while still hitting
the free slowpath less frequently.
--
Cheers,
Harry / Hyeonggon
Thread overview: 11+ messages
2026-04-10 11:16 Hao Li
2026-04-14 8:39 ` Harry Yoo (Oracle)
2026-04-14 9:59 ` Hao Li
2026-04-15 10:20 ` Harry Yoo (Oracle)
2026-04-16 7:58 ` Hao Li
2026-04-17 6:00 ` Harry Yoo (Oracle)
2026-04-16 8:13 ` Hao Li
2026-04-15 20:55 ` Vinicius Costa Gomes
2026-04-16 5:49 ` Hao Li
2026-04-17 8:18 ` Vlastimil Babka (SUSE)
2026-04-17 9:40 ` Harry Yoo (Oracle) [this message]