From: <hu.shengming@zte.com.cn>
To: <vbabka@kernel.org>
Cc: <harry@kernel.org>, <akpm@linux-foundation.org>,
<hao.li@linux.dev>, <cl@gentwo.org>, <rientjes@google.com>,
<roman.gushchin@linux.dev>, <linux-mm@kvack.org>,
<linux-kernel@vger.kernel.org>, <zhang.run@zte.com.cn>,
<xu.xin16@zte.com.cn>, <yang.tao172@zte.com.cn>,
<yang.yang29@zte.com.cn>
Subject: Re: [PATCH v6] mm/slub: defer freelist construction until after bulk allocation from a new slab
Date: Tue, 14 Apr 2026 21:40:50 +0800 (CST)
Message-ID: <20260414214050619-bhr4zGY4E0xLWc0olEiI@zte.com.cn>
In-Reply-To: <411f9b29-9b1c-4e2f-a212-f583280a6788@kernel.org>
Vlastimil wrote:
> On 4/13/26 17:04, hu.shengming@zte.com.cn wrote:
> > From: Shengming Hu <hu.shengming@zte.com.cn>
> >
> > Allocations from a fresh slab can consume all of its objects, and the
> > freelist built during slab allocation is discarded immediately as a result.
> >
> > Instead of special-casing the whole-slab bulk refill case, defer freelist
> > construction until after objects are emitted from a fresh slab.
> > new_slab() now only allocates the slab and initializes its metadata.
> > refill_objects() then obtains a fresh slab and lets alloc_from_new_slab()
> > emit objects directly, building a freelist only for the objects left
> > unallocated; the same change is applied to alloc_single_from_new_slab().
> >
> > To keep CONFIG_SLAB_FREELIST_RANDOM=y/n on the same path, introduce a
> > small iterator abstraction for walking free objects in allocation order.
> > The iterator is used both for filling the sheaf and for building the
> > freelist of the remaining objects.
> >
> > Also mark setup_object() inline. After this optimization, the compiler no
> > longer consistently inlines this helper in the hot path, which can hurt
> > performance. Explicitly marking it inline restores the expected code
> > generation.
> >
> > This reduces per-object overhead when allocating from a fresh slab.
> > The most direct benefit is in the paths that allocate objects first and
> > only build a freelist for the remainder afterward: bulk allocation from
> > a new slab in refill_objects(), single-object allocation from a new slab
> > in ___slab_alloc(), and the corresponding early-boot paths that now use
> > the same deferred-freelist scheme. Since refill_objects() is also used to
> > refill sheaves, the optimization is not limited to the small set of
> > kmem_cache_alloc_bulk()/kmem_cache_free_bulk() users; regular allocation
> > workloads may benefit as well when they refill from a fresh slab.
> >
> > In slub_bulk_bench, the time per object drops by about 32% to 70% with
> > CONFIG_SLAB_FREELIST_RANDOM=n, and by about 50% to 67% with
> > CONFIG_SLAB_FREELIST_RANDOM=y. This benchmark is intended to isolate the
> > cost removed by this change: each iteration allocates exactly
> > slab->objects from a fresh slab. That makes it a near best-case scenario
> > for deferred freelist construction, because the old path still built a
> > full freelist even when no objects remained, while the new path avoids
> > that work. Realistic workloads may see smaller end-to-end gains depending
> > on how often allocations reach this fresh-slab refill path.
> >
> > Benchmark results (slub_bulk_bench):
> > Machine: qemu-system-x86 -m 1024M -smp 8 -enable-kvm -cpu host
> > Kernel: Linux 7.0.0-rc7-next-20260407
> > Config: x86_64_defconfig
> > Cpu: 0
> > Rounds: 20
> > Total: 256MB
> >
> > - CONFIG_SLAB_FREELIST_RANDOM=n -
> >
> > obj_size=16, batch=256:
> > before: 4.91 +- 0.07 ns/object
> > after: 3.29 +- 0.03 ns/object
> > delta: -32.8%
> >
> > obj_size=32, batch=128:
> > before: 6.96 +- 0.07 ns/object
> > after: 3.73 +- 0.05 ns/object
> > delta: -46.4%
> >
> > obj_size=64, batch=64:
> > before: 10.77 +- 0.12 ns/object
> > after: 4.65 +- 0.06 ns/object
> > delta: -56.8%
> >
> > obj_size=128, batch=32:
> > before: 19.04 +- 0.22 ns/object
> > after: 6.30 +- 0.07 ns/object
> > delta: -66.9%
> >
> > obj_size=256, batch=32:
> > before: 22.20 +- 0.26 ns/object
> > after: 6.68 +- 0.06 ns/object
> > delta: -69.9%
> >
> > obj_size=512, batch=32:
> > before: 20.03 +- 0.62 ns/object
> > after: 6.83 +- 0.09 ns/object
> > delta: -65.9%
> >
> > - CONFIG_SLAB_FREELIST_RANDOM=y -
> >
> > obj_size=16, batch=256:
> > before: 8.72 +- 0.06 ns/object
> > after: 4.31 +- 0.05 ns/object
> > delta: -50.5%
> >
> > obj_size=32, batch=128:
> > before: 11.29 +- 0.13 ns/object
> > after: 4.93 +- 0.05 ns/object
> > delta: -56.3%
> >
> > obj_size=64, batch=64:
> > before: 15.36 +- 0.24 ns/object
> > after: 5.95 +- 0.10 ns/object
> > delta: -61.3%
> >
> > obj_size=128, batch=32:
> > before: 21.75 +- 0.26 ns/object
> > after: 8.10 +- 0.14 ns/object
> > delta: -62.8%
> >
> > obj_size=256, batch=32:
> > before: 26.62 +- 0.26 ns/object
> > after: 8.58 +- 0.22 ns/object
> > delta: -67.8%
> >
> > obj_size=512, batch=32:
> > before: 26.88 +- 0.36 ns/object
> > after: 8.81 +- 0.11 ns/object
> > delta: -67.2%
> >
> > Link: https://github.com/HSM6236/slub_bulk_test.git
> > Suggested-by: Harry Yoo (Oracle) <harry@kernel.org>
> > Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
> > Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn>
>
> Nice!
>
> > mm/slab.h | 10 ++
> > mm/slub.c | 283 +++++++++++++++++++++++++++---------------------------
> > 2 files changed, 151 insertions(+), 142 deletions(-)
>
> And the delta is just 19 new lines of code, good!
>
> Just some nits.
>
Hi Vlastimil,
Thanks for the review.
>
> > diff --git a/mm/slab.h b/mm/slab.h
> > index bf2f87acf5e3..ada3f9c3909f 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -91,6 +91,16 @@ struct slab {
> > #endif
> > };
> >
> > +struct slab_obj_iter {
> > + unsigned long pos;
> > + void *start;
> > +#ifdef CONFIG_SLAB_FREELIST_RANDOM
> > + unsigned long freelist_count;
> > + unsigned long page_limit;
> > + bool random;
> > +#endif
> > +};
>
> I think this struct could live in slub.c as nothing else needs it?
>
Agreed. I'll move struct slab_obj_iter into mm/slub.c.
> > /*
> > * Called only for kmem_cache_debug() caches to allocate from a freshly
> > * allocated slab. Allocate a single object instead of whole freelist
> > * and put the slab to the partial (or full) list.
> > */
> > static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
> > - int orig_size, gfp_t gfpflags)
> > + int orig_size, bool allow_spin)
> > {
> > - bool allow_spin = gfpflags_allow_spinning(gfpflags);
> > - int nid = slab_nid(slab);
> > - struct kmem_cache_node *n = get_node(s, nid);
> > + struct kmem_cache_node *n;
> > + struct slab_obj_iter iter;
> > + bool needs_add_partial;
> > unsigned long flags;
> > void *object;
> >
> > - if (!allow_spin && !spin_trylock_irqsave(&n->list_lock, flags)) {
> > - /* Unlucky, discard newly allocated slab. */
> > - free_new_slab_nolock(s, slab);
> > - return NULL;
> > - }
> > -
> > - object = slab->freelist;
> > - slab->freelist = get_freepointer(s, object);
> > + init_slab_obj_iter(s, slab, &iter, allow_spin);
> > + object = next_slab_obj(s, &iter);
> > slab->inuse = 1;
> >
> > + needs_add_partial = (slab->objects > 1);
> > + build_slab_freelist(s, slab, &iter);
> > +
> > if (!alloc_debug_processing(s, slab, object, orig_size)) {
> > /*
> > * It's not really expected that this would fail on a
> > @@ -3696,20 +3686,32 @@ static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
> > * corruption in theory could cause that.
> > * Leak memory of allocated slab.
> > */
> > - if (!allow_spin)
> > - spin_unlock_irqrestore(&n->list_lock, flags);
> > return NULL;
> > }
> >
> > - if (allow_spin)
> > + n = get_node(s, slab_nid(slab));
> > + if (allow_spin) {
> > spin_lock_irqsave(&n->list_lock, flags);
> > + } else if (!spin_trylock_irqsave(&n->list_lock, flags)) {
> > + /*
> > + * Unlucky, discard newly allocated slab.
> > + * The slab is not fully free, but it's fine as
> > + * objects are not allocated to users.
> > + */
> > + free_new_slab_nolock(s, slab);
>
> I was going to complain that we can leave alloc_debug_processing() without a
> corresponding free_debug_processing(). But it seems it can't have any bad
> effect. Only with SLAB_TRACE we would print alloc without corresponding
> free. But I doubt anyone uses it anyway.
>
Agreed. This should not cause any functional issues, and SLAB_TRACE is
rarely enabled, so I think it is fine as-is. Please let me know
if you have any other comments.
> > + return NULL;
> > + }
> >
> > - if (slab->inuse == slab->objects)
> > - add_full(s, n, slab);
> > - else
> > + if (needs_add_partial)
> > add_partial(n, slab, ADD_TO_HEAD);
> > + else
> > + add_full(s, n, slab);
> >
> > - inc_slabs_node(s, nid, slab->objects);
> > + /*
> > + * Debug caches require nr_slabs updates under n->list_lock so validation
> > + * cannot race with slab (de)allocations and observe inconsistent state.
> > + */
> > + inc_slabs_node(s, slab_nid(slab), slab->objects);
> > spin_unlock_irqrestore(&n->list_lock, flags);
> >
> > return object;
> > @@ -4349,9 +4351,10 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
> > {
> > unsigned int allocated = 0;
> > struct kmem_cache_node *n;
> > + struct slab_obj_iter iter;
> > bool needs_add_partial;
> > unsigned long flags;
> > - void *object;
> > + unsigned int target_inuse;
> >
> > /*
> > * Are we going to put the slab on the partial list?
> > @@ -4359,33 +4362,30 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
> > */
> > needs_add_partial = (slab->objects > count);
>
> How about
>
> bool needs_add_partial = true;
>
> ...
>
> if (count >= slab->objects) {
> needs_add_partial = false;
> count = slab->objects;
> }
>
> Then we don't need target_inuse and can use count.
>
Good suggestion. I'll reuse count and drop target_inuse.
> > - if (!allow_spin && needs_add_partial) {
> > + /* Target inuse count after allocating from this new slab. */
> > + target_inuse = needs_add_partial ? count : slab->objects;
> >
> > - n = get_node(s, slab_nid(slab));
> > + init_slab_obj_iter(s, slab, &iter, allow_spin);
> >
> > - if (!spin_trylock_irqsave(&n->list_lock, flags)) {
> > - /* Unlucky, discard newly allocated slab */
> > - free_new_slab_nolock(s, slab);
> > - return 0;
> > - }
> > - }
> > -
> > - object = slab->freelist;
> > - while (object && allocated < count) {
> > - p[allocated] = object;
> > - object = get_freepointer(s, object);
> > - maybe_wipe_obj_freeptr(s, p[allocated]);
> > -
> > - slab->inuse++;
> > + while (allocated < target_inuse) {
> > + p[allocated] = next_slab_obj(s, &iter);
> > allocated++;
> > }
> > - slab->freelist = object;
> > + slab->inuse = target_inuse;
> > + build_slab_freelist(s, slab, &iter);
> >
> > if (needs_add_partial) {
> > -
> > + n = get_node(s, slab_nid(slab));
>
> The declaration of 'n' could move here.
>
Right, I'll move it into that block.
--
With Best Regards,
Shengming
> > if (allow_spin) {
> > - n = get_node(s, slab_nid(slab));
> > spin_lock_irqsave(&n->list_lock, flags);
> > + } else if (!spin_trylock_irqsave(&n->list_lock, flags)) {
> > + /*
> > + * Unlucky, discard newly allocated slab.
> > + * The slab is not fully free, but it's fine as
> > + * objects are not allocated to users.
> > + */
> > + free_new_slab_nolock(s, slab);
>
> Yeah I think this is ok.
>
> > + return 0;
> > }
> > add_partial(n, slab, ADD_TO_HEAD);
> > spin_unlock_irqrestore(&n->list_lock, flags);
> > @@ -4456,15 +4456,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> > stat(s, ALLOC_SLAB);
> >
> > if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
> > - object = alloc_single_from_new_slab(s, slab, orig_size, gfpflags);
> > + object = alloc_single_from_new_slab(s, slab, orig_size, allow_spin);
> >
> > if (likely(object))
> > goto success;
> > } else {
> > - alloc_from_new_slab(s, slab, &object, 1, allow_spin);
> > -
> > /* we don't need to check SLAB_STORE_USER here */
> > - if (likely(object))
> > + if (alloc_from_new_slab(s, slab, &object, 1, allow_spin))
> > return object;
> > }
> >
> > @@ -7251,10 +7249,6 @@ refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> >
> > stat(s, ALLOC_SLAB);
> >
> > - /*
> > - * TODO: possible optimization - if we know we will consume the whole
> > - * slab we might skip creating the freelist?
> > - */
> > refilled += alloc_from_new_slab(s, slab, p + refilled, max - refilled,
> > /* allow_spin = */ true);
> >
> > @@ -7585,6 +7579,7 @@ static void early_kmem_cache_node_alloc(int node)
> > {
> > struct slab *slab;
> > struct kmem_cache_node *n;
> > + struct slab_obj_iter iter;
> >
> > BUG_ON(kmem_cache_node->size < sizeof(struct kmem_cache_node));
> >
> > @@ -7596,14 +7591,18 @@ static void early_kmem_cache_node_alloc(int node)
> > pr_err("SLUB: Allocating a useless per node structure in order to be able to continue\n");
> > }
> >
> > - n = slab->freelist;
> > + init_slab_obj_iter(kmem_cache_node, slab, &iter, true);
> > +
> > + n = next_slab_obj(kmem_cache_node, &iter);
> > BUG_ON(!n);
> > +
> > + slab->inuse = 1;
> > + build_slab_freelist(kmem_cache_node, slab, &iter);
> > +
> > #ifdef CONFIG_SLUB_DEBUG
> > init_object(kmem_cache_node, n, SLUB_RED_ACTIVE);
> > #endif
> > n = kasan_slab_alloc(kmem_cache_node, n, GFP_KERNEL, false);
> > - slab->freelist = get_freepointer(kmem_cache_node, n);
> > - slab->inuse = 1;
> > kmem_cache_node->per_node[node].node = n;
> > init_kmem_cache_node(n);
> > inc_slabs_node(kmem_cache_node, node, slab->objects);