From: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
To: hu.shengming@zte.com.cn, harry@kernel.org, akpm@linux-foundation.org
Cc: hao.li@linux.dev, cl@gentwo.org, rientjes@google.com,
roman.gushchin@linux.dev, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, zhang.run@zte.com.cn,
xu.xin16@zte.com.cn, yang.tao172@zte.com.cn,
yang.yang29@zte.com.cn
Subject: Re: [PATCH v6] mm/slub: defer freelist construction until after bulk allocation from a new slab
Date: Tue, 14 Apr 2026 14:50:47 +0200
Message-ID: <411f9b29-9b1c-4e2f-a212-f583280a6788@kernel.org>
In-Reply-To: <20260413230417835rnfiEWduEx44lc3u4uR9_@zte.com.cn>
On 4/13/26 17:04, hu.shengming@zte.com.cn wrote:
> From: Shengming Hu <hu.shengming@zte.com.cn>
>
> Allocations from a fresh slab can consume all of its objects, and the
> freelist built during slab allocation is discarded immediately as a result.
>
> Instead of special-casing the whole-slab bulk refill case, defer freelist
> construction until after objects are emitted from a fresh slab.
> new_slab() now only allocates the slab and initializes its metadata.
> refill_objects() then obtains a fresh slab and lets alloc_from_new_slab()
> emit objects directly, building a freelist only for the objects left
> unallocated; the same change is applied to alloc_single_from_new_slab().
>
> To keep CONFIG_SLAB_FREELIST_RANDOM=y/n on the same path, introduce a
> small iterator abstraction for walking free objects in allocation order.
> The iterator is used both for filling the sheaf and for building the
> freelist of the remaining objects.
>
> Also mark setup_object() inline. After this optimization, the compiler no
> longer consistently inlines this helper in the hot path, which can hurt
> performance. Explicitly marking it inline restores the expected code
> generation.
>
> This reduces per-object overhead when allocating from a fresh slab.
> The most direct benefit is in the paths that allocate objects first and
> only build a freelist for the remainder afterward: bulk allocation from
> a new slab in refill_objects(), single-object allocation from a new slab
> in ___slab_alloc(), and the corresponding early-boot paths that now use
> the same deferred-freelist scheme. Since refill_objects() is also used to
> refill sheaves, the optimization is not limited to the small set of
> kmem_cache_alloc_bulk()/kmem_cache_free_bulk() users; regular allocation
> workloads may benefit as well when they refill from a fresh slab.
>
> In slub_bulk_bench, the time per object drops by about 32% to 70% with
> CONFIG_SLAB_FREELIST_RANDOM=n, and by about 50% to 67% with
> CONFIG_SLAB_FREELIST_RANDOM=y. This benchmark is intended to isolate the
> cost removed by this change: each iteration allocates exactly
> slab->objects from a fresh slab. That makes it a near best-case scenario
> for deferred freelist construction, because the old path still built a
> full freelist even when no objects remained, while the new path avoids
> that work. Realistic workloads may see smaller end-to-end gains depending
> on how often allocations reach this fresh-slab refill path.
>
> Benchmark results (slub_bulk_bench):
> Machine: qemu-system-x86 -m 1024M -smp 8 -enable-kvm -cpu host
> Kernel: Linux 7.0.0-rc7-next-20260407
> Config: x86_64_defconfig
> Cpu: 0
> Rounds: 20
> Total: 256MB
>
> - CONFIG_SLAB_FREELIST_RANDOM=n -
>
> obj_size=16, batch=256:
> before: 4.91 +- 0.07 ns/object
> after: 3.29 +- 0.03 ns/object
> delta: -32.8%
>
> obj_size=32, batch=128:
> before: 6.96 +- 0.07 ns/object
> after: 3.73 +- 0.05 ns/object
> delta: -46.4%
>
> obj_size=64, batch=64:
> before: 10.77 +- 0.12 ns/object
> after: 4.65 +- 0.06 ns/object
> delta: -56.8%
>
> obj_size=128, batch=32:
> before: 19.04 +- 0.22 ns/object
> after: 6.30 +- 0.07 ns/object
> delta: -66.9%
>
> obj_size=256, batch=32:
> before: 22.20 +- 0.26 ns/object
> after: 6.68 +- 0.06 ns/object
> delta: -69.9%
>
> obj_size=512, batch=32:
> before: 20.03 +- 0.62 ns/object
> after: 6.83 +- 0.09 ns/object
> delta: -65.9%
>
> - CONFIG_SLAB_FREELIST_RANDOM=y -
>
> obj_size=16, batch=256:
> before: 8.72 +- 0.06 ns/object
> after: 4.31 +- 0.05 ns/object
> delta: -50.5%
>
> obj_size=32, batch=128:
> before: 11.29 +- 0.13 ns/object
> after: 4.93 +- 0.05 ns/object
> delta: -56.3%
>
> obj_size=64, batch=64:
> before: 15.36 +- 0.24 ns/object
> after: 5.95 +- 0.10 ns/object
> delta: -61.3%
>
> obj_size=128, batch=32:
> before: 21.75 +- 0.26 ns/object
> after: 8.10 +- 0.14 ns/object
> delta: -62.8%
>
> obj_size=256, batch=32:
> before: 26.62 +- 0.26 ns/object
> after: 8.58 +- 0.22 ns/object
> delta: -67.8%
>
> obj_size=512, batch=32:
> before: 26.88 +- 0.36 ns/object
> after: 8.81 +- 0.11 ns/object
> delta: -67.2%
>
> Link: https://github.com/HSM6236/slub_bulk_test.git
> Suggested-by: Harry Yoo (Oracle) <harry@kernel.org>
> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
> Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn>
Nice!
> mm/slab.h | 10 ++
> mm/slub.c | 283 +++++++++++++++++++++++++++---------------------------
> 2 files changed, 151 insertions(+), 142 deletions(-)
And the delta is just 19 new lines of code, good!
Just some nits.
> diff --git a/mm/slab.h b/mm/slab.h
> index bf2f87acf5e3..ada3f9c3909f 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -91,6 +91,16 @@ struct slab {
> #endif
> };
>
> +struct slab_obj_iter {
> + unsigned long pos;
> + void *start;
> +#ifdef CONFIG_SLAB_FREELIST_RANDOM
> + unsigned long freelist_count;
> + unsigned long page_limit;
> + bool random;
> +#endif
> +};
I think this struct could live in slub.c as nothing else needs it?
> /*
> * Called only for kmem_cache_debug() caches to allocate from a freshly
> * allocated slab. Allocate a single object instead of whole freelist
> * and put the slab to the partial (or full) list.
> */
> static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
> - int orig_size, gfp_t gfpflags)
> + int orig_size, bool allow_spin)
> {
> - bool allow_spin = gfpflags_allow_spinning(gfpflags);
> - int nid = slab_nid(slab);
> - struct kmem_cache_node *n = get_node(s, nid);
> + struct kmem_cache_node *n;
> + struct slab_obj_iter iter;
> + bool needs_add_partial;
> unsigned long flags;
> void *object;
>
> - if (!allow_spin && !spin_trylock_irqsave(&n->list_lock, flags)) {
> - /* Unlucky, discard newly allocated slab. */
> - free_new_slab_nolock(s, slab);
> - return NULL;
> - }
> -
> - object = slab->freelist;
> - slab->freelist = get_freepointer(s, object);
> + init_slab_obj_iter(s, slab, &iter, allow_spin);
> + object = next_slab_obj(s, &iter);
> slab->inuse = 1;
>
> + needs_add_partial = (slab->objects > 1);
> + build_slab_freelist(s, slab, &iter);
> +
> if (!alloc_debug_processing(s, slab, object, orig_size)) {
> /*
> * It's not really expected that this would fail on a
> @@ -3696,20 +3686,32 @@ static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
> * corruption in theory could cause that.
> * Leak memory of allocated slab.
> */
> - if (!allow_spin)
> - spin_unlock_irqrestore(&n->list_lock, flags);
> return NULL;
> }
>
> - if (allow_spin)
> + n = get_node(s, slab_nid(slab));
> + if (allow_spin) {
> spin_lock_irqsave(&n->list_lock, flags);
> + } else if (!spin_trylock_irqsave(&n->list_lock, flags)) {
> + /*
> + * Unlucky, discard newly allocated slab.
> + * The slab is not fully free, but it's fine as
> + * objects are not allocated to users.
> + */
> + free_new_slab_nolock(s, slab);
I was going to complain that we can leave alloc_debug_processing() without a
corresponding free_debug_processing(). But it seems that can't have any bad
effect. Only with SLAB_TRACE would we print an alloc without a corresponding
free, and I doubt anyone uses that anyway.
> + return NULL;
> + }
>
> - if (slab->inuse == slab->objects)
> - add_full(s, n, slab);
> - else
> + if (needs_add_partial)
> add_partial(n, slab, ADD_TO_HEAD);
> + else
> + add_full(s, n, slab);
>
> - inc_slabs_node(s, nid, slab->objects);
> + /*
> + * Debug caches require nr_slabs updates under n->list_lock so validation
> + * cannot race with slab (de)allocations and observe inconsistent state.
> + */
> + inc_slabs_node(s, slab_nid(slab), slab->objects);
> spin_unlock_irqrestore(&n->list_lock, flags);
>
> return object;
> @@ -4349,9 +4351,10 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
> {
> unsigned int allocated = 0;
> struct kmem_cache_node *n;
> + struct slab_obj_iter iter;
> bool needs_add_partial;
> unsigned long flags;
> - void *object;
> + unsigned int target_inuse;
>
> /*
> * Are we going to put the slab on the partial list?
> @@ -4359,33 +4362,30 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
> */
> needs_add_partial = (slab->objects > count);
How about:

	bool needs_add_partial = true;

	...

	if (count >= slab->objects) {
		needs_add_partial = false;
		count = slab->objects;
	}

Then we don't need target_inuse and can use count.
> - if (!allow_spin && needs_add_partial) {
> + /* Target inuse count after allocating from this new slab. */
> + target_inuse = needs_add_partial ? count : slab->objects;
>
> - n = get_node(s, slab_nid(slab));
> + init_slab_obj_iter(s, slab, &iter, allow_spin);
>
> - if (!spin_trylock_irqsave(&n->list_lock, flags)) {
> - /* Unlucky, discard newly allocated slab */
> - free_new_slab_nolock(s, slab);
> - return 0;
> - }
> - }
> -
> - object = slab->freelist;
> - while (object && allocated < count) {
> - p[allocated] = object;
> - object = get_freepointer(s, object);
> - maybe_wipe_obj_freeptr(s, p[allocated]);
> -
> - slab->inuse++;
> + while (allocated < target_inuse) {
> + p[allocated] = next_slab_obj(s, &iter);
> allocated++;
> }
> - slab->freelist = object;
> + slab->inuse = target_inuse;
> + build_slab_freelist(s, slab, &iter);
>
> if (needs_add_partial) {
> -
> + n = get_node(s, slab_nid(slab));
The declaration of 'n' could move here.
> if (allow_spin) {
> - n = get_node(s, slab_nid(slab));
> spin_lock_irqsave(&n->list_lock, flags);
> + } else if (!spin_trylock_irqsave(&n->list_lock, flags)) {
> + /*
> + * Unlucky, discard newly allocated slab.
> + * The slab is not fully free, but it's fine as
> + * objects are not allocated to users.
> + */
> + free_new_slab_nolock(s, slab);
Yeah I think this is ok.
> + return 0;
> }
> add_partial(n, slab, ADD_TO_HEAD);
> spin_unlock_irqrestore(&n->list_lock, flags);
> @@ -4456,15 +4456,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> stat(s, ALLOC_SLAB);
>
> if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
> - object = alloc_single_from_new_slab(s, slab, orig_size, gfpflags);
> + object = alloc_single_from_new_slab(s, slab, orig_size, allow_spin);
>
> if (likely(object))
> goto success;
> } else {
> - alloc_from_new_slab(s, slab, &object, 1, allow_spin);
> -
> /* we don't need to check SLAB_STORE_USER here */
> - if (likely(object))
> + if (alloc_from_new_slab(s, slab, &object, 1, allow_spin))
> return object;
> }
>
> @@ -7251,10 +7249,6 @@ refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
>
> stat(s, ALLOC_SLAB);
>
> - /*
> - * TODO: possible optimization - if we know we will consume the whole
> - * slab we might skip creating the freelist?
> - */
> refilled += alloc_from_new_slab(s, slab, p + refilled, max - refilled,
> /* allow_spin = */ true);
>
> @@ -7585,6 +7579,7 @@ static void early_kmem_cache_node_alloc(int node)
> {
> struct slab *slab;
> struct kmem_cache_node *n;
> + struct slab_obj_iter iter;
>
> BUG_ON(kmem_cache_node->size < sizeof(struct kmem_cache_node));
>
> @@ -7596,14 +7591,18 @@ static void early_kmem_cache_node_alloc(int node)
> pr_err("SLUB: Allocating a useless per node structure in order to be able to continue\n");
> }
>
> - n = slab->freelist;
> + init_slab_obj_iter(kmem_cache_node, slab, &iter, true);
> +
> + n = next_slab_obj(kmem_cache_node, &iter);
> BUG_ON(!n);
> +
> + slab->inuse = 1;
> + build_slab_freelist(kmem_cache_node, slab, &iter);
> +
> #ifdef CONFIG_SLUB_DEBUG
> init_object(kmem_cache_node, n, SLUB_RED_ACTIVE);
> #endif
> n = kasan_slab_alloc(kmem_cache_node, n, GFP_KERNEL, false);
> - slab->freelist = get_freepointer(kmem_cache_node, n);
> - slab->inuse = 1;
> kmem_cache_node->per_node[node].node = n;
> init_kmem_cache_node(n);
> inc_slabs_node(kmem_cache_node, node, slab->objects);
Thread overview: 5+ messages
2026-04-13 15:04 hu.shengming
2026-04-14 3:55 ` Hao Li
2026-04-14 8:35 ` hu.shengming
2026-04-14 12:50 ` Vlastimil Babka (SUSE) [this message]
2026-04-14 13:40 ` hu.shengming