linux-mm.kvack.org archive mirror
From: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
To: hu.shengming@zte.com.cn, harry@kernel.org, akpm@linux-foundation.org
Cc: hao.li@linux.dev, cl@gentwo.org, rientjes@google.com,
	roman.gushchin@linux.dev, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, zhang.run@zte.com.cn,
	xu.xin16@zte.com.cn, yang.tao172@zte.com.cn,
	yang.yang29@zte.com.cn
Subject: Re: [PATCH v6] mm/slub: defer freelist construction until after bulk allocation from a new slab
Date: Tue, 14 Apr 2026 14:50:47 +0200	[thread overview]
Message-ID: <411f9b29-9b1c-4e2f-a212-f583280a6788@kernel.org> (raw)
In-Reply-To: <20260413230417835rnfiEWduEx44lc3u4uR9_@zte.com.cn>

On 4/13/26 17:04, hu.shengming@zte.com.cn wrote:
> From: Shengming Hu <hu.shengming@zte.com.cn>
> 
> Allocations from a fresh slab can consume all of its objects, and the
> freelist built during slab allocation is discarded immediately as a result.
> 
> Instead of special-casing the whole-slab bulk refill case, defer freelist
> construction until after objects are emitted from a fresh slab.
> new_slab() now only allocates the slab and initializes its metadata.
> refill_objects() then obtains a fresh slab and lets alloc_from_new_slab()
> emit objects directly, building a freelist only for the objects left
> unallocated; the same change is applied to alloc_single_from_new_slab().
> 
> To keep CONFIG_SLAB_FREELIST_RANDOM=y/n on the same path, introduce a
> small iterator abstraction for walking free objects in allocation order.
> The iterator is used both for filling the sheaf and for building the
> freelist of the remaining objects.
> 
> Also mark setup_object() inline. After this optimization, the compiler no
> longer consistently inlines this helper in the hot path, which can hurt
> performance. Explicitly marking it inline restores the expected code
> generation.
> 
> This reduces per-object overhead when allocating from a fresh slab.
> The most direct benefit is in the paths that allocate objects first and
> only build a freelist for the remainder afterward: bulk allocation from
> a new slab in refill_objects(), single-object allocation from a new slab
> in ___slab_alloc(), and the corresponding early-boot paths that now use
> the same deferred-freelist scheme. Since refill_objects() is also used to
> refill sheaves, the optimization is not limited to the small set of
> kmem_cache_alloc_bulk()/kmem_cache_free_bulk() users; regular allocation
> workloads may benefit as well when they refill from a fresh slab.
> 
> In slub_bulk_bench, the time per object drops by about 32% to 70% with
> CONFIG_SLAB_FREELIST_RANDOM=n, and by about 50% to 67% with
> CONFIG_SLAB_FREELIST_RANDOM=y. This benchmark is intended to isolate the
> cost removed by this change: each iteration allocates exactly
> slab->objects from a fresh slab. That makes it a near best-case scenario
> for deferred freelist construction, because the old path still built a
> full freelist even when no objects remained, while the new path avoids
> that work. Realistic workloads may see smaller end-to-end gains depending
> on how often allocations reach this fresh-slab refill path.
> 
> Benchmark results (slub_bulk_bench):
> Machine: qemu-system-x86 -m 1024M -smp 8 -enable-kvm -cpu host
> Kernel: Linux 7.0.0-rc7-next-20260407
> Config: x86_64_defconfig
> Cpu: 0
> Rounds: 20
> Total: 256MB
> 
> - CONFIG_SLAB_FREELIST_RANDOM=n -
> 
> obj_size=16, batch=256:
> before: 4.91 +- 0.07 ns/object
> after: 3.29 +- 0.03 ns/object
> delta: -32.8%
> 
> obj_size=32, batch=128:
> before: 6.96 +- 0.07 ns/object
> after: 3.73 +- 0.05 ns/object
> delta: -46.4%
> 
> obj_size=64, batch=64:
> before: 10.77 +- 0.12 ns/object
> after: 4.65 +- 0.06 ns/object
> delta: -56.8%
> 
> obj_size=128, batch=32:
> before: 19.04 +- 0.22 ns/object
> after: 6.30 +- 0.07 ns/object
> delta: -66.9%
> 
> obj_size=256, batch=32:
> before: 22.20 +- 0.26 ns/object
> after: 6.68 +- 0.06 ns/object
> delta: -69.9%
> 
> obj_size=512, batch=32:
> before: 20.03 +- 0.62 ns/object
> after: 6.83 +- 0.09 ns/object
> delta: -65.9%
> 
> - CONFIG_SLAB_FREELIST_RANDOM=y -
> 
> obj_size=16, batch=256:
> before: 8.72 +- 0.06 ns/object
> after: 4.31 +- 0.05 ns/object
> delta: -50.5%
> 
> obj_size=32, batch=128:
> before: 11.29 +- 0.13 ns/object
> after: 4.93 +- 0.05 ns/object
> delta: -56.3%
> 
> obj_size=64, batch=64:
> before: 15.36 +- 0.24 ns/object
> after: 5.95 +- 0.10 ns/object
> delta: -61.3%
> 
> obj_size=128, batch=32:
> before: 21.75 +- 0.26 ns/object
> after: 8.10 +- 0.14 ns/object
> delta: -62.8%
> 
> obj_size=256, batch=32:
> before: 26.62 +- 0.26 ns/object
> after: 8.58 +- 0.22 ns/object
> delta: -67.8%
> 
> obj_size=512, batch=32:
> before: 26.88 +- 0.36 ns/object
> after: 8.81 +- 0.11 ns/object
> delta: -67.2%
> 
> Link: https://github.com/HSM6236/slub_bulk_test.git
> Suggested-by: Harry Yoo (Oracle) <harry@kernel.org>
> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
> Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn>

Nice!

>  mm/slab.h |  10 ++
>  mm/slub.c | 283 +++++++++++++++++++++++++++---------------------------
>  2 files changed, 151 insertions(+), 142 deletions(-)

And the delta is just 9 new lines of code, good!

Just some nits.


> diff --git a/mm/slab.h b/mm/slab.h
> index bf2f87acf5e3..ada3f9c3909f 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -91,6 +91,16 @@ struct slab {
>  #endif
>  };
> 
> +struct slab_obj_iter {
> +	unsigned long pos;
> +	void *start;
> +#ifdef CONFIG_SLAB_FREELIST_RANDOM
> +	unsigned long freelist_count;
> +	unsigned long page_limit;
> +	bool random;
> +#endif
> +};

I think this struct could live in slub.c, since nothing else needs it?

>  /*
>   * Called only for kmem_cache_debug() caches to allocate from a freshly
>   * allocated slab. Allocate a single object instead of whole freelist
>   * and put the slab to the partial (or full) list.
>   */
>  static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
> -					int orig_size, gfp_t gfpflags)
> +					int orig_size, bool allow_spin)
>  {
> -	bool allow_spin = gfpflags_allow_spinning(gfpflags);
> -	int nid = slab_nid(slab);
> -	struct kmem_cache_node *n = get_node(s, nid);
> +	struct kmem_cache_node *n;
> +	struct slab_obj_iter iter;
> +	bool needs_add_partial;
>  	unsigned long flags;
>  	void *object;
> 
> -	if (!allow_spin && !spin_trylock_irqsave(&n->list_lock, flags)) {
> -		/* Unlucky, discard newly allocated slab. */
> -		free_new_slab_nolock(s, slab);
> -		return NULL;
> -	}
> -
> -	object = slab->freelist;
> -	slab->freelist = get_freepointer(s, object);
> +	init_slab_obj_iter(s, slab, &iter, allow_spin);
> +	object = next_slab_obj(s, &iter);
>  	slab->inuse = 1;
> 
> +	needs_add_partial = (slab->objects > 1);
> +	build_slab_freelist(s, slab, &iter);
> +
>  	if (!alloc_debug_processing(s, slab, object, orig_size)) {
>  		/*
>  		 * It's not really expected that this would fail on a
> @@ -3696,20 +3686,32 @@ static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
>  		 * corruption in theory could cause that.
>  		 * Leak memory of allocated slab.
>  		 */
> -		if (!allow_spin)
> -			spin_unlock_irqrestore(&n->list_lock, flags);
>  		return NULL;
>  	}
> 
> -	if (allow_spin)
> +	n = get_node(s, slab_nid(slab));
> +	if (allow_spin) {
>  		spin_lock_irqsave(&n->list_lock, flags);
> +	} else if (!spin_trylock_irqsave(&n->list_lock, flags)) {
> +		/*
> +		 * Unlucky, discard newly allocated slab.
> +		 * The slab is not fully free, but it's fine as
> +		 * objects are not allocated to users.
> +		 */
> +		free_new_slab_nolock(s, slab);

I was going to complain that we can leave alloc_debug_processing() without a
corresponding free_debug_processing(). But it seems that can't have any bad
effect. Only with SLAB_TRACE would we print an alloc without a corresponding
free, and I doubt anyone uses that anyway.

> +		return NULL;
> +	}
> 
> -	if (slab->inuse == slab->objects)
> -		add_full(s, n, slab);
> -	else
> +	if (needs_add_partial)
>  		add_partial(n, slab, ADD_TO_HEAD);
> +	else
> +		add_full(s, n, slab);
> 
> -	inc_slabs_node(s, nid, slab->objects);
> +	/*
> +	 * Debug caches require nr_slabs updates under n->list_lock so validation
> +	 * cannot race with slab (de)allocations and observe inconsistent state.
> +	 */
> +	inc_slabs_node(s, slab_nid(slab), slab->objects);
>  	spin_unlock_irqrestore(&n->list_lock, flags);
> 
>  	return object;
> @@ -4349,9 +4351,10 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
>  {
>  	unsigned int allocated = 0;
>  	struct kmem_cache_node *n;
> +	struct slab_obj_iter iter;
>  	bool needs_add_partial;
>  	unsigned long flags;
> -	void *object;
> +	unsigned int target_inuse;
> 
>  	/*
>  	 * Are we going to put the slab on the partial list?
> @@ -4359,33 +4362,30 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
>  	 */
>  	needs_add_partial = (slab->objects > count);

How about

	bool needs_add_partial = true;

	...

	if (count >= slab->objects) {
		needs_add_partial = false;
		count = slab->objects;
	}

Then we don't need target_inuse and can use count.

> -	if (!allow_spin && needs_add_partial) {
> +	/* Target inuse count after allocating from this new slab. */
> +	target_inuse = needs_add_partial ? count : slab->objects;
> 
> -		n = get_node(s, slab_nid(slab));
> +	init_slab_obj_iter(s, slab, &iter, allow_spin);
> 
> -		if (!spin_trylock_irqsave(&n->list_lock, flags)) {
> -			/* Unlucky, discard newly allocated slab */
> -			free_new_slab_nolock(s, slab);
> -			return 0;
> -		}
> -	}
> -
> -	object = slab->freelist;
> -	while (object && allocated < count) {
> -		p[allocated] = object;
> -		object = get_freepointer(s, object);
> -		maybe_wipe_obj_freeptr(s, p[allocated]);
> -
> -		slab->inuse++;
> +	while (allocated < target_inuse) {
> +		p[allocated] = next_slab_obj(s, &iter);
>  		allocated++;
>  	}
> -	slab->freelist = object;
> +	slab->inuse = target_inuse;
> +	build_slab_freelist(s, slab, &iter);
> 
>  	if (needs_add_partial) {
> -
> +		n = get_node(s, slab_nid(slab));

The declaration of 'n' could move here.

>  		if (allow_spin) {
> -			n = get_node(s, slab_nid(slab));
>  			spin_lock_irqsave(&n->list_lock, flags);
> +		} else if (!spin_trylock_irqsave(&n->list_lock, flags)) {
> +			/*
> +			 * Unlucky, discard newly allocated slab.
> +			 * The slab is not fully free, but it's fine as
> +			 * objects are not allocated to users.
> +			 */
> +			free_new_slab_nolock(s, slab);

Yeah I think this is ok.

> +			return 0;
>  		}
>  		add_partial(n, slab, ADD_TO_HEAD);
>  		spin_unlock_irqrestore(&n->list_lock, flags);
> @@ -4456,15 +4456,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  	stat(s, ALLOC_SLAB);
> 
>  	if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
> -		object = alloc_single_from_new_slab(s, slab, orig_size, gfpflags);
> +		object = alloc_single_from_new_slab(s, slab, orig_size, allow_spin);
> 
>  		if (likely(object))
>  			goto success;
>  	} else {
> -		alloc_from_new_slab(s, slab, &object, 1, allow_spin);
> -
>  		/* we don't need to check SLAB_STORE_USER here */
> -		if (likely(object))
> +		if (alloc_from_new_slab(s, slab, &object, 1, allow_spin))
>  			return object;
>  	}
> 
> @@ -7251,10 +7249,6 @@ refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> 
>  	stat(s, ALLOC_SLAB);
> 
> -	/*
> -	 * TODO: possible optimization - if we know we will consume the whole
> -	 * slab we might skip creating the freelist?
> -	 */
>  	refilled += alloc_from_new_slab(s, slab, p + refilled, max - refilled,
>  					/* allow_spin = */ true);
> 
> @@ -7585,6 +7579,7 @@ static void early_kmem_cache_node_alloc(int node)
>  {
>  	struct slab *slab;
>  	struct kmem_cache_node *n;
> +	struct slab_obj_iter iter;
> 
>  	BUG_ON(kmem_cache_node->size < sizeof(struct kmem_cache_node));
> 
> @@ -7596,14 +7591,18 @@ static void early_kmem_cache_node_alloc(int node)
>  		pr_err("SLUB: Allocating a useless per node structure in order to be able to continue\n");
>  	}
> 
> -	n = slab->freelist;
> +	init_slab_obj_iter(kmem_cache_node, slab, &iter, true);
> +
> +	n = next_slab_obj(kmem_cache_node, &iter);
>  	BUG_ON(!n);
> +
> +	slab->inuse = 1;
> +	build_slab_freelist(kmem_cache_node, slab, &iter);
> +
>  #ifdef CONFIG_SLUB_DEBUG
>  	init_object(kmem_cache_node, n, SLUB_RED_ACTIVE);
>  #endif
>  	n = kasan_slab_alloc(kmem_cache_node, n, GFP_KERNEL, false);
> -	slab->freelist = get_freepointer(kmem_cache_node, n);
> -	slab->inuse = 1;
>  	kmem_cache_node->per_node[node].node = n;
>  	init_kmem_cache_node(n);
>  	inc_slabs_node(kmem_cache_node, node, slab->objects);


