From: Hao Li <hao.li@linux.dev>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: Harry Yoo <harry.yoo@oracle.com>,
	Petr Tesarik <ptesarik@suse.com>,
	 Christoph Lameter <cl@gentwo.org>,
	David Rientjes <rientjes@google.com>,
	 Roman Gushchin <roman.gushchin@linux.dev>,
	Andrew Morton <akpm@linux-foundation.org>,
	 Uladzislau Rezki <urezki@gmail.com>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	 Suren Baghdasaryan <surenb@google.com>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	 Alexei Starovoitov <ast@kernel.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	 linux-rt-devel@lists.linux.dev, bpf@vger.kernel.org,
	kasan-dev@googlegroups.com
Subject: Re: [PATCH v4 10/22] slab: add optimized sheaf refill from partial list
Date: Mon, 26 Jan 2026 15:12:03 +0800
Message-ID: <jgmmllqopl4rpihfe4jdnuifzexlffef5gehsocdcdu2xdj62j@xuz56etxseza>
In-Reply-To: <20260123-sheaves-for-all-v4-10-041323d506f7@suse.cz>

On Fri, Jan 23, 2026 at 07:52:48AM +0100, Vlastimil Babka wrote:
> At this point we have sheaves enabled for all caches, but their refill
> is done via __kmem_cache_alloc_bulk() which relies on cpu (partial)
> slabs - now a redundant caching layer that we are about to remove.
> 
> The refill will thus be done from slabs on the node partial list.
> Introduce new functions that can do that in an optimized way as it's
> easier than modifying the __kmem_cache_alloc_bulk() call chain.
> 
> Introduce struct partial_bulk_context, a variant of struct
> partial_context that can return a list of slabs from the partial list
> with the sum of free objects in them within the requested min and max.
> 
> Introduce get_partial_node_bulk() that removes the slabs from the node
> partial list and returns them in the list. There is a racy read of
> slab->counters, so make sure the non-atomic write in
> __update_freelist_slow() does not tear.
> 
> Introduce get_freelist_nofreeze() which grabs the freelist without
> freezing the slab.
> 
> Introduce alloc_from_new_slab() which can allocate multiple objects from
> a newly allocated slab where we don't need to synchronize with freeing.
> In some aspects it's similar to alloc_single_from_new_slab() but assumes
> the cache is a non-debug one so it can avoid some actions. It supports
> the allow_spin parameter, which we always set to true here, but a
> followup change will reuse the function in a context where it may be
> false.
> 
> Introduce __refill_objects() that uses the functions above to fill an
> array of objects. It has to handle the possibility that the slabs will
> contain more objects than were requested, due to concurrent freeing of
> objects to those slabs. When no more slabs on partial lists are
> available, it will allocate new slabs. It is intended to be used only
> in contexts where spinning is allowed, so add a WARN_ON_ONCE check there.
> 
> Finally, switch refill_sheaf() to use __refill_objects(). Sheaves are
> only refilled from contexts that allow spinning, or even blocking.
> 
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/slub.c | 293 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 272 insertions(+), 21 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 22acc249f9c0..142a1099bbc1 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -248,6 +248,14 @@ struct partial_context {
>  	void *object;
>  };
>  
> +/* Structure holding parameters for get_partial_node_bulk() */
> +struct partial_bulk_context {
> +	gfp_t flags;
> +	unsigned int min_objects;
> +	unsigned int max_objects;
> +	struct list_head slabs;
> +};
> +
>  static inline bool kmem_cache_debug(struct kmem_cache *s)
>  {
>  	return kmem_cache_debug_flags(s, SLAB_DEBUG_FLAGS);
> @@ -778,7 +786,8 @@ __update_freelist_slow(struct slab *slab, struct freelist_counters *old,
>  	slab_lock(slab);
>  	if (slab->freelist == old->freelist &&
>  	    slab->counters == old->counters) {
> -		slab->freelist = new->freelist;
> +		/* prevent tearing for the read in get_partial_node_bulk() */
> +		WRITE_ONCE(slab->freelist, new->freelist);

Should this perhaps be WRITE_ONCE(slab->counters, new->counters) instead?
The racy read in get_partial_node_bulk() is of slab->counters, not
slab->freelist, so it looks like the counters store is the one that needs
to be protected against tearing.
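
Roughly what I have in mind, as an untested sketch (same two existing
stores, just moving the WRITE_ONCE and the comment to the counters one):

		slab->freelist = new->freelist;
		/* prevent tearing for the read in get_partial_node_bulk() */
		WRITE_ONCE(slab->counters, new->counters);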

Everything else looks good to me.

Reviewed-by: Hao Li <hao.li@linux.dev>

-- 
Thanks,
Hao

>  		slab->counters = new->counters;
>  		ret = true;
>  	}
> @@ -2638,9 +2647,9 @@ static void free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf)
>  	stat(s, SHEAF_FREE);
>  }
>  
> -static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
> -				   size_t size, void **p);
> -
> +static unsigned int
> +__refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> +		 unsigned int max);
>  
>  static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
>  			 gfp_t gfp)
> @@ -2651,8 +2660,8 @@ static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
>  	if (!to_fill)
>  		return 0;
>  
> -	filled = __kmem_cache_alloc_bulk(s, gfp, to_fill,
> -					 &sheaf->objects[sheaf->size]);
> +	filled = __refill_objects(s, &sheaf->objects[sheaf->size], gfp,
> +			to_fill, to_fill);
>  
>  	sheaf->size += filled;
>  
> @@ -3518,6 +3527,57 @@ static inline void put_cpu_partial(struct kmem_cache *s, struct slab *slab,
>  #endif
>  static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags);
>  
> +static bool get_partial_node_bulk(struct kmem_cache *s,
> +				  struct kmem_cache_node *n,
> +				  struct partial_bulk_context *pc)
> +{
> +	struct slab *slab, *slab2;
> +	unsigned int total_free = 0;
> +	unsigned long flags;
> +
> +	/* Racy check to avoid taking the lock unnecessarily. */
> +	if (!n || data_race(!n->nr_partial))
> +		return false;
> +
> +	INIT_LIST_HEAD(&pc->slabs);
> +
> +	spin_lock_irqsave(&n->list_lock, flags);
> +
> +	list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
> +		struct freelist_counters flc;
> +		unsigned int slab_free;
> +
> +		if (!pfmemalloc_match(slab, pc->flags))
> +			continue;
> +
> +		/*
> +		 * determine the number of free objects in the slab racily
> +		 *
> +		 * slab_free is a lower bound due to possible subsequent
> +		 * concurrent freeing, so the caller may get more objects than
> +		 * requested and must handle that
> +		 */
> +		flc.counters = data_race(READ_ONCE(slab->counters));
> +		slab_free = flc.objects - flc.inuse;
> +
> +		/* we have already min and this would get us over the max */
> +		if (total_free >= pc->min_objects
> +		    && total_free + slab_free > pc->max_objects)
> +			break;
> +
> +		remove_partial(n, slab);
> +
> +		list_add(&slab->slab_list, &pc->slabs);
> +
> +		total_free += slab_free;
> +		if (total_free >= pc->max_objects)
> +			break;
> +	}
> +
> +	spin_unlock_irqrestore(&n->list_lock, flags);
> +	return total_free > 0;
> +}
> +
>  /*
>   * Try to allocate a partial slab from a specific node.
>   */
> @@ -4444,6 +4504,33 @@ static inline void *get_freelist(struct kmem_cache *s, struct slab *slab)
>  	return old.freelist;
>  }
>  
> +/*
> + * Get the slab's freelist and do not freeze it.
> + *
> + * Assumes the slab is isolated from node partial list and not frozen.
> + *
> + * Assumes this is performed only for caches without debugging so we
> + * don't need to worry about adding the slab to the full list.
> + */
> +static inline void *get_freelist_nofreeze(struct kmem_cache *s, struct slab *slab)
> +{
> +	struct freelist_counters old, new;
> +
> +	do {
> +		old.freelist = slab->freelist;
> +		old.counters = slab->counters;
> +
> +		new.freelist = NULL;
> +		new.counters = old.counters;
> +		VM_WARN_ON_ONCE(new.frozen);
> +
> +		new.inuse = old.objects;
> +
> +	} while (!slab_update_freelist(s, slab, &old, &new, "get_freelist_nofreeze"));
> +
> +	return old.freelist;
> +}
> +
>  /*
>   * Freeze the partial slab and return the pointer to the freelist.
>   */
> @@ -4467,6 +4554,72 @@ static inline void *freeze_slab(struct kmem_cache *s, struct slab *slab)
>  	return old.freelist;
>  }
>  
> +/*
> + * If the object has been wiped upon free, make sure it's fully initialized by
> + * zeroing out freelist pointer.
> + *
> + * Note that we also wipe custom freelist pointers.
> + */
> +static __always_inline void maybe_wipe_obj_freeptr(struct kmem_cache *s,
> +						   void *obj)
> +{
> +	if (unlikely(slab_want_init_on_free(s)) && obj &&
> +	    !freeptr_outside_object(s))
> +		memset((void *)((char *)kasan_reset_tag(obj) + s->offset),
> +			0, sizeof(void *));
> +}
> +
> +static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
> +		void **p, unsigned int count, bool allow_spin)
> +{
> +	unsigned int allocated = 0;
> +	struct kmem_cache_node *n;
> +	bool needs_add_partial;
> +	unsigned long flags;
> +	void *object;
> +
> +	/*
> +	 * Are we going to put the slab on the partial list?
> +	 * Note slab->inuse is 0 on a new slab.
> +	 */
> +	needs_add_partial = (slab->objects > count);
> +
> +	if (!allow_spin && needs_add_partial) {
> +
> +		n = get_node(s, slab_nid(slab));
> +
> +		if (!spin_trylock_irqsave(&n->list_lock, flags)) {
> +			/* Unlucky, discard newly allocated slab */
> +			defer_deactivate_slab(slab, NULL);
> +			return 0;
> +		}
> +	}
> +
> +	object = slab->freelist;
> +	while (object && allocated < count) {
> +		p[allocated] = object;
> +		object = get_freepointer(s, object);
> +		maybe_wipe_obj_freeptr(s, p[allocated]);
> +
> +		slab->inuse++;
> +		allocated++;
> +	}
> +	slab->freelist = object;
> +
> +	if (needs_add_partial) {
> +
> +		if (allow_spin) {
> +			n = get_node(s, slab_nid(slab));
> +			spin_lock_irqsave(&n->list_lock, flags);
> +		}
> +		add_partial(n, slab, DEACTIVATE_TO_HEAD);
> +		spin_unlock_irqrestore(&n->list_lock, flags);
> +	}
> +
> +	inc_slabs_node(s, slab_nid(slab), slab->objects);
> +	return allocated;
> +}
> +
>  /*
>   * Slow path. The lockless freelist is empty or we need to perform
>   * debugging duties.
> @@ -4909,21 +5062,6 @@ static __always_inline void *__slab_alloc_node(struct kmem_cache *s,
>  	return object;
>  }
>  
> -/*
> - * If the object has been wiped upon free, make sure it's fully initialized by
> - * zeroing out freelist pointer.
> - *
> - * Note that we also wipe custom freelist pointers.
> - */
> -static __always_inline void maybe_wipe_obj_freeptr(struct kmem_cache *s,
> -						   void *obj)
> -{
> -	if (unlikely(slab_want_init_on_free(s)) && obj &&
> -	    !freeptr_outside_object(s))
> -		memset((void *)((char *)kasan_reset_tag(obj) + s->offset),
> -			0, sizeof(void *));
> -}
> -
>  static __fastpath_inline
>  struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
>  {
> @@ -5384,6 +5522,9 @@ static int __prefill_sheaf_pfmemalloc(struct kmem_cache *s,
>  	return ret;
>  }
>  
> +static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
> +				   size_t size, void **p);
> +
>  /*
>   * returns a sheaf that has at least the requested size
>   * when prefilling is needed, do so with given gfp flags
> @@ -7484,6 +7625,116 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
>  }
>  EXPORT_SYMBOL(kmem_cache_free_bulk);
>  
> +static unsigned int
> +__refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> +		 unsigned int max)
> +{
> +	struct partial_bulk_context pc;
> +	struct slab *slab, *slab2;
> +	unsigned int refilled = 0;
> +	unsigned long flags;
> +	void *object;
> +	int node;
> +
> +	pc.flags = gfp;
> +	pc.min_objects = min;
> +	pc.max_objects = max;
> +
> +	node = numa_mem_id();
> +
> +	if (WARN_ON_ONCE(!gfpflags_allow_spinning(gfp)))
> +		return 0;
> +
> +	/* TODO: consider also other nodes? */
> +	if (!get_partial_node_bulk(s, get_node(s, node), &pc))
> +		goto new_slab;
> +
> +	list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
> +
> +		list_del(&slab->slab_list);
> +
> +		object = get_freelist_nofreeze(s, slab);
> +
> +		while (object && refilled < max) {
> +			p[refilled] = object;
> +			object = get_freepointer(s, object);
> +			maybe_wipe_obj_freeptr(s, p[refilled]);
> +
> +			refilled++;
> +		}
> +
> +		/*
> +		 * Freelist had more objects than we can accommodate, we need to
> +		 * free them back. We can treat it like a detached freelist, just
> +		 * need to find the tail object.
> +		 */
> +		if (unlikely(object)) {
> +			void *head = object;
> +			void *tail;
> +			int cnt = 0;
> +
> +			do {
> +				tail = object;
> +				cnt++;
> +				object = get_freepointer(s, object);
> +			} while (object);
> +			do_slab_free(s, slab, head, tail, cnt, _RET_IP_);
> +		}
> +
> +		if (refilled >= max)
> +			break;
> +	}
> +
> +	if (unlikely(!list_empty(&pc.slabs))) {
> +		struct kmem_cache_node *n = get_node(s, node);
> +
> +		spin_lock_irqsave(&n->list_lock, flags);
> +
> +		list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
> +
> +			if (unlikely(!slab->inuse && n->nr_partial >= s->min_partial))
> +				continue;
> +
> +			list_del(&slab->slab_list);
> +			add_partial(n, slab, DEACTIVATE_TO_HEAD);
> +		}
> +
> +		spin_unlock_irqrestore(&n->list_lock, flags);
> +
> +		/* any slabs left are completely free and for discard */
> +		list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
> +
> +			list_del(&slab->slab_list);
> +			discard_slab(s, slab);
> +		}
> +	}
> +
> +
> +	if (likely(refilled >= min))
> +		goto out;
> +
> +new_slab:
> +
> +	slab = new_slab(s, pc.flags, node);
> +	if (!slab)
> +		goto out;
> +
> +	stat(s, ALLOC_SLAB);
> +
> +	/*
> +	 * TODO: possible optimization - if we know we will consume the whole
> +	 * slab we might skip creating the freelist?
> +	 */
> +	refilled += alloc_from_new_slab(s, slab, p + refilled, max - refilled,
> +					/* allow_spin = */ true);
> +
> +	if (refilled < min)
> +		goto new_slab;
> +out:
> +
> +	return refilled;
> +}
> +
>  static inline
>  int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
>  			    void **p)
> 
> -- 
> 2.52.0
> 

