From: Harry Yoo <harry.yoo@oracle.com>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: Suren Baghdasaryan <surenb@google.com>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Christoph Lameter <cl@linux.com>,
	David Rientjes <rientjes@google.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Uladzislau Rezki <urezki@gmail.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	rcu@vger.kernel.org, maple-tree@lists.infradead.org
Subject: Re: [PATCH RFC v3 2/8] slab: add opt-in caching layer of percpu sheaves
Date: Thu, 3 Apr 2025 17:31:20 +0900
Message-ID: <Z-5HWApFjrOr7Q8_@harry>
In-Reply-To: <20250317-slub-percpu-caches-v3-2-9d9884d8b643@suse.cz>

On Mon, Mar 17, 2025 at 03:33:03PM +0100, Vlastimil Babka wrote:
> Specifying a non-zero value for a new struct kmem_cache_args field
> sheaf_capacity will set up a caching layer of percpu arrays called
> sheaves of given capacity for the created cache.
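
[ For illustration, a minimal opt-in sketch; the cache name, object type,
  capacity and flags below are made up, only the .sheaf_capacity field
  comes from this patch:

	struct kmem_cache_args args = {
		.sheaf_capacity = 32,
	};
	struct kmem_cache *cache;

	cache = kmem_cache_create("foo_cache", sizeof(struct foo),
				  &args, SLAB_HWCACHE_ALIGN);
]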
> 
> Allocations from the cache will allocate via the percpu sheaves (main or
> spare) as long as they have no NUMA node preference. Frees will also
> refill one of the sheaves.
> 
> When both percpu sheaves are found empty during an allocation, an empty
> sheaf may be replaced with a full one from the per-node barn. If none
> are available and the allocation is allowed to block, an empty sheaf is
> refilled from slab(s) by an internal bulk alloc operation. When both
> percpu sheaves are full during freeing, the barn can replace a full one
> with an empty one, unless over a full sheaves limit. In that case a
> sheaf is flushed to slab(s) by an internal bulk free operation. Flushing
> sheaves and barns is also wired to the existing cpu flushing and cache
> shrinking operations.
> 
> The sheaves do not distinguish NUMA locality of the cached objects. If
> an allocation is requested with kmem_cache_alloc_node() with a specific
> node (not NUMA_NO_NODE), sheaves are bypassed.
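
[ I.e., assuming the sheaf-enabled cache from the sketch above, with nid
  being some specific node id:

	void *obj;

	/* no node preference -> may be served from the percpu sheaves */
	obj = kmem_cache_alloc(cache, GFP_KERNEL);

	/* explicit node requested -> sheaves are bypassed */
	obj = kmem_cache_alloc_node(cache, GFP_KERNEL, nid);
]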
> 
> The bulk operations exposed to slab users also try to utilize the
> sheaves as long as the necessary (full or empty) sheaves are available
> on the cpu or in the barn. Once depleted, they will fall back to bulk
> alloc/free to slabs directly to avoid double copying.
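
[ The bulk API itself is unchanged for callers; a usage-level sketch, with
  the batch size and cache pointer again being illustrative:

	void *objs[16];
	int allocated;

	/* uses full/empty sheaves and the barn while available,
	 * then bulk-allocates from slabs directly */
	allocated = kmem_cache_alloc_bulk(cache, GFP_KERNEL,
					  ARRAY_SIZE(objs), objs);

	/* fills sheaves while possible, otherwise bulk-frees to slabs */
	kmem_cache_free_bulk(cache, allocated, objs);
]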
> 
> Sysfs stat counters alloc_cpu_sheaf and free_cpu_sheaf count objects
> allocated or freed using the sheaves. Counters sheaf_refill,
> sheaf_flush_main and sheaf_flush_other count objects filled or flushed
> from or to slab pages, and can be used to assess how effective the
> caching is. The refill and flush operations will also count towards the
> usual alloc_fastpath/slowpath, free_fastpath/slowpath and other
> counters.
> 
> Access to the percpu sheaves is protected by localtry_trylock() when
> potential callers include irq context, and localtry_lock() otherwise
> (such as when we already know the gfp flags allow blocking). The trylock
> failures should be rare and we can easily fall back. Each per-NUMA-node
> barn has a spin_lock.
> 
> A current limitation is that when slub_debug is enabled for a cache with
> percpu sheaves, the objects in the array are considered as allocated from
> the slub_debug perspective, and the alloc/free debugging hooks occur
> when moving the objects between the array and slab pages. This means
> that e.g. a use-after-free that occurs for an object cached in the
> array is undetected. Collected alloc/free stacktraces might also be less
> useful. This limitation could be changed in the future.
> 
> On the other hand, KASAN, kmemcg and other hooks are executed on actual
> allocations and frees by kmem_cache users even if those use the array,
> so their debugging or accounting accuracy should be unaffected.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  include/linux/slab.h |   34 ++
>  mm/slab.h            |    2 +
>  mm/slab_common.c     |    5 +-
>  mm/slub.c            | 1029 +++++++++++++++++++++++++++++++++++++++++++++++---
>  4 files changed, 1020 insertions(+), 50 deletions(-)
> 
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 7686054dd494cc65def7f58748718e03eb78e481..0e1b25228c77140d05b5b4433c9d7923de36ec05 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -453,12 +489,19 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
>   */
>  static nodemask_t slab_nodes;
>  
> -#ifndef CONFIG_SLUB_TINY
>  /*
>   * Workqueue used for flush_cpu_slab().
>   */
>  static struct workqueue_struct *flushwq;
> -#endif
> +
> +struct slub_flush_work {
> +	struct work_struct work;
> +	struct kmem_cache *s;
> +	bool skip;
> +};
> +
> +static DEFINE_MUTEX(flush_lock);
> +static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
>  
>  /********************************************************************
>   * 			Core slab cache functions
> @@ -2410,6 +2453,358 @@ static void *setup_object(struct kmem_cache *s, void *object)
>  	return object;
>  }

> +/*
> + * Bulk free objects to the percpu sheaves.
> + * Unlike free_to_pcs() this includes the calls to all necessary hooks
> + * and the fallback to freeing to slab pages.
> + */
> +static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
> +{

[...snip...]

> +next_batch:
> +	if (!localtry_trylock(&s->cpu_sheaves->lock))
> +		goto fallback;
> +
> +	pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +	if (unlikely(pcs->main->size == s->sheaf_capacity)) {
> +
> +		struct slab_sheaf *empty;
> +
> +		if (!pcs->spare) {
> +			empty = barn_get_empty_sheaf(pcs->barn);
> +			if (empty) {
> +				pcs->spare = pcs->main;
> +				pcs->main = empty;
> +				goto do_free;
> +			}
> +			goto no_empty;

Maybe a silly question, but since neither alloc_from_pcs_bulk() nor
free_to_pcs_bulk() allocates empty sheaves (they only sometimes put empty
or full sheaves back in the barn), shouldn't we expect the barn to usually
have no sheaves available when only the bulk interfaces are used?

> +		}
> +
> +		if (pcs->spare->size < s->sheaf_capacity) {
> +			stat(s, SHEAF_SWAP);
> +			swap(pcs->main, pcs->spare);
> +			goto do_free;
> +		}
> +
> +		empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
> +
> +		if (!IS_ERR(empty)) {
> +			pcs->main = empty;
> +			goto do_free;
> +		}
> +
> +no_empty:
> +		localtry_unlock(&s->cpu_sheaves->lock);
> +
> +		/*
> +		 * if we depleted all empty sheaves in the barn or there are too
> +		 * many full sheaves, free the rest to slab pages
> +		 */
> +fallback:
> +		__kmem_cache_free_bulk(s, size, p);
> +		return;
> +	}
> +
> +do_free:
> +	main = pcs->main;
> +	batch = min(size, s->sheaf_capacity - main->size);
> +
> +	memcpy(main->objects + main->size, p, batch * sizeof(void *));
> +	main->size += batch;
> +
> +	localtry_unlock(&s->cpu_sheaves->lock);
> +
> +	stat_add(s, FREE_PCS, batch);
> +
> +	if (batch < size) {
> +		p += batch;
> +		size -= batch;
> +		goto next_batch;
> +	}
> +}
> +
>  #ifndef CONFIG_SLUB_TINY
>  /*
>   * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
> @@ -5309,8 +6145,8 @@ static inline int calculate_order(unsigned int size)
>  	return -ENOSYS;
>  }
>  
> -static void
> -init_kmem_cache_node(struct kmem_cache_node *n)
> +static bool
> +init_kmem_cache_node(struct kmem_cache_node *n, struct node_barn *barn)
>  {

Why is the return type bool, when it always succeeds?

>  	n->nr_partial = 0;
>  	spin_lock_init(&n->list_lock);
> @@ -5320,6 +6156,11 @@ init_kmem_cache_node(struct kmem_cache_node *n)
>  	atomic_long_set(&n->total_objects, 0);
>  	INIT_LIST_HEAD(&n->full);
>  #endif
> +	n->barn = barn;
> +	if (barn)
> +		barn_init(barn);
> +
> +	return true;
>  }
>  
>  #ifndef CONFIG_SLUB_TINY
> @@ -5385,7 +6250,7 @@ static void early_kmem_cache_node_alloc(int node)
>  	slab->freelist = get_freepointer(kmem_cache_node, n);
>  	slab->inuse = 1;
>  	kmem_cache_node->node[node] = n;
> -	init_kmem_cache_node(n);
> +	init_kmem_cache_node(n, NULL);
>  	inc_slabs_node(kmem_cache_node, node, slab->objects);
>  
>  	/*
> @@ -5421,20 +6295,27 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
>  
>  	for_each_node_mask(node, slab_nodes) {
>  		struct kmem_cache_node *n;
> +		struct node_barn *barn = NULL;
>  
>  		if (slab_state == DOWN) {
>  			early_kmem_cache_node_alloc(node);
>  			continue;
>  		}
> +
> +		if (s->cpu_sheaves) {
> +			barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, node);
> +
> +			if (!barn)
> +				return 0;
> +		}
> +
>  		n = kmem_cache_alloc_node(kmem_cache_node,
>  						GFP_KERNEL, node);
> -
> -		if (!n) {
> -			free_kmem_cache_nodes(s);
> +		if (!n)
>  			return 0;
> -		}

Looks like it's leaking the barn
if the allocation of kmem_cache_node fails?
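
A minimal sketch of one way to plug it (illustrative only, not a proposed
hunk) would be to free the barn on that error path; kfree() of a NULL barn
is harmless for caches without sheaves:

	n = kmem_cache_alloc_node(kmem_cache_node, GFP_KERNEL, node);
	if (!n) {
		kfree(barn);	/* don't leak the barn allocated above */
		return 0;
	}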

> -		init_kmem_cache_node(n);
> +		init_kmem_cache_node(n, barn);
> +
>  		s->node[node] = n;
>  	}
>  	return 1;
> @@ -6005,12 +6891,24 @@ static int slab_mem_going_online_callback(void *arg)
>  	 */
>  	mutex_lock(&slab_mutex);
>  	list_for_each_entry(s, &slab_caches, list) {
> +		struct node_barn *barn = NULL;
> +
>  		/*
>  		 * The structure may already exist if the node was previously
>  		 * onlined and offlined.
>  		 */
>  		if (get_node(s, nid))
>  			continue;
> +
> +		if (s->cpu_sheaves) {
> +			barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, nid);
> +
> +			if (!barn) {
> +				ret = -ENOMEM;
> +				goto out;
> +			}
> +		}
> +

Ditto.

Otherwise looks good to me :)

>  		/*
>  		 * XXX: kmem_cache_alloc_node will fallback to other nodes
>  		 *      since memory is not yet available from the node that
> @@ -6021,7 +6919,9 @@ static int slab_mem_going_online_callback(void *arg)
>  			ret = -ENOMEM;
>  			goto out;
>  		}
> -		init_kmem_cache_node(n);
> +
> +		init_kmem_cache_node(n, barn);
> +
>  		s->node[nid] = n;
>  	}
>  	/*

-- 
Cheers,
Harry (formerly known as Hyeonggon)



Thread overview: 20+ messages
2025-03-17 14:33 [PATCH RFC v3 0/8] SLUB " Vlastimil Babka
2025-03-17 14:33 ` [PATCH RFC v3 1/8] locking/local_lock: Introduce localtry_lock_t Vlastimil Babka
2025-03-17 14:33 ` [PATCH RFC v3 2/8] slab: add opt-in caching layer of percpu sheaves Vlastimil Babka
2025-04-03  8:31   ` Harry Yoo [this message]
2025-04-03 14:11     ` Vlastimil Babka
2025-04-10 19:51   ` Suren Baghdasaryan
2025-04-22 15:02     ` Vlastimil Babka
2025-03-17 14:33 ` [PATCH RFC v3 3/8] slab: add sheaf support for batching kfree_rcu() operations Vlastimil Babka
2025-04-09  1:50   ` Harry Yoo
2025-04-09 15:09     ` Vlastimil Babka
2025-04-10 20:24   ` Suren Baghdasaryan
2025-04-22 15:18     ` Vlastimil Babka
2025-03-17 14:33 ` [PATCH RFC v3 4/8] slab: sheaf prefilling for guaranteed allocations Vlastimil Babka
2025-04-10 20:47   ` Suren Baghdasaryan
2025-04-23 13:06     ` Vlastimil Babka
2025-04-23 17:13       ` Suren Baghdasaryan
2025-03-17 14:33 ` [PATCH RFC v3 5/8] slab: determine barn status racily outside of lock Vlastimil Babka
2025-03-17 14:33 ` [PATCH RFC v3 6/8] tools: Add testing support for changes to rcu and slab for sheaves Vlastimil Babka
2025-03-17 14:33 ` [PATCH RFC v3 7/8] tools: Add sheafs support to testing infrastructure Vlastimil Babka
2025-03-17 14:33 ` [PATCH RFC v3 8/8] maple_tree: use percpu sheaves for maple_node_cache Vlastimil Babka
