[PATCH RFC v2 00/10] SLUB percpu sheaves

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH RFC v2 00/10] SLUB percpu sheaves
@ 2025-02-14 16:27 Vlastimil Babka
  2025-02-14 16:27 ` [PATCH RFC v2 01/10] slab: add opt-in caching layer of " Vlastimil Babka
                   ` (11 more replies)
  0 siblings, 12 replies; 55+ messages in thread
From: Vlastimil Babka @ 2025-02-14 16:27 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, Vlastimil Babka,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Liam R. Howlett

Hi,

This is the v2 RFC to add an opt-in percpu array-based caching layer to
SLUB. The name "sheaf" was invented by Matthew so we don't call it
magazine like the original Bonwick paper. The per-NUMA-node cache of
sheaves is thus called "barn".

This may seem similar to the arrays in SLAB, but the main differences
are:

- opt-in, not used for every cache
- does not distinguish NUMA locality, thus no "alien" arrays that would
  need periodical flushing
- improves kfree_rcu() handling
- API for obtaining a preallocated sheaf that can be used for guaranteed
  and efficient allocations in a restricted context, when the upper
  bound for needed objects is known but rarely reached

The motivation comes mainly from the ongoing work related to VMA
scalability and the related maple tree operations. This is why maple
tree nodes are sheaf-enabled in the RFC, but it's not a full conversion
that would take benefits of the improved preallocation API. The VMA part
is currently left out as it's expected that Suren will land the VMA
TYPESAFE_BY_RCU conversion [3] soon and there would be conflict with that.
With both series applied it means just adding a line to kmem_cache_args
in proc_caches_init().

Some performance benefits were measured by Suren and Liam in previous
versions. I hope to have those numbers posted public as both this work
and the VMA and maple tree changes stabilize.

A sheaf-enabled cache has the following expected advantages:

- Cheaper fast paths. For allocations, instead of local double cmpxchg,
  after Patch 5 it's preempt_disable() and no atomic operations. Same for
  freeing, which is normally a local double cmpxchg only for a short
  term allocations (so the same slab is still active on the same cpu when
  freeing the object) and a more costly locked double cmpxchg otherwise.
  The downside is the lack of NUMA locality guarantees for the allocated
  objects.

- kfree_rcu() batching and recycling. kfree_rcu() will put objects to a
  separate percpu sheaf and only submit the whole sheaf to call_rcu()
  when full. After the grace period, the sheaf can be used for
  allocations, which is more efficient than freeing and reallocating
  individual slab objects (even with the batching done by kfree_rcu()
  implementation itself). In case only some cpus are allowed to handle rcu
  callbacks, the sheaf can still be made available to other cpus on the
  same node via the shared barn. The maple_node cache uses kfree_rcu() and
  thus can benefit from this.

- Preallocation support. A prefilled sheaf can be privately borrowed for
  a short term operation that is not allowed to block in the middle and
  may need to allocate some objects. If an upper bound (worst case) for
  the number of allocations is known, but only much fewer allocations
  actually needed on average, borrowing and returning a sheaf is much more
  efficient then a bulk allocation for the worst case followed by a bulk
  free of the many unused objects. Maple tree write operations should
  benefit from this.

Patch 1 implements the basic sheaf functionality and using
local_lock_irqsave() for percpu sheaf locking.

Patch 2 adds the kfree_rcu() support.

Patch 3 is copied from the series "bpf, mm: Introduce try_alloc_pages()"
[2] to introduce a variant of local_lock that has a trylock operation.
Patch 4 adds a variant of the trylock without _irqsave. Patch 5 converts
percpu sheaves locking to the new variant of the lock.

Patch 6 implements borrowing prefilled sheaves, with maple tree being the
ancticipated user.

Patch 7 seeks to reduce barn spinlock contention. Separately for
possible evaluation.

Patches 8 and 9 by Liam add testing stubs that maple tree will use in
its userspace tests.

Patch 10 enables sheaves for the maple tree node cache, but does not
take advantage of prefilling yet.

(RFC) LIMITATIONS:

- with slub_debug enabled, objects in sheaves are considered allocated
  so allocation/free stacktraces may become imprecise and checking of
  e.g. redzone violations may be delayed

GIT TREES:

this series: https://git.kernel.org/vbabka/l/slub-percpu-sheaves-v2

To avoid conflicts, the series requires (and the branch above is based
on) the kfree_rcu() code refactoring scheduled for 6.15:

https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab/for-6.15/kfree_rcu_tiny

To facilitate testing/benchmarking, there's also a branch with Liam's
maple tree changes from [4] adapted to the current code:

https://git.kernel.org/vbabka/l/slub-percpu-sheaves-v2-maple

There are also two optimization patches for sheaves by Liam for
evaluation as I suspect they might not be universal win.

Vlastimil

[1] https://lore.kernel.org/all/20241111205506.3404479-1-surenb@google.com/
[2] https://lore.kernel.org/all/20250213033556.9534-4-alexei.starovoitov@gmail.com/
[3] https://lore.kernel.org/all/20250213224655.1680278-1-surenb@google.com/
[4] https://www.infradead.org/git/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/slub-percpu-sheaves-v2

---
Changes in v2:
- Removed kfree_rcu() destructors support as VMAs will not need it
  anymore after [3] is merged.
- Changed to localtry_lock_t borrowed from [2] instead of an own
  implementation of the same idea.
- Many fixes and improvements thanks to Liam's adoption for maple tree
  nodes.
- Userspace Testing stubs by Liam.
- Reduced limitations/todos - hooking to kfree_rcu() is complete,
  prefilled sheaves can exceed cache's sheaf_capacity.
- Link to v1: https://lore.kernel.org/r/20241112-slub-percpu-caches-v1-0-ddc0bdc27e05@suse.cz

---
Liam R. Howlett (2):
      tools: Add testing support for changes to rcu and slab for sheaves
      tools: Add sheafs support to testing infrastructure

Sebastian Andrzej Siewior (1):
      locking/local_lock: Introduce localtry_lock_t

Vlastimil Babka (7):
      slab: add opt-in caching layer of percpu sheaves
      slab: add sheaf support for batching kfree_rcu() operations
      locking/local_lock: add localtry_trylock()
      slab: switch percpu sheaves locking to localtry_lock
      slab: sheaf prefilling for guaranteed allocations
      slab: determine barn status racily outside of lock
      maple_tree: use percpu sheaves for maple_node_cache

 include/linux/local_lock.h            |   70 ++
 include/linux/local_lock_internal.h   |  146 ++++
 include/linux/slab.h                  |   50 ++
 lib/maple_tree.c                      |   11 +-
 mm/slab.h                             |    4 +
 mm/slab_common.c                      |   26 +-
 mm/slub.c                             | 1403 +++++++++++++++++++++++++++++++--
 tools/include/linux/slab.h            |   65 +-
 tools/testing/shared/linux.c          |  108 ++-
 tools/testing/shared/linux/rcupdate.h |   22 +
 10 files changed, 1840 insertions(+), 65 deletions(-)
---
base-commit: 379487e17ca406b47392e7ab6cf35d1c3bacb371
change-id: 20231128-slub-percpu-caches-9441892011d7
prerequisite-message-id: 20250203-slub-tiny-kfree_rcu-v1-0-d4428bf9a8a1@suse.cz
prerequisite-patch-id: 1a4af92b5eb1b8bfc86bac8d7fc1ef0963e7d9d6
prerequisite-patch-id: f24a39c38103b7e09fbf2e6f84e6108499ab7980
prerequisite-patch-id: 23e90b23482f4775c95295821dd779ba4e3712e9
prerequisite-patch-id: 5c53a619477acdce07071abec0f40e79501ea40b

Best regards,
-- 
Vlastimil Babka <vbabka@suse.cz>



^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH RFC v2 01/10] slab: add opt-in caching layer of percpu sheaves
  2025-02-14 16:27 [PATCH RFC v2 00/10] SLUB percpu sheaves Vlastimil Babka
@ 2025-02-14 16:27 ` Vlastimil Babka
  2025-02-22 22:46   ` Suren Baghdasaryan
  2025-02-24  8:04   ` Harry Yoo
  2025-02-14 16:27 ` [PATCH RFC v2 02/10] slab: add sheaf support for batching kfree_rcu() operations Vlastimil Babka
                   ` (10 subsequent siblings)
  11 siblings, 2 replies; 55+ messages in thread
From: Vlastimil Babka @ 2025-02-14 16:27 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, Vlastimil Babka

Specifying a non-zero value for a new struct kmem_cache_args field
sheaf_capacity will setup a caching layer of percpu arrays called
sheaves of given capacity for the created cache.

Allocations from the cache will allocate via the percpu sheaves (main or
spare) as long as they have no NUMA node preference. Frees will also
refill one of the sheaves.

When both percpu sheaves are found empty during an allocation, an empty
sheaf may be replaced with a full one from the per-node barn. If none
are available and the allocation is allowed to block, an empty sheaf is
refilled from slab(s) by an internal bulk alloc operation. When both
percpu sheaves are full during freeing, the barn can replace a full one
with an empty one, unless over a full sheaves limit. In that case a
sheaf is flushed to slab(s) by an internal bulk free operation. Flushing
sheaves and barns is also wired to the existing cpu flushing and cache
shrinking operations.

The sheaves do not distinguish NUMA locality of the cached objects. If
an allocation is requested with kmem_cache_alloc_node() with a specific
node (not NUMA_NO_NODE), sheaves are bypassed.

The bulk operations exposed to slab users also try to utilize the
sheaves as long as the necessary (full or empty) sheaves are available
on the cpu or in the barn. Once depleted, they will fallback to bulk
alloc/free to slabs directly to avoid double copying.

Sysfs stat counters alloc_cpu_sheaf and free_cpu_sheaf count objects
allocated or freed using the sheaves. Counters sheaf_refill,
sheaf_flush_main and sheaf_flush_other count objects filled or flushed
from or to slab pages, and can be used to assess how effective the
caching is. The refill and flush operations will also count towards the
usual alloc_fastpath/slowpath, free_fastpath/slowpath and other
counters.

Access to the percpu sheaves is protected by local_lock_irqsave()
operations, each per-NUMA-node barn has a spin_lock.

A current limitation is that when slub_debug is enabled for a cache with
percpu sheaves, the objects in the array are considered as allocated from
the slub_debug perspective, and the alloc/free debugging hooks occur
when moving the objects between the array and slab pages. This means
that e.g. an use-after-free that occurs for an object cached in the
array is undetected. Collected alloc/free stacktraces might also be less
useful. This limitation could be changed in the future.

On the other hand, KASAN, kmemcg and other hooks are executed on actual
allocations and frees by kmem_cache users even if those use the array,
so their debugging or accounting accuracy should be unaffected.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/slab.h |  34 ++
 mm/slab.h            |   2 +
 mm/slab_common.c     |   5 +-
 mm/slub.c            | 982 ++++++++++++++++++++++++++++++++++++++++++++++++---
 4 files changed, 973 insertions(+), 50 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 7686054dd494cc65def7f58748718e03eb78e481..0e1b25228c77140d05b5b4433c9d7923de36ec05 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -332,6 +332,40 @@ struct kmem_cache_args {
 	 * %NULL means no constructor.
 	 */
 	void (*ctor)(void *);
+	/**
+	 * @sheaf_capacity: Enable sheaves of given capacity for the cache.
+	 *
+	 * With a non-zero value, allocations from the cache go through caching
+	 * arrays called sheaves. Each cpu has a main sheaf that's always
+	 * present, and a spare sheaf thay may be not present. When both become
+	 * empty, there's an attempt to replace an empty sheaf with a full sheaf
+	 * from the per-node barn.
+	 *
+	 * When no full sheaf is available, and gfp flags allow blocking, a
+	 * sheaf is allocated and filled from slab(s) using bulk allocation.
+	 * Otherwise the allocation falls back to the normal operation
+	 * allocating a single object from a slab.
+	 *
+	 * Analogically when freeing and both percpu sheaves are full, the barn
+	 * may replace it with an empty sheaf, unless it's over capacity. In
+	 * that case a sheaf is bulk freed to slab pages.
+	 *
+	 * The sheaves does not distinguish NUMA placement of objects, so
+	 * allocations via kmem_cache_alloc_node() with a node specified other
+	 * than NUMA_NO_NODE will bypass them.
+	 *
+	 * Bulk allocation and free operations also try to use the cpu sheaves
+	 * and barn, but fallback to using slab pages directly.
+	 *
+	 * Limitations: when slub_debug is enabled for the cache, all relevant
+	 * actions (i.e. poisoning, obtaining stacktraces) and checks happen
+	 * when objects move between sheaves and slab pages, which may result in
+	 * e.g. not detecting a use-after-free while the object is in the array
+	 * cache, and the stacktraces may be less useful.
+	 *
+	 * %0 means no sheaves will be created
+	 */
+	unsigned int sheaf_capacity;
 };
 
 struct kmem_cache *__kmem_cache_create_args(const char *name,
diff --git a/mm/slab.h b/mm/slab.h
index 2f01c7317988ce036f0b22807403226a59f0f708..8daaec53b6ecfc44171191d421adb12e5cba2c58 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -259,6 +259,7 @@ struct kmem_cache {
 #ifndef CONFIG_SLUB_TINY
 	struct kmem_cache_cpu __percpu *cpu_slab;
 #endif
+	struct slub_percpu_sheaves __percpu *cpu_sheaves;
 	/* Used for retrieving partial slabs, etc. */
 	slab_flags_t flags;
 	unsigned long min_partial;
@@ -272,6 +273,7 @@ struct kmem_cache {
 	/* Number of per cpu partial slabs to keep around */
 	unsigned int cpu_partial_slabs;
 #endif
+	unsigned int sheaf_capacity;
 	struct kmem_cache_order_objects oo;
 
 	/* Allocation and freeing of slabs */
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 46d0a4cd33b5982fd79c307d572f231fdea9514a..ceeefb287899a82f30ad79b403556001c1860311 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -163,6 +163,9 @@ int slab_unmergeable(struct kmem_cache *s)
 		return 1;
 #endif
 
+	if (s->cpu_sheaves)
+		return 1;
+
 	/*
 	 * We may have set a slab to be unmergeable during bootstrap.
 	 */
@@ -328,7 +331,7 @@ struct kmem_cache *__kmem_cache_create_args(const char *name,
 		    object_size - args->usersize < args->useroffset))
 		args->usersize = args->useroffset = 0;
 
-	if (!args->usersize)
+	if (!args->usersize && !args->sheaf_capacity)
 		s = __kmem_cache_alias(name, object_size, args->align, flags,
 				       args->ctor);
 	if (s)
diff --git a/mm/slub.c b/mm/slub.c
index e8273f28656936c05d015c53923f8fe69cd161b2..c06734912972b799f537359f7fe6a750918ffe9e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -346,8 +346,10 @@ static inline void debugfs_slab_add(struct kmem_cache *s) { }
 #endif
 
 enum stat_item {
+	ALLOC_PCS,		/* Allocation from percpu sheaf */
 	ALLOC_FASTPATH,		/* Allocation from cpu slab */
 	ALLOC_SLOWPATH,		/* Allocation by getting a new cpu slab */
+	FREE_PCS,		/* Free to percpu sheaf */
 	FREE_FASTPATH,		/* Free to cpu slab */
 	FREE_SLOWPATH,		/* Freeing not to cpu slab */
 	FREE_FROZEN,		/* Freeing to frozen slab */
@@ -372,6 +374,12 @@ enum stat_item {
 	CPU_PARTIAL_FREE,	/* Refill cpu partial on free */
 	CPU_PARTIAL_NODE,	/* Refill cpu partial from node partial */
 	CPU_PARTIAL_DRAIN,	/* Drain cpu partial to node partial */
+	SHEAF_FLUSH_MAIN,	/* Objects flushed from main percpu sheaf */
+	SHEAF_FLUSH_OTHER,	/* Objects flushed from other sheaves */
+	SHEAF_REFILL,		/* Objects refilled to a sheaf */
+	SHEAF_SWAP,		/* Swapping main and spare sheaf */
+	SHEAF_ALLOC,		/* Allocation of an empty sheaf */
+	SHEAF_FREE,		/* Freeing of an empty sheaf */
 	NR_SLUB_STAT_ITEMS
 };
 
@@ -418,6 +426,35 @@ void stat_add(const struct kmem_cache *s, enum stat_item si, int v)
 #endif
 }
 
+#define MAX_FULL_SHEAVES	10
+#define MAX_EMPTY_SHEAVES	10
+
+struct node_barn {
+	spinlock_t lock;
+	struct list_head sheaves_full;
+	struct list_head sheaves_empty;
+	unsigned int nr_full;
+	unsigned int nr_empty;
+};
+
+struct slab_sheaf {
+	union {
+		struct rcu_head rcu_head;
+		struct list_head barn_list;
+	};
+	struct kmem_cache *cache;
+	unsigned int size;
+	void *objects[];
+};
+
+struct slub_percpu_sheaves {
+	local_lock_t lock;
+	struct slab_sheaf *main; /* never NULL when unlocked */
+	struct slab_sheaf *spare; /* empty or full, may be NULL */
+	struct slab_sheaf *rcu_free;
+	struct node_barn *barn;
+};
+
 /*
  * The slab lists for all objects.
  */
@@ -430,6 +467,7 @@ struct kmem_cache_node {
 	atomic_long_t total_objects;
 	struct list_head full;
 #endif
+	struct node_barn *barn;
 };
 
 static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
@@ -453,12 +491,19 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
  */
 static nodemask_t slab_nodes;
 
-#ifndef CONFIG_SLUB_TINY
 /*
  * Workqueue used for flush_cpu_slab().
  */
 static struct workqueue_struct *flushwq;
-#endif
+
+struct slub_flush_work {
+	struct work_struct work;
+	struct kmem_cache *s;
+	bool skip;
+};
+
+static DEFINE_MUTEX(flush_lock);
+static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
 
 /********************************************************************
  * 			Core slab cache functions
@@ -2410,6 +2455,349 @@ static void *setup_object(struct kmem_cache *s, void *object)
 	return object;
 }
 
+static struct slab_sheaf *alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp)
+{
+	struct slab_sheaf *sheaf = kzalloc(struct_size(sheaf, objects,
+					s->sheaf_capacity), gfp);
+
+	if (unlikely(!sheaf))
+		return NULL;
+
+	sheaf->cache = s;
+
+	stat(s, SHEAF_ALLOC);
+
+	return sheaf;
+}
+
+static void free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf)
+{
+	kfree(sheaf);
+
+	stat(s, SHEAF_FREE);
+}
+
+static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
+				   size_t size, void **p);
+
+
+static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
+			 gfp_t gfp)
+{
+	int to_fill = s->sheaf_capacity - sheaf->size;
+	int filled;
+
+	if (!to_fill)
+		return 0;
+
+	filled = __kmem_cache_alloc_bulk(s, gfp, to_fill,
+					 &sheaf->objects[sheaf->size]);
+
+	if (!filled)
+		return -ENOMEM;
+
+	sheaf->size = s->sheaf_capacity;
+
+	stat_add(s, SHEAF_REFILL, filled);
+
+	return 0;
+}
+
+
+static struct slab_sheaf *alloc_full_sheaf(struct kmem_cache *s, gfp_t gfp)
+{
+	struct slab_sheaf *sheaf = alloc_empty_sheaf(s, gfp);
+
+	if (!sheaf)
+		return NULL;
+
+	if (refill_sheaf(s, sheaf, gfp)) {
+		free_empty_sheaf(s, sheaf);
+		return NULL;
+	}
+
+	return sheaf;
+}
+
+/*
+ * Maximum number of objects freed during a single flush of main pcs sheaf.
+ * Translates directly to an on-stack array size.
+ */
+#define PCS_BATCH_MAX	32U
+
+static void __kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);
+
+static void sheaf_flush_main(struct kmem_cache *s)
+{
+	struct slub_percpu_sheaves *pcs;
+	unsigned int batch, remaining;
+	void *objects[PCS_BATCH_MAX];
+	struct slab_sheaf *sheaf;
+	unsigned long flags;
+
+next_batch:
+	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+	sheaf = pcs->main;
+
+	batch = min(PCS_BATCH_MAX, sheaf->size);
+
+	sheaf->size -= batch;
+	memcpy(objects, sheaf->objects + sheaf->size, batch * sizeof(void *));
+
+	remaining = sheaf->size;
+
+	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+
+	__kmem_cache_free_bulk(s, batch, &objects[0]);
+
+	stat_add(s, SHEAF_FLUSH_MAIN, batch);
+
+	if (remaining)
+		goto next_batch;
+}
+
+static void sheaf_flush(struct kmem_cache *s, struct slab_sheaf *sheaf)
+{
+	if (!sheaf->size)
+		return;
+
+	stat_add(s, SHEAF_FLUSH_OTHER, sheaf->size);
+
+	__kmem_cache_free_bulk(s, sheaf->size, &sheaf->objects[0]);
+
+	sheaf->size = 0;
+}
+
+/*
+ * Caller needs to make sure migration is disabled in order to fully flush
+ * single cpu's sheaves
+ *
+ * flushing operations are rare so let's keep it simple and flush to slabs
+ * directly, skipping the barn
+ */
+static void pcs_flush_all(struct kmem_cache *s)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct slab_sheaf *spare, *rcu_free;
+	unsigned long flags;
+
+	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	spare = pcs->spare;
+	pcs->spare = NULL;
+
+	rcu_free = pcs->rcu_free;
+	pcs->rcu_free = NULL;
+
+	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+
+	if (spare) {
+		sheaf_flush(s, spare);
+		free_empty_sheaf(s, spare);
+	}
+
+	// TODO: handle rcu_free
+	BUG_ON(rcu_free);
+
+	sheaf_flush_main(s);
+}
+
+static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
+{
+	struct slub_percpu_sheaves *pcs;
+
+	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
+
+	if (pcs->spare) {
+		sheaf_flush(s, pcs->spare);
+		free_empty_sheaf(s, pcs->spare);
+		pcs->spare = NULL;
+	}
+
+	// TODO: handle rcu_free
+	BUG_ON(pcs->rcu_free);
+
+	sheaf_flush_main(s);
+}
+
+static void pcs_destroy(struct kmem_cache *s)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct slub_percpu_sheaves *pcs;
+
+		pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
+
+		/* can happen when unwinding failed create */
+		if (!pcs->main)
+			continue;
+
+		WARN_ON(pcs->spare);
+		WARN_ON(pcs->rcu_free);
+
+		if (!WARN_ON(pcs->main->size)) {
+			free_empty_sheaf(s, pcs->main);
+			pcs->main = NULL;
+		}
+	}
+
+	free_percpu(s->cpu_sheaves);
+	s->cpu_sheaves = NULL;
+}
+
+static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn)
+{
+	struct slab_sheaf *empty = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	if (barn->nr_empty) {
+		empty = list_first_entry(&barn->sheaves_empty,
+					 struct slab_sheaf, barn_list);
+		list_del(&empty->barn_list);
+		barn->nr_empty--;
+	}
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+
+	return empty;
+}
+
+static int barn_put_empty_sheaf(struct node_barn *barn,
+				struct slab_sheaf *sheaf, bool ignore_limit)
+{
+	unsigned long flags;
+	int ret = 0;
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	if (!ignore_limit && barn->nr_empty >= MAX_EMPTY_SHEAVES) {
+		ret = -E2BIG;
+	} else {
+		list_add(&sheaf->barn_list, &barn->sheaves_empty);
+		barn->nr_empty++;
+	}
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+	return ret;
+}
+
+static int barn_put_full_sheaf(struct node_barn *barn, struct slab_sheaf *sheaf,
+			       bool ignore_limit)
+{
+	unsigned long flags;
+	int ret = 0;
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	if (!ignore_limit && barn->nr_full >= MAX_FULL_SHEAVES) {
+		ret = -E2BIG;
+	} else {
+		list_add(&sheaf->barn_list, &barn->sheaves_full);
+		barn->nr_full++;
+	}
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+	return ret;
+}
+
+/*
+ * If a full sheaf is available, return it and put the supplied empty one to
+ * barn. We ignore the limit on empty sheaves as the number of sheaves doesn't
+ * change.
+ */
+static struct slab_sheaf *
+barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty)
+{
+	struct slab_sheaf *full = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	if (barn->nr_full) {
+		full = list_first_entry(&barn->sheaves_full, struct slab_sheaf,
+					barn_list);
+		list_del(&full->barn_list);
+		list_add(&empty->barn_list, &barn->sheaves_empty);
+		barn->nr_full--;
+		barn->nr_empty++;
+	}
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+
+	return full;
+}
+/*
+ * If a empty sheaf is available, return it and put the supplied full one to
+ * barn. But if there are too many full sheaves, reject this with -E2BIG.
+ */
+static struct slab_sheaf *
+barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
+{
+	struct slab_sheaf *empty;
+	unsigned long flags;
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	if (barn->nr_full >= MAX_FULL_SHEAVES) {
+		empty = ERR_PTR(-E2BIG);
+	} else if (!barn->nr_empty) {
+		empty = ERR_PTR(-ENOMEM);
+	} else {
+		empty = list_first_entry(&barn->sheaves_empty, struct slab_sheaf,
+					 barn_list);
+		list_del(&empty->barn_list);
+		list_add(&full->barn_list, &barn->sheaves_full);
+		barn->nr_empty--;
+		barn->nr_full++;
+	}
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+
+	return empty;
+}
+
+static void barn_init(struct node_barn *barn)
+{
+	spin_lock_init(&barn->lock);
+	INIT_LIST_HEAD(&barn->sheaves_full);
+	INIT_LIST_HEAD(&barn->sheaves_empty);
+	barn->nr_full = 0;
+	barn->nr_empty = 0;
+}
+
+static void barn_shrink(struct kmem_cache *s, struct node_barn *barn)
+{
+	struct list_head empty_list;
+	struct list_head full_list;
+	struct slab_sheaf *sheaf, *sheaf2;
+	unsigned long flags;
+
+	INIT_LIST_HEAD(&empty_list);
+	INIT_LIST_HEAD(&full_list);
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	list_splice_init(&barn->sheaves_full, &full_list);
+	barn->nr_full = 0;
+	list_splice_init(&barn->sheaves_empty, &empty_list);
+	barn->nr_empty = 0;
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+
+	list_for_each_entry_safe(sheaf, sheaf2, &full_list, barn_list) {
+		sheaf_flush(s, sheaf);
+		list_move(&sheaf->barn_list, &empty_list);
+	}
+
+	list_for_each_entry_safe(sheaf, sheaf2, &empty_list, barn_list)
+		free_empty_sheaf(s, sheaf);
+}
+
 /*
  * Slab allocation and freeing
  */
@@ -3280,11 +3668,42 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
 	put_partials_cpu(s, c);
 }
 
-struct slub_flush_work {
-	struct work_struct work;
-	struct kmem_cache *s;
-	bool skip;
-};
+static inline void flush_this_cpu_slab(struct kmem_cache *s)
+{
+	struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
+
+	if (c->slab)
+		flush_slab(s, c);
+
+	put_partials(s);
+}
+
+static bool has_cpu_slab(int cpu, struct kmem_cache *s)
+{
+	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
+
+	return c->slab || slub_percpu_partial(c);
+}
+
+#else /* CONFIG_SLUB_TINY */
+static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu) { }
+static inline bool has_cpu_slab(int cpu, struct kmem_cache *s) { return false; }
+static inline void flush_this_cpu_slab(struct kmem_cache *s) { }
+#endif /* CONFIG_SLUB_TINY */
+
+static bool has_pcs_used(int cpu, struct kmem_cache *s)
+{
+	struct slub_percpu_sheaves *pcs;
+
+	if (!s->cpu_sheaves)
+		return false;
+
+	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
+
+	return (pcs->spare || pcs->rcu_free || pcs->main->size);
+}
+
+static void pcs_flush_all(struct kmem_cache *s);
 
 /*
  * Flush cpu slab.
@@ -3294,30 +3713,18 @@ struct slub_flush_work {
 static void flush_cpu_slab(struct work_struct *w)
 {
 	struct kmem_cache *s;
-	struct kmem_cache_cpu *c;
 	struct slub_flush_work *sfw;
 
 	sfw = container_of(w, struct slub_flush_work, work);
 
 	s = sfw->s;
-	c = this_cpu_ptr(s->cpu_slab);
 
-	if (c->slab)
-		flush_slab(s, c);
-
-	put_partials(s);
-}
-
-static bool has_cpu_slab(int cpu, struct kmem_cache *s)
-{
-	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
+	if (s->cpu_sheaves)
+		pcs_flush_all(s);
 
-	return c->slab || slub_percpu_partial(c);
+	flush_this_cpu_slab(s);
 }
 
-static DEFINE_MUTEX(flush_lock);
-static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
-
 static void flush_all_cpus_locked(struct kmem_cache *s)
 {
 	struct slub_flush_work *sfw;
@@ -3328,7 +3735,7 @@ static void flush_all_cpus_locked(struct kmem_cache *s)
 
 	for_each_online_cpu(cpu) {
 		sfw = &per_cpu(slub_flush, cpu);
-		if (!has_cpu_slab(cpu, s)) {
+		if (!has_cpu_slab(cpu, s) && !has_pcs_used(cpu, s)) {
 			sfw->skip = true;
 			continue;
 		}
@@ -3364,19 +3771,14 @@ static int slub_cpu_dead(unsigned int cpu)
 	struct kmem_cache *s;
 
 	mutex_lock(&slab_mutex);
-	list_for_each_entry(s, &slab_caches, list)
+	list_for_each_entry(s, &slab_caches, list) {
 		__flush_cpu_slab(s, cpu);
+		__pcs_flush_all_cpu(s, cpu);
+	}
 	mutex_unlock(&slab_mutex);
 	return 0;
 }
 
-#else /* CONFIG_SLUB_TINY */
-static inline void flush_all_cpus_locked(struct kmem_cache *s) { }
-static inline void flush_all(struct kmem_cache *s) { }
-static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu) { }
-static inline int slub_cpu_dead(unsigned int cpu) { return 0; }
-#endif /* CONFIG_SLUB_TINY */
-
 /*
  * Check if the objects in a per cpu structure fit numa
  * locality expectations.
@@ -4126,6 +4528,173 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 	return memcg_slab_post_alloc_hook(s, lru, flags, size, p);
 }
 
+static __fastpath_inline
+void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
+{
+	struct slub_percpu_sheaves *pcs;
+	unsigned long flags;
+	void *object;
+
+	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (unlikely(pcs->main->size == 0)) {
+
+		struct slab_sheaf *empty = NULL;
+		struct slab_sheaf *full;
+		bool can_alloc;
+
+		if (pcs->spare && pcs->spare->size > 0) {
+			stat(s, SHEAF_SWAP);
+			swap(pcs->main, pcs->spare);
+			goto do_alloc;
+		}
+
+		full = barn_replace_empty_sheaf(pcs->barn, pcs->main);
+
+		if (full) {
+			pcs->main = full;
+			goto do_alloc;
+		}
+
+		can_alloc = gfpflags_allow_blocking(gfp);
+
+		if (can_alloc) {
+			if (pcs->spare) {
+				empty = pcs->spare;
+				pcs->spare = NULL;
+			} else {
+				empty = barn_get_empty_sheaf(pcs->barn);
+			}
+		}
+
+		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+
+		if (!can_alloc)
+			return NULL;
+
+		if (empty) {
+			if (!refill_sheaf(s, empty, gfp)) {
+				full = empty;
+			} else {
+				/*
+				 * we must be very low on memory so don't bother
+				 * with the barn
+				 */
+				free_empty_sheaf(s, empty);
+			}
+		} else {
+			full = alloc_full_sheaf(s, gfp);
+		}
+
+		if (!full)
+			return NULL;
+
+		local_lock_irqsave(&s->cpu_sheaves->lock, flags);
+		pcs = this_cpu_ptr(s->cpu_sheaves);
+
+		/*
+		 * If we are returning empty sheaf, we either got it from the
+		 * barn or had to allocate one. If we are returning a full
+		 * sheaf, it's due to racing or being migrated to a different
+		 * cpu. Breaching the barn's sheaf limits should be thus rare
+		 * enough so just ignore them to simplify the recovery.
+		 */
+
+		if (pcs->main->size == 0) {
+			barn_put_empty_sheaf(pcs->barn, pcs->main, true);
+			pcs->main = full;
+			goto do_alloc;
+		}
+
+		if (!pcs->spare) {
+			pcs->spare = full;
+			goto do_alloc;
+		}
+
+		if (pcs->spare->size == 0) {
+			barn_put_empty_sheaf(pcs->barn, pcs->spare, true);
+			pcs->spare = full;
+			goto do_alloc;
+		}
+
+		barn_put_full_sheaf(pcs->barn, full, true);
+	}
+
+do_alloc:
+	object = pcs->main->objects[--pcs->main->size];
+
+	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+
+	stat(s, ALLOC_PCS);
+
+	return object;
+}
+
+static __fastpath_inline
+unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct slab_sheaf *main;
+	unsigned long flags;
+	unsigned int allocated = 0;
+	unsigned int batch;
+
+next_batch:
+	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (unlikely(pcs->main->size == 0)) {
+
+		struct slab_sheaf *full;
+
+		if (pcs->spare && pcs->spare->size > 0) {
+			stat(s, SHEAF_SWAP);
+			swap(pcs->main, pcs->spare);
+			goto do_alloc;
+		}
+
+		full = barn_replace_empty_sheaf(pcs->barn, pcs->main);
+
+		if (full) {
+			pcs->main = full;
+			goto do_alloc;
+		}
+
+		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+
+		/*
+		 * Once full sheaves in barn are depleted, let the bulk
+		 * allocation continue from slab pages, otherwise we would just
+		 * be copying arrays of pointers twice.
+		 */
+		return allocated;
+	}
+
+do_alloc:
+
+	main = pcs->main;
+	batch = min(size, main->size);
+
+	main->size -= batch;
+	memcpy(p, main->objects + main->size, batch * sizeof(void *));
+
+	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+
+	stat_add(s, ALLOC_PCS, batch);
+
+	allocated += batch;
+
+	if (batch < size) {
+		p += batch;
+		size -= batch;
+		goto next_batch;
+	}
+
+	return allocated;
+}
+
+
 /*
  * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
  * have the fastpath folded into their functions. So no function call
@@ -4150,7 +4719,11 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
 	if (unlikely(object))
 		goto out;
 
-	object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
+	if (s->cpu_sheaves && (node == NUMA_NO_NODE))
+		object = alloc_from_pcs(s, gfpflags);
+
+	if (!object)
+		object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
 
 	maybe_wipe_obj_freeptr(s, object);
 	init = slab_want_init_on_alloc(gfpflags, s);
@@ -4521,6 +5094,196 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
 	discard_slab(s, slab);
 }
 
+/*
+ * Free an object to the percpu sheaves.
+ * The object is expected to have passed slab_free_hook() already.
+ */
+static __fastpath_inline
+void free_to_pcs(struct kmem_cache *s, void *object)
+{
+	struct slub_percpu_sheaves *pcs;
+	unsigned long flags;
+
+restart:
+	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (unlikely(pcs->main->size == s->sheaf_capacity)) {
+
+		struct slab_sheaf *empty;
+
+		if (!pcs->spare) {
+			empty = barn_get_empty_sheaf(pcs->barn);
+			if (empty) {
+				pcs->spare = pcs->main;
+				pcs->main = empty;
+				goto do_free;
+			}
+			goto alloc_empty;
+		}
+
+		if (pcs->spare->size < s->sheaf_capacity) {
+			stat(s, SHEAF_SWAP);
+			swap(pcs->main, pcs->spare);
+			goto do_free;
+		}
+
+		empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
+
+		if (!IS_ERR(empty)) {
+			pcs->main = empty;
+			goto do_free;
+		}
+
+		if (PTR_ERR(empty) == -E2BIG) {
+			/* Since we got here, spare exists and is full */
+			struct slab_sheaf *to_flush = pcs->spare;
+
+			pcs->spare = NULL;
+			local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+
+			sheaf_flush(s, to_flush);
+			empty = to_flush;
+			goto got_empty;
+		}
+
+alloc_empty:
+		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+
+		empty = alloc_empty_sheaf(s, GFP_NOWAIT);
+
+		if (!empty) {
+			sheaf_flush_main(s);
+			goto restart;
+		}
+
+got_empty:
+		local_lock_irqsave(&s->cpu_sheaves->lock, flags);
+		pcs = this_cpu_ptr(s->cpu_sheaves);
+
+		/*
+		 * if we put any sheaf to barn here, it's because we raced or
+		 * have been migrated to a different cpu, which should be rare
+		 * enough so just ignore the barn's limits to simplify
+		 */
+		if (unlikely(pcs->main->size < s->sheaf_capacity)) {
+			if (!pcs->spare)
+				pcs->spare = empty;
+			else
+				barn_put_empty_sheaf(pcs->barn, empty, true);
+			goto do_free;
+		}
+
+		if (!pcs->spare) {
+			pcs->spare = pcs->main;
+			pcs->main = empty;
+			goto do_free;
+		}
+
+		barn_put_full_sheaf(pcs->barn, pcs->main, true);
+		pcs->main = empty;
+	}
+
+do_free:
+	pcs->main->objects[pcs->main->size++] = object;
+
+	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+
+	stat(s, FREE_PCS);
+}
+
+/*
+ * Bulk free objects to the percpu sheaves.
+ * Unlike free_to_pcs() this includes the calls to all necessary hooks
+ * and the fallback to freeing to slab pages.
+ */
+static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct slab_sheaf *main;
+	unsigned long flags;
+	unsigned int batch, i = 0;
+	bool init;
+
+	init = slab_want_init_on_free(s);
+
+	while (i < size) {
+		struct slab *slab = virt_to_slab(p[i]);
+
+		memcg_slab_free_hook(s, slab, p + i, 1);
+		alloc_tagging_slab_free_hook(s, slab, p + i, 1);
+
+		if (unlikely(!slab_free_hook(s, p[i], init, false))) {
+			p[i] = p[--size];
+			if (!size)
+				return;
+			continue;
+		}
+
+		i++;
+	}
+
+next_batch:
+	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (unlikely(pcs->main->size == s->sheaf_capacity)) {
+
+		struct slab_sheaf *empty;
+
+		if (!pcs->spare) {
+			empty = barn_get_empty_sheaf(pcs->barn);
+			if (empty) {
+				pcs->spare = pcs->main;
+				pcs->main = empty;
+				goto do_free;
+			}
+			goto no_empty;
+		}
+
+		if (pcs->spare->size < s->sheaf_capacity) {
+			stat(s, SHEAF_SWAP);
+			swap(pcs->main, pcs->spare);
+			goto do_free;
+		}
+
+		empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
+
+		if (!IS_ERR(empty)) {
+			pcs->main = empty;
+			goto do_free;
+		}
+
+no_empty:
+		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+
+		/*
+		 * if we depleted all empty sheaves in the barn or there are too
+		 * many full sheaves, free the rest to slab pages
+		 */
+
+		__kmem_cache_free_bulk(s, size, p);
+		return;
+	}
+
+do_free:
+	main = pcs->main;
+	batch = min(size, s->sheaf_capacity - main->size);
+
+	memcpy(main->objects + main->size, p, batch * sizeof(void *));
+	main->size += batch;
+
+	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+
+	stat_add(s, FREE_PCS, batch);
+
+	if (batch < size) {
+		p += batch;
+		size -= batch;
+		goto next_batch;
+	}
+}
+
 #ifndef CONFIG_SLUB_TINY
 /*
  * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
@@ -4607,7 +5370,12 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
 	memcg_slab_free_hook(s, slab, &object, 1);
 	alloc_tagging_slab_free_hook(s, slab, &object, 1);
 
-	if (likely(slab_free_hook(s, object, slab_want_init_on_free(s), false)))
+	if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
+		return;
+
+	if (s->cpu_sheaves)
+		free_to_pcs(s, object);
+	else
 		do_slab_free(s, slab, object, object, 1, addr);
 }
 
@@ -5033,6 +5801,15 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 	if (!size)
 		return;
 
+	/*
+	 * freeing to sheaves is so incompatible with the detached freelist so
+	 * once we go that way, we have to do everything differently
+	 */
+	if (s && s->cpu_sheaves) {
+		free_to_pcs_bulk(s, size, p);
+		return;
+	}
+
 	do {
 		struct detached_freelist df;
 
@@ -5151,7 +5928,7 @@ static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
 int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
 				 void **p)
 {
-	int i;
+	unsigned int i = 0;
 
 	if (!size)
 		return 0;
@@ -5160,9 +5937,21 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
 	if (unlikely(!s))
 		return 0;
 
-	i = __kmem_cache_alloc_bulk(s, flags, size, p);
-	if (unlikely(i == 0))
-		return 0;
+	if (s->cpu_sheaves)
+		i = alloc_from_pcs_bulk(s, size, p);
+
+	if (i < size) {
+		unsigned int j = __kmem_cache_alloc_bulk(s, flags, size - i, p + i);
+		/*
+		 * If we ran out of memory, don't bother with freeing back to
+		 * the percpu sheaves, we have bigger problems.
+		 */
+		if (unlikely(j == 0)) {
+			if (i > 0)
+				__kmem_cache_free_bulk(s, i, p);
+			return 0;
+		}
+	}
 
 	/*
 	 * memcg and kmem_cache debug support and memory initialization.
@@ -5172,11 +5961,11 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
 		    slab_want_init_on_alloc(flags, s), s->object_size))) {
 		return 0;
 	}
-	return i;
+
+	return size;
 }
 EXPORT_SYMBOL(kmem_cache_alloc_bulk_noprof);
 
-
 /*
  * Object placement in a slab is made very easy because we always start at
  * offset 0. If we tune the size of the object to the alignment then we can
@@ -5309,8 +6098,8 @@ static inline int calculate_order(unsigned int size)
 	return -ENOSYS;
 }
 
-static void
-init_kmem_cache_node(struct kmem_cache_node *n)
+static bool
+init_kmem_cache_node(struct kmem_cache_node *n, struct node_barn *barn)
 {
 	n->nr_partial = 0;
 	spin_lock_init(&n->list_lock);
@@ -5320,6 +6109,11 @@ init_kmem_cache_node(struct kmem_cache_node *n)
 	atomic_long_set(&n->total_objects, 0);
 	INIT_LIST_HEAD(&n->full);
 #endif
+	n->barn = barn;
+	if (barn)
+		barn_init(barn);
+
+	return true;
 }
 
 #ifndef CONFIG_SLUB_TINY
@@ -5350,6 +6144,30 @@ static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
 }
 #endif /* CONFIG_SLUB_TINY */
 
+static int init_percpu_sheaves(struct kmem_cache *s)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct slub_percpu_sheaves *pcs;
+		int nid;
+
+		pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
+
+		local_lock_init(&pcs->lock);
+
+		nid = cpu_to_mem(cpu);
+
+		pcs->barn = get_node(s, nid)->barn;
+		pcs->main = alloc_empty_sheaf(s, GFP_KERNEL);
+
+		if (!pcs->main)
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
 static struct kmem_cache *kmem_cache_node;
 
 /*
@@ -5385,7 +6203,7 @@ static void early_kmem_cache_node_alloc(int node)
 	slab->freelist = get_freepointer(kmem_cache_node, n);
 	slab->inuse = 1;
 	kmem_cache_node->node[node] = n;
-	init_kmem_cache_node(n);
+	init_kmem_cache_node(n, NULL);
 	inc_slabs_node(kmem_cache_node, node, slab->objects);
 
 	/*
@@ -5401,6 +6219,13 @@ static void free_kmem_cache_nodes(struct kmem_cache *s)
 	struct kmem_cache_node *n;
 
 	for_each_kmem_cache_node(s, node, n) {
+		if (n->barn) {
+			WARN_ON(n->barn->nr_full);
+			WARN_ON(n->barn->nr_empty);
+			kfree(n->barn);
+			n->barn = NULL;
+		}
+
 		s->node[node] = NULL;
 		kmem_cache_free(kmem_cache_node, n);
 	}
@@ -5409,6 +6234,8 @@ static void free_kmem_cache_nodes(struct kmem_cache *s)
 void __kmem_cache_release(struct kmem_cache *s)
 {
 	cache_random_seq_destroy(s);
+	if (s->cpu_sheaves)
+		pcs_destroy(s);
 #ifndef CONFIG_SLUB_TINY
 	free_percpu(s->cpu_slab);
 #endif
@@ -5421,20 +6248,27 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
 
 	for_each_node_mask(node, slab_nodes) {
 		struct kmem_cache_node *n;
+		struct node_barn *barn = NULL;
 
 		if (slab_state == DOWN) {
 			early_kmem_cache_node_alloc(node);
 			continue;
 		}
+
+		if (s->cpu_sheaves) {
+			barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, node);
+
+			if (!barn)
+				return 0;
+		}
+
 		n = kmem_cache_alloc_node(kmem_cache_node,
 						GFP_KERNEL, node);
-
-		if (!n) {
-			free_kmem_cache_nodes(s);
+		if (!n)
 			return 0;
-		}
 
-		init_kmem_cache_node(n);
+		init_kmem_cache_node(n, barn);
+
 		s->node[node] = n;
 	}
 	return 1;
@@ -5690,6 +6524,8 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
 	flush_all_cpus_locked(s);
 	/* Attempt to free all objects */
 	for_each_kmem_cache_node(s, node, n) {
+		if (n->barn)
+			barn_shrink(s, n->barn);
 		free_partial(s, n);
 		if (n->nr_partial || node_nr_slabs(n))
 			return 1;
@@ -5893,6 +6729,9 @@ static int __kmem_cache_do_shrink(struct kmem_cache *s)
 		for (i = 0; i < SHRINK_PROMOTE_MAX; i++)
 			INIT_LIST_HEAD(promote + i);
 
+		if (n->barn)
+			barn_shrink(s, n->barn);
+
 		spin_lock_irqsave(&n->list_lock, flags);
 
 		/*
@@ -6005,12 +6844,24 @@ static int slab_mem_going_online_callback(void *arg)
 	 */
 	mutex_lock(&slab_mutex);
 	list_for_each_entry(s, &slab_caches, list) {
+		struct node_barn *barn = NULL;
+
 		/*
 		 * The structure may already exist if the node was previously
 		 * onlined and offlined.
 		 */
 		if (get_node(s, nid))
 			continue;
+
+		if (s->cpu_sheaves) {
+			barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, nid);
+
+			if (!barn) {
+				ret = -ENOMEM;
+				goto out;
+			}
+		}
+
 		/*
 		 * XXX: kmem_cache_alloc_node will fallback to other nodes
 		 *      since memory is not yet available from the node that
@@ -6021,7 +6872,9 @@ static int slab_mem_going_online_callback(void *arg)
 			ret = -ENOMEM;
 			goto out;
 		}
-		init_kmem_cache_node(n);
+
+		init_kmem_cache_node(n, barn);
+
 		s->node[nid] = n;
 	}
 	/*
@@ -6240,6 +7093,16 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
 
 	set_cpu_partial(s);
 
+	if (args->sheaf_capacity) {
+		s->cpu_sheaves = alloc_percpu(struct slub_percpu_sheaves);
+		if (!s->cpu_sheaves) {
+			err = -ENOMEM;
+			goto out;
+		}
+		// TODO: increase capacity to grow slab_sheaf up to next kmalloc size?
+		s->sheaf_capacity = args->sheaf_capacity;
+	}
+
 #ifdef CONFIG_NUMA
 	s->remote_node_defrag_ratio = 1000;
 #endif
@@ -6256,6 +7119,12 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
 	if (!alloc_kmem_cache_cpus(s))
 		goto out;
 
+	if (s->cpu_sheaves) {
+		err = init_percpu_sheaves(s);
+		if (err)
+			goto out;
+	}
+
 	err = 0;
 
 	/* Mutex is not taken during early boot */
@@ -6277,7 +7146,6 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
 		__kmem_cache_release(s);
 	return err;
 }
-
 #ifdef SLAB_SUPPORTS_SYSFS
 static int count_inuse(struct slab *slab)
 {
@@ -7055,8 +7923,10 @@ static ssize_t text##_store(struct kmem_cache *s,		\
 }								\
 SLAB_ATTR(text);						\
 
+STAT_ATTR(ALLOC_PCS, alloc_cpu_sheaf);
 STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
 STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
+STAT_ATTR(FREE_PCS, free_cpu_sheaf);
 STAT_ATTR(FREE_FASTPATH, free_fastpath);
 STAT_ATTR(FREE_SLOWPATH, free_slowpath);
 STAT_ATTR(FREE_FROZEN, free_frozen);
@@ -7081,6 +7951,12 @@ STAT_ATTR(CPU_PARTIAL_ALLOC, cpu_partial_alloc);
 STAT_ATTR(CPU_PARTIAL_FREE, cpu_partial_free);
 STAT_ATTR(CPU_PARTIAL_NODE, cpu_partial_node);
 STAT_ATTR(CPU_PARTIAL_DRAIN, cpu_partial_drain);
+STAT_ATTR(SHEAF_FLUSH_MAIN, sheaf_flush_main);
+STAT_ATTR(SHEAF_FLUSH_OTHER, sheaf_flush_other);
+STAT_ATTR(SHEAF_REFILL, sheaf_refill);
+STAT_ATTR(SHEAF_SWAP, sheaf_swap);
+STAT_ATTR(SHEAF_ALLOC, sheaf_alloc);
+STAT_ATTR(SHEAF_FREE, sheaf_free);
 #endif	/* CONFIG_SLUB_STATS */
 
 #ifdef CONFIG_KFENCE
@@ -7142,8 +8018,10 @@ static struct attribute *slab_attrs[] = {
 	&remote_node_defrag_ratio_attr.attr,
 #endif
 #ifdef CONFIG_SLUB_STATS
+	&alloc_cpu_sheaf_attr.attr,
 	&alloc_fastpath_attr.attr,
 	&alloc_slowpath_attr.attr,
+	&free_cpu_sheaf_attr.attr,
 	&free_fastpath_attr.attr,
 	&free_slowpath_attr.attr,
 	&free_frozen_attr.attr,
@@ -7168,6 +8046,12 @@ static struct attribute *slab_attrs[] = {
 	&cpu_partial_free_attr.attr,
 	&cpu_partial_node_attr.attr,
 	&cpu_partial_drain_attr.attr,
+	&sheaf_flush_main_attr.attr,
+	&sheaf_flush_other_attr.attr,
+	&sheaf_refill_attr.attr,
+	&sheaf_swap_attr.attr,
+	&sheaf_alloc_attr.attr,
+	&sheaf_free_attr.attr,
 #endif
 #ifdef CONFIG_FAILSLAB
 	&failslab_attr.attr,

-- 
2.48.1



^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH RFC v2 02/10] slab: add sheaf support for batching kfree_rcu() operations
  2025-02-14 16:27 [PATCH RFC v2 00/10] SLUB percpu sheaves Vlastimil Babka
  2025-02-14 16:27 ` [PATCH RFC v2 01/10] slab: add opt-in caching layer of " Vlastimil Babka
@ 2025-02-14 16:27 ` Vlastimil Babka
  2025-02-22 23:08   ` Suren Baghdasaryan
  2025-02-24  8:40   ` Harry Yoo
  2025-02-14 16:27 ` [PATCH RFC v2 03/10] locking/local_lock: Introduce localtry_lock_t Vlastimil Babka
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 55+ messages in thread
From: Vlastimil Babka @ 2025-02-14 16:27 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, Vlastimil Babka

Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
For caches with sheaves, on each cpu maintain a rcu_free sheaf in
addition to main and spare sheaves.

kfree_rcu() operations will try to put objects on this sheaf. Once full,
the sheaf is detached and submitted to call_rcu() with a handler that
will try to put in in the barn, or flush to slab pages using bulk free,
when the barn is full. Then a new empty sheaf must be obtained to put
more objects there.

It's possible that no free sheaves are available to use for a new
rcu_free sheaf, and the allocation in kfree_rcu() context can only use
GFP_NOWAIT and thus may fail. In that case, fall back to the existing
kfree_rcu() machinery.

Expected advantages:
- batching the kfree_rcu() operations, that could eventually replace the
  existing batching
- sheaves can be reused for allocations via barn instead of being
  flushed to slabs, which is more efficient
  - this includes cases where only some cpus are allowed to process rcu
    callbacks (Android)

Possible disadvantage:
- objects might be waiting for more than their grace period (it is
  determined by the last object freed into the sheaf), increasing memory
  usage - but the existing batching does that too?

Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
implementation favors smaller memory footprint over performance.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slab.h        |   2 +
 mm/slab_common.c |  21 ++++++++
 mm/slub.c        | 151 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 170 insertions(+), 4 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index 8daaec53b6ecfc44171191d421adb12e5cba2c58..94e9959e1aefa350d3d74e3f5309fde7a5cf2ec8 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -459,6 +459,8 @@ static inline bool is_kmalloc_normal(struct kmem_cache *s)
 	return !(s->flags & (SLAB_CACHE_DMA|SLAB_ACCOUNT|SLAB_RECLAIM_ACCOUNT));
 }
 
+bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj);
+
 /* Legal flag mask for kmem_cache_create(), for various configurations */
 #define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | \
 			 SLAB_CACHE_DMA32 | SLAB_PANIC | \
diff --git a/mm/slab_common.c b/mm/slab_common.c
index ceeefb287899a82f30ad79b403556001c1860311..c6853450ed74160cfcb497c09f92c1f9f7b12629 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1613,6 +1613,24 @@ static void kfree_rcu_work(struct work_struct *work)
 		kvfree_rcu_list(head);
 }
 
+static bool kfree_rcu_sheaf(void *obj)
+{
+	struct kmem_cache *s;
+	struct folio *folio;
+	struct slab *slab;
+
+	folio = virt_to_folio(obj);
+	if (unlikely(!folio_test_slab(folio)))
+		return false;
+
+	slab = folio_slab(folio);
+	s = slab->slab_cache;
+	if (s->cpu_sheaves)
+		return __kfree_rcu_sheaf(s, obj);
+
+	return false;
+}
+
 static bool
 need_offload_krc(struct kfree_rcu_cpu *krcp)
 {
@@ -1957,6 +1975,9 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
 	if (!head)
 		might_sleep();
 
+	if (kfree_rcu_sheaf(ptr))
+		return;
+
 	// Queue the object but don't yet schedule the batch.
 	if (debug_rcu_head_queue(ptr)) {
 		// Probable double kfree_rcu(), just leak.
diff --git a/mm/slub.c b/mm/slub.c
index c06734912972b799f537359f7fe6a750918ffe9e..40175747212fefb27137309b27571abe8d0966e2 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -350,6 +350,8 @@ enum stat_item {
 	ALLOC_FASTPATH,		/* Allocation from cpu slab */
 	ALLOC_SLOWPATH,		/* Allocation by getting a new cpu slab */
 	FREE_PCS,		/* Free to percpu sheaf */
+	FREE_RCU_SHEAF,		/* Free to rcu_free sheaf */
+	FREE_RCU_SHEAF_FAIL,	/* Failed to free to a rcu_free sheaf */
 	FREE_FASTPATH,		/* Free to cpu slab */
 	FREE_SLOWPATH,		/* Freeing not to cpu slab */
 	FREE_FROZEN,		/* Freeing to frozen slab */
@@ -2569,6 +2571,24 @@ static void sheaf_flush(struct kmem_cache *s, struct slab_sheaf *sheaf)
 	sheaf->size = 0;
 }
 
+static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
+				     struct slab_sheaf *sheaf);
+
+static void rcu_free_sheaf_nobarn(struct rcu_head *head)
+{
+	struct slab_sheaf *sheaf;
+	struct kmem_cache *s;
+
+	sheaf = container_of(head, struct slab_sheaf, rcu_head);
+	s = sheaf->cache;
+
+	__rcu_free_sheaf_prepare(s, sheaf);
+
+	sheaf_flush(s, sheaf);
+
+	free_empty_sheaf(s, sheaf);
+}
+
 /*
  * Caller needs to make sure migration is disabled in order to fully flush
  * single cpu's sheaves
@@ -2598,8 +2618,8 @@ static void pcs_flush_all(struct kmem_cache *s)
 		free_empty_sheaf(s, spare);
 	}
 
-	// TODO: handle rcu_free
-	BUG_ON(rcu_free);
+	if (rcu_free)
+		call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
 
 	sheaf_flush_main(s);
 }
@@ -2616,8 +2636,10 @@ static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
 		pcs->spare = NULL;
 	}
 
-	// TODO: handle rcu_free
-	BUG_ON(pcs->rcu_free);
+	if (pcs->rcu_free) {
+		call_rcu(&pcs->rcu_free->rcu_head, rcu_free_sheaf_nobarn);
+		pcs->rcu_free = NULL;
+	}
 
 	sheaf_flush_main(s);
 }
@@ -5192,6 +5214,118 @@ void free_to_pcs(struct kmem_cache *s, void *object)
 	stat(s, FREE_PCS);
 }
 
+static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
+				     struct slab_sheaf *sheaf)
+{
+	bool init = slab_want_init_on_free(s);
+	void **p = &sheaf->objects[0];
+	unsigned int i = 0;
+
+	while (i < sheaf->size) {
+		struct slab *slab = virt_to_slab(p[i]);
+
+		memcg_slab_free_hook(s, slab, p + i, 1);
+		alloc_tagging_slab_free_hook(s, slab, p + i, 1);
+
+		if (unlikely(!slab_free_hook(s, p[i], init, false))) {
+			p[i] = p[--sheaf->size];
+			continue;
+		}
+
+		i++;
+	}
+}
+
+static void rcu_free_sheaf(struct rcu_head *head)
+{
+	struct slab_sheaf *sheaf;
+	struct node_barn *barn;
+	struct kmem_cache *s;
+
+	sheaf = container_of(head, struct slab_sheaf, rcu_head);
+
+	s = sheaf->cache;
+
+	__rcu_free_sheaf_prepare(s, sheaf);
+
+	barn = get_node(s, numa_mem_id())->barn;
+
+	/* due to slab_free_hook() */
+	if (unlikely(sheaf->size == 0))
+		goto empty;
+
+	if (!barn_put_full_sheaf(barn, sheaf, false))
+		return;
+
+	sheaf_flush(s, sheaf);
+
+empty:
+	if (!barn_put_empty_sheaf(barn, sheaf, false))
+		return;
+
+	free_empty_sheaf(s, sheaf);
+}
+
+bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct slab_sheaf *rcu_sheaf;
+	unsigned long flags;
+
+	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (unlikely(!pcs->rcu_free)) {
+
+		struct slab_sheaf *empty;
+
+		empty = barn_get_empty_sheaf(pcs->barn);
+
+		if (empty) {
+			pcs->rcu_free = empty;
+			goto do_free;
+		}
+
+		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+
+		empty = alloc_empty_sheaf(s, GFP_NOWAIT);
+
+		if (!empty) {
+			stat(s, FREE_RCU_SHEAF_FAIL);
+			return false;
+		}
+
+		local_lock_irqsave(&s->cpu_sheaves->lock, flags);
+		pcs = this_cpu_ptr(s->cpu_sheaves);
+
+		if (unlikely(pcs->rcu_free))
+			barn_put_empty_sheaf(pcs->barn, empty, true);
+		else
+			pcs->rcu_free = empty;
+	}
+
+do_free:
+
+	rcu_sheaf = pcs->rcu_free;
+
+	rcu_sheaf->objects[rcu_sheaf->size++] = obj;
+
+	if (likely(rcu_sheaf->size < s->sheaf_capacity)) {
+		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+		stat(s, FREE_RCU_SHEAF);
+		return true;
+	}
+
+	pcs->rcu_free = NULL;
+	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+
+	call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);
+
+	stat(s, FREE_RCU_SHEAF);
+
+	return true;
+}
+
 /*
  * Bulk free objects to the percpu sheaves.
  * Unlike free_to_pcs() this includes the calls to all necessary hooks
@@ -6522,6 +6656,11 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
 	struct kmem_cache_node *n;
 
 	flush_all_cpus_locked(s);
+
+	/* we might have rcu sheaves in flight */
+	if (s->cpu_sheaves)
+		rcu_barrier();
+
 	/* Attempt to free all objects */
 	for_each_kmem_cache_node(s, node, n) {
 		if (n->barn)
@@ -7927,6 +8066,8 @@ STAT_ATTR(ALLOC_PCS, alloc_cpu_sheaf);
 STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
 STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
 STAT_ATTR(FREE_PCS, free_cpu_sheaf);
+STAT_ATTR(FREE_RCU_SHEAF, free_rcu_sheaf);
+STAT_ATTR(FREE_RCU_SHEAF_FAIL, free_rcu_sheaf_fail);
 STAT_ATTR(FREE_FASTPATH, free_fastpath);
 STAT_ATTR(FREE_SLOWPATH, free_slowpath);
 STAT_ATTR(FREE_FROZEN, free_frozen);
@@ -8022,6 +8163,8 @@ static struct attribute *slab_attrs[] = {
 	&alloc_fastpath_attr.attr,
 	&alloc_slowpath_attr.attr,
 	&free_cpu_sheaf_attr.attr,
+	&free_rcu_sheaf_attr.attr,
+	&free_rcu_sheaf_fail_attr.attr,
 	&free_fastpath_attr.attr,
 	&free_slowpath_attr.attr,
 	&free_frozen_attr.attr,

-- 
2.48.1



^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH RFC v2 03/10] locking/local_lock: Introduce localtry_lock_t
  2025-02-14 16:27 [PATCH RFC v2 00/10] SLUB percpu sheaves Vlastimil Babka
  2025-02-14 16:27 ` [PATCH RFC v2 01/10] slab: add opt-in caching layer of " Vlastimil Babka
  2025-02-14 16:27 ` [PATCH RFC v2 02/10] slab: add sheaf support for batching kfree_rcu() operations Vlastimil Babka
@ 2025-02-14 16:27 ` Vlastimil Babka
  2025-02-17 14:19   ` Sebastian Andrzej Siewior
  2025-02-26 17:00   ` Davidlohr Bueso
  2025-02-14 16:27 ` [PATCH RFC v2 04/10] locking/local_lock: add localtry_trylock() Vlastimil Babka
                   ` (8 subsequent siblings)
  11 siblings, 2 replies; 55+ messages in thread
From: Vlastimil Babka @ 2025-02-14 16:27 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, Sebastian Andrzej Siewior,
	Alexei Starovoitov, Vlastimil Babka

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

In !PREEMPT_RT local_lock_irqsave() disables interrupts to protect
critical section, but it doesn't prevent NMI, so the fully reentrant
code cannot use local_lock_irqsave() for exclusive access.

Introduce localtry_lock_t and localtry_lock_irqsave() that
disables interrupts and sets acquired=1, so localtry_lock_irqsave()
from NMI attempting to acquire the same lock will return false.

In PREEMPT_RT local_lock_irqsave() maps to preemptible spin_lock().
Map localtry_lock_irqsave() to preemptible spin_trylock().
When in hard IRQ or NMI return false right away, since
spin_trylock() is not safe due to PI issues.

Note there is no need to use local_inc for acquired variable,
since it's a percpu variable with strict nesting scopes.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/local_lock.h          |  59 +++++++++++++++++
 include/linux/local_lock_internal.h | 123 ++++++++++++++++++++++++++++++++++++
 2 files changed, 182 insertions(+)

diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
index 091dc0b6bdfb9f4721f94d19828a38fbfa59346c..05c254a5d7d3e6db64d7f81a3a4a10f5a942c29e 100644
--- a/include/linux/local_lock.h
+++ b/include/linux/local_lock.h
@@ -51,6 +51,65 @@
 #define local_unlock_irqrestore(lock, flags)			\
 	__local_unlock_irqrestore(lock, flags)
 
+/**
+ * localtry_lock_init - Runtime initialize a lock instance
+ */
+#define localtry_lock_init(lock)		__localtry_lock_init(lock)
+
+/**
+ * localtry_lock - Acquire a per CPU local lock
+ * @lock:	The lock variable
+ */
+#define localtry_lock(lock)		__localtry_lock(lock)
+
+/**
+ * localtry_lock_irq - Acquire a per CPU local lock and disable interrupts
+ * @lock:	The lock variable
+ */
+#define localtry_lock_irq(lock)		__localtry_lock_irq(lock)
+
+/**
+ * localtry_lock_irqsave - Acquire a per CPU local lock, save and disable
+ *			 interrupts
+ * @lock:	The lock variable
+ * @flags:	Storage for interrupt flags
+ */
+#define localtry_lock_irqsave(lock, flags)				\
+	__localtry_lock_irqsave(lock, flags)
+
+/**
+ * localtry_trylock_irqsave - Try to acquire a per CPU local lock, save and disable
+ *			      interrupts if acquired
+ * @lock:	The lock variable
+ * @flags:	Storage for interrupt flags
+ *
+ * The function can be used in any context such as NMI or HARDIRQ. Due to
+ * locking constrains it will _always_ fail to acquire the lock on PREEMPT_RT.
+ */
+#define localtry_trylock_irqsave(lock, flags)				\
+	__localtry_trylock_irqsave(lock, flags)
+
+/**
+ * local_unlock - Release a per CPU local lock
+ * @lock:	The lock variable
+ */
+#define localtry_unlock(lock)		__localtry_unlock(lock)
+
+/**
+ * local_unlock_irq - Release a per CPU local lock and enable interrupts
+ * @lock:	The lock variable
+ */
+#define localtry_unlock_irq(lock)		__localtry_unlock_irq(lock)
+
+/**
+ * localtry_unlock_irqrestore - Release a per CPU local lock and restore
+ *			      interrupt flags
+ * @lock:	The lock variable
+ * @flags:      Interrupt flags to restore
+ */
+#define localtry_unlock_irqrestore(lock, flags)			\
+	__localtry_unlock_irqrestore(lock, flags)
+
 DEFINE_GUARD(local_lock, local_lock_t __percpu*,
 	     local_lock(_T),
 	     local_unlock(_T))
diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index 8dd71fbbb6d2b6748969438c4642f7d970834871..c1369b300777d3ff3700cfd8bd4de8186124f036 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -15,6 +15,11 @@ typedef struct {
 #endif
 } local_lock_t;
 
+typedef struct {
+	local_lock_t	llock;
+	unsigned int	acquired;
+} localtry_lock_t;
+
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 # define LOCAL_LOCK_DEBUG_INIT(lockname)		\
 	.dep_map = {					\
@@ -31,6 +36,13 @@ static inline void local_lock_acquire(local_lock_t *l)
 	l->owner = current;
 }
 
+static inline void local_trylock_acquire(local_lock_t *l)
+{
+	lock_map_acquire_try(&l->dep_map);
+	DEBUG_LOCKS_WARN_ON(l->owner);
+	l->owner = current;
+}
+
 static inline void local_lock_release(local_lock_t *l)
 {
 	DEBUG_LOCKS_WARN_ON(l->owner != current);
@@ -45,11 +57,13 @@ static inline void local_lock_debug_init(local_lock_t *l)
 #else /* CONFIG_DEBUG_LOCK_ALLOC */
 # define LOCAL_LOCK_DEBUG_INIT(lockname)
 static inline void local_lock_acquire(local_lock_t *l) { }
+static inline void local_trylock_acquire(local_lock_t *l) { }
 static inline void local_lock_release(local_lock_t *l) { }
 static inline void local_lock_debug_init(local_lock_t *l) { }
 #endif /* !CONFIG_DEBUG_LOCK_ALLOC */
 
 #define INIT_LOCAL_LOCK(lockname)	{ LOCAL_LOCK_DEBUG_INIT(lockname) }
+#define INIT_LOCALTRY_LOCK(lockname)	{ .llock = { LOCAL_LOCK_DEBUG_INIT(lockname.llock) }}
 
 #define __local_lock_init(lock)					\
 do {								\
@@ -118,6 +132,86 @@ do {								\
 #define __local_unlock_nested_bh(lock)				\
 	local_lock_release(this_cpu_ptr(lock))
 
+/* localtry_lock_t variants */
+
+#define __localtry_lock_init(lock)				\
+do {								\
+	__local_lock_init(&(lock)->llock);			\
+	WRITE_ONCE(&(lock)->acquired, 0);			\
+} while (0)
+
+#define __localtry_lock(lock)					\
+	do {							\
+		localtry_lock_t *lt;				\
+		preempt_disable();				\
+		lt = this_cpu_ptr(lock);			\
+		local_lock_acquire(&lt->llock);			\
+		WRITE_ONCE(lt->acquired, 1);			\
+	} while (0)
+
+#define __localtry_lock_irq(lock)				\
+	do {							\
+		localtry_lock_t *lt;				\
+		local_irq_disable();				\
+		lt = this_cpu_ptr(lock);			\
+		local_lock_acquire(&lt->llock);			\
+		WRITE_ONCE(lt->acquired, 1);			\
+	} while (0)
+
+#define __localtry_lock_irqsave(lock, flags)			\
+	do {							\
+		localtry_lock_t *lt;				\
+		local_irq_save(flags);				\
+		lt = this_cpu_ptr(lock);			\
+		local_lock_acquire(&lt->llock);			\
+		WRITE_ONCE(lt->acquired, 1);			\
+	} while (0)
+
+#define __localtry_trylock_irqsave(lock, flags)			\
+	({							\
+		localtry_lock_t *lt;				\
+		bool _ret;					\
+								\
+		local_irq_save(flags);				\
+		lt = this_cpu_ptr(lock);			\
+		if (!READ_ONCE(lt->acquired)) {			\
+			WRITE_ONCE(lt->acquired, 1);		\
+			local_trylock_acquire(&lt->llock);	\
+			_ret = true;				\
+		} else {					\
+			_ret = false;				\
+			local_irq_restore(flags);		\
+		}						\
+		_ret;						\
+	})
+
+#define __localtry_unlock(lock)					\
+	do {							\
+		localtry_lock_t *lt;				\
+		lt = this_cpu_ptr(lock);			\
+		WRITE_ONCE(lt->acquired, 0);			\
+		local_lock_release(&lt->llock);			\
+		preempt_enable();				\
+	} while (0)
+
+#define __localtry_unlock_irq(lock)				\
+	do {							\
+		localtry_lock_t *lt;				\
+		lt = this_cpu_ptr(lock);			\
+		WRITE_ONCE(lt->acquired, 0);			\
+		local_lock_release(&lt->llock);			\
+		local_irq_enable();				\
+	} while (0)
+
+#define __localtry_unlock_irqrestore(lock, flags)		\
+	do {							\
+		localtry_lock_t *lt;				\
+		lt = this_cpu_ptr(lock);			\
+		WRITE_ONCE(lt->acquired, 0);			\
+		local_lock_release(&lt->llock);			\
+		local_irq_restore(flags);			\
+	} while (0)
+
 #else /* !CONFIG_PREEMPT_RT */
 
 /*
@@ -125,8 +219,10 @@ do {								\
  * critical section while staying preemptible.
  */
 typedef spinlock_t local_lock_t;
+typedef spinlock_t localtry_lock_t;
 
 #define INIT_LOCAL_LOCK(lockname) __LOCAL_SPIN_LOCK_UNLOCKED((lockname))
+#define INIT_LOCALTRY_LOCK(lockname) INIT_LOCAL_LOCK(lockname)
 
 #define __local_lock_init(l)					\
 	do {							\
@@ -169,4 +265,31 @@ do {								\
 	spin_unlock(this_cpu_ptr((lock)));			\
 } while (0)
 
+/* localtry_lock_t variants */
+
+#define __localtry_lock_init(lock)			__local_lock_init(lock)
+#define __localtry_lock(lock)				__local_lock(lock)
+#define __localtry_lock_irq(lock)			__local_lock(lock)
+#define __localtry_lock_irqsave(lock, flags)		__local_lock_irqsave(lock, flags)
+#define __localtry_unlock(lock)				__local_unlock(lock)
+#define __localtry_unlock_irq(lock)			__local_unlock(lock)
+#define __localtry_unlock_irqrestore(lock, flags)	__local_unlock_irqrestore(lock, flags)
+
+#define __localtry_trylock_irqsave(lock, flags)			\
+	({							\
+		int __locked;					\
+								\
+		typecheck(unsigned long, flags);		\
+		flags = 0;					\
+		if (in_nmi() | in_hardirq()) {			\
+			__locked = 0;				\
+		} else {					\
+			migrate_disable();			\
+			__locked = spin_trylock(this_cpu_ptr((lock)));	\
+			if (!__locked)				\
+				migrate_enable();		\
+		}						\
+		__locked;					\
+	})
+
 #endif /* CONFIG_PREEMPT_RT */

-- 
2.48.1



^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH RFC v2 04/10] locking/local_lock: add localtry_trylock()
  2025-02-14 16:27 [PATCH RFC v2 00/10] SLUB percpu sheaves Vlastimil Babka
                   ` (2 preceding siblings ...)
  2025-02-14 16:27 ` [PATCH RFC v2 03/10] locking/local_lock: Introduce localtry_lock_t Vlastimil Babka
@ 2025-02-14 16:27 ` Vlastimil Babka
  2025-02-14 16:27 ` [PATCH RFC v2 05/10] slab: switch percpu sheaves locking to localtry_lock Vlastimil Babka
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 55+ messages in thread
From: Vlastimil Babka @ 2025-02-14 16:27 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, Vlastimil Babka

Add a localtry_trylock() variant without _irqsave that will be used in
slab sheaves implementation. Thanks to only disabling preemption and not
irqs, it has a lower overhead. It's not necessary to disable irqs to
avoid a deadlock if the irq context uses trylock and can handle
failures.

Also make the comment of localtry_trylock() more clear, and fix a
compilation failure in localtry_lock_init().

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/local_lock.h          | 13 ++++++++++++-
 include/linux/local_lock_internal.h | 31 +++++++++++++++++++++++++++----
 2 files changed, 39 insertions(+), 5 deletions(-)

diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
index 05c254a5d7d3e6db64d7f81a3a4a10f5a942c29e..1a0bc35839e360d5c8170105849c3883463852f8 100644
--- a/include/linux/local_lock.h
+++ b/include/linux/local_lock.h
@@ -77,6 +77,16 @@
 #define localtry_lock_irqsave(lock, flags)				\
 	__localtry_lock_irqsave(lock, flags)
 
+/**
+ * localtry_trylock - Try to acquire a per CPU local lock.
+ * @lock:	The lock variable
+ *
+ * The function can be used in any context such as NMI or HARDIRQ. Due to
+ * locking constrains it will _always_ fail to acquire the lock in NMI or
+ * HARDIRQ context on PREEMPT_RT.
+ */
+#define localtry_trylock(lock)		__localtry_trylock(lock)
+
 /**
  * localtry_trylock_irqsave - Try to acquire a per CPU local lock, save and disable
  *			      interrupts if acquired
@@ -84,7 +94,8 @@
  * @flags:	Storage for interrupt flags
  *
  * The function can be used in any context such as NMI or HARDIRQ. Due to
- * locking constrains it will _always_ fail to acquire the lock on PREEMPT_RT.
+ * locking constrains it will _always_ fail to acquire the lock in NMI or
+ * HARDIRQ context on PREEMPT_RT.
  */
 #define localtry_trylock_irqsave(lock, flags)				\
 	__localtry_trylock_irqsave(lock, flags)
diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index c1369b300777d3ff3700cfd8bd4de8186124f036..67bd13d142fac39bc0f8b2c05eaf81717ff480f9 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -137,7 +137,7 @@ do {								\
 #define __localtry_lock_init(lock)				\
 do {								\
 	__local_lock_init(&(lock)->llock);			\
-	WRITE_ONCE(&(lock)->acquired, 0);			\
+	WRITE_ONCE((lock)->acquired, 0);			\
 } while (0)
 
 #define __localtry_lock(lock)					\
@@ -167,6 +167,24 @@ do {								\
 		WRITE_ONCE(lt->acquired, 1);			\
 	} while (0)
 
+#define __localtry_trylock(lock)				\
+	({							\
+		localtry_lock_t *lt;				\
+		bool _ret;					\
+								\
+		preempt_disable();				\
+		lt = this_cpu_ptr(lock);			\
+		if (!READ_ONCE(lt->acquired)) {			\
+			WRITE_ONCE(lt->acquired, 1);		\
+			local_trylock_acquire(&lt->llock);	\
+			_ret = true;				\
+		} else {					\
+			_ret = false;				\
+			preempt_enable();			\
+		}						\
+		_ret;						\
+	})
+
 #define __localtry_trylock_irqsave(lock, flags)			\
 	({							\
 		localtry_lock_t *lt;				\
@@ -275,12 +293,10 @@ do {								\
 #define __localtry_unlock_irq(lock)			__local_unlock(lock)
 #define __localtry_unlock_irqrestore(lock, flags)	__local_unlock_irqrestore(lock, flags)
 
-#define __localtry_trylock_irqsave(lock, flags)			\
+#define __localtry_trylock(lock)				\
 	({							\
 		int __locked;					\
 								\
-		typecheck(unsigned long, flags);		\
-		flags = 0;					\
 		if (in_nmi() | in_hardirq()) {			\
 			__locked = 0;				\
 		} else {					\
@@ -292,4 +308,11 @@ do {								\
 		__locked;					\
 	})
 
+#define __localtry_trylock_irqsave(lock, flags)			\
+	({							\
+		typecheck(unsigned long, flags);		\
+		flags = 0;					\
+		__localtry_trylock(lock);			\
+	})
+
 #endif /* CONFIG_PREEMPT_RT */

-- 
2.48.1



^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH RFC v2 05/10] slab: switch percpu sheaves locking to localtry_lock
  2025-02-14 16:27 [PATCH RFC v2 00/10] SLUB percpu sheaves Vlastimil Babka
                   ` (3 preceding siblings ...)
  2025-02-14 16:27 ` [PATCH RFC v2 04/10] locking/local_lock: add localtry_trylock() Vlastimil Babka
@ 2025-02-14 16:27 ` Vlastimil Babka
  2025-02-23  2:33   ` Suren Baghdasaryan
  2025-02-24 13:08   ` Harry Yoo
  2025-02-14 16:27 ` [PATCH RFC v2 06/10] slab: sheaf prefilling for guaranteed allocations Vlastimil Babka
                   ` (6 subsequent siblings)
  11 siblings, 2 replies; 55+ messages in thread
From: Vlastimil Babka @ 2025-02-14 16:27 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, Vlastimil Babka

Instead of local_lock_irqsave(), use localtry_trylock() when potential
callers include irq context, and localtry_lock() otherwise (such as when
we already know the gfp flags allow blocking).

This should reduce the locking (due to irq disabling/enabling) overhead.
Failing to use percpu sheaves in an irq due to preempting an already
locked user of sheaves should be rare so it's a favorable tradeoff.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slub.c | 122 ++++++++++++++++++++++++++++++++++++++------------------------
 1 file changed, 76 insertions(+), 46 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 40175747212fefb27137309b27571abe8d0966e2..3d7345e7e938d53950ed0d6abe8eb0e93cf8f5b1 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -450,7 +450,7 @@ struct slab_sheaf {
 };
 
 struct slub_percpu_sheaves {
-	local_lock_t lock;
+	localtry_lock_t lock;
 	struct slab_sheaf *main; /* never NULL when unlocked */
 	struct slab_sheaf *spare; /* empty or full, may be NULL */
 	struct slab_sheaf *rcu_free;
@@ -2529,16 +2529,19 @@ static struct slab_sheaf *alloc_full_sheaf(struct kmem_cache *s, gfp_t gfp)
 
 static void __kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);
 
-static void sheaf_flush_main(struct kmem_cache *s)
+/* returns true if at least partially flushed */
+static bool sheaf_flush_main(struct kmem_cache *s)
 {
 	struct slub_percpu_sheaves *pcs;
 	unsigned int batch, remaining;
 	void *objects[PCS_BATCH_MAX];
 	struct slab_sheaf *sheaf;
-	unsigned long flags;
+	bool ret = false;
 
 next_batch:
-	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
+	if (!localtry_trylock(&s->cpu_sheaves->lock))
+		return ret;
+
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 	sheaf = pcs->main;
 
@@ -2549,14 +2552,18 @@ static void sheaf_flush_main(struct kmem_cache *s)
 
 	remaining = sheaf->size;
 
-	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+	localtry_unlock(&s->cpu_sheaves->lock);
 
 	__kmem_cache_free_bulk(s, batch, &objects[0]);
 
 	stat_add(s, SHEAF_FLUSH_MAIN, batch);
 
+	ret = true;
+
 	if (remaining)
 		goto next_batch;
+
+	return ret;
 }
 
 static void sheaf_flush(struct kmem_cache *s, struct slab_sheaf *sheaf)
@@ -2593,6 +2600,8 @@ static void rcu_free_sheaf_nobarn(struct rcu_head *head)
  * Caller needs to make sure migration is disabled in order to fully flush
  * single cpu's sheaves
  *
+ * must not be called from an irq
+ *
  * flushing operations are rare so let's keep it simple and flush to slabs
  * directly, skipping the barn
  */
@@ -2600,9 +2609,8 @@ static void pcs_flush_all(struct kmem_cache *s)
 {
 	struct slub_percpu_sheaves *pcs;
 	struct slab_sheaf *spare, *rcu_free;
-	unsigned long flags;
 
-	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
+	localtry_lock(&s->cpu_sheaves->lock);
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 
 	spare = pcs->spare;
@@ -2611,7 +2619,7 @@ static void pcs_flush_all(struct kmem_cache *s)
 	rcu_free = pcs->rcu_free;
 	pcs->rcu_free = NULL;
 
-	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+	localtry_unlock(&s->cpu_sheaves->lock);
 
 	if (spare) {
 		sheaf_flush(s, spare);
@@ -4554,10 +4562,11 @@ static __fastpath_inline
 void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
 {
 	struct slub_percpu_sheaves *pcs;
-	unsigned long flags;
 	void *object;
 
-	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
+	if (!localtry_trylock(&s->cpu_sheaves->lock))
+		return NULL;
+
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 
 	if (unlikely(pcs->main->size == 0)) {
@@ -4590,7 +4599,7 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
 			}
 		}
 
-		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+		localtry_unlock(&s->cpu_sheaves->lock);
 
 		if (!can_alloc)
 			return NULL;
@@ -4612,7 +4621,11 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
 		if (!full)
 			return NULL;
 
-		local_lock_irqsave(&s->cpu_sheaves->lock, flags);
+		/*
+		 * we can reach here only when gfpflags_allow_blocking
+		 * so this must not be an irq
+		 */
+		localtry_lock(&s->cpu_sheaves->lock);
 		pcs = this_cpu_ptr(s->cpu_sheaves);
 
 		/*
@@ -4646,7 +4659,7 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
 do_alloc:
 	object = pcs->main->objects[--pcs->main->size];
 
-	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+	localtry_unlock(&s->cpu_sheaves->lock);
 
 	stat(s, ALLOC_PCS);
 
@@ -4658,12 +4671,13 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 {
 	struct slub_percpu_sheaves *pcs;
 	struct slab_sheaf *main;
-	unsigned long flags;
 	unsigned int allocated = 0;
 	unsigned int batch;
 
 next_batch:
-	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
+	if (!localtry_trylock(&s->cpu_sheaves->lock))
+		return allocated;
+
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 
 	if (unlikely(pcs->main->size == 0)) {
@@ -4683,7 +4697,7 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 			goto do_alloc;
 		}
 
-		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+		localtry_unlock(&s->cpu_sheaves->lock);
 
 		/*
 		 * Once full sheaves in barn are depleted, let the bulk
@@ -4701,7 +4715,7 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 	main->size -= batch;
 	memcpy(p, main->objects + main->size, batch * sizeof(void *));
 
-	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+	localtry_unlock(&s->cpu_sheaves->lock);
 
 	stat_add(s, ALLOC_PCS, batch);
 
@@ -5121,13 +5135,14 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
  * The object is expected to have passed slab_free_hook() already.
  */
 static __fastpath_inline
-void free_to_pcs(struct kmem_cache *s, void *object)
+bool free_to_pcs(struct kmem_cache *s, void *object)
 {
 	struct slub_percpu_sheaves *pcs;
-	unsigned long flags;
 
 restart:
-	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
+	if (!localtry_trylock(&s->cpu_sheaves->lock))
+		return false;
+
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 
 	if (unlikely(pcs->main->size == s->sheaf_capacity)) {
@@ -5162,7 +5177,7 @@ void free_to_pcs(struct kmem_cache *s, void *object)
 			struct slab_sheaf *to_flush = pcs->spare;
 
 			pcs->spare = NULL;
-			local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+			localtry_unlock(&s->cpu_sheaves->lock);
 
 			sheaf_flush(s, to_flush);
 			empty = to_flush;
@@ -5170,17 +5185,27 @@ void free_to_pcs(struct kmem_cache *s, void *object)
 		}
 
 alloc_empty:
-		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+		localtry_unlock(&s->cpu_sheaves->lock);
 
 		empty = alloc_empty_sheaf(s, GFP_NOWAIT);
 
 		if (!empty) {
-			sheaf_flush_main(s);
-			goto restart;
+			if (sheaf_flush_main(s))
+				goto restart;
+			else
+				return false;
 		}
 
 got_empty:
-		local_lock_irqsave(&s->cpu_sheaves->lock, flags);
+		if (!localtry_trylock(&s->cpu_sheaves->lock)) {
+			struct node_barn *barn;
+
+			barn = get_node(s, numa_mem_id())->barn;
+
+			barn_put_empty_sheaf(barn, empty, true);
+			return false;
+		}
+
 		pcs = this_cpu_ptr(s->cpu_sheaves);
 
 		/*
@@ -5209,9 +5234,11 @@ void free_to_pcs(struct kmem_cache *s, void *object)
 do_free:
 	pcs->main->objects[pcs->main->size++] = object;
 
-	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+	localtry_unlock(&s->cpu_sheaves->lock);
 
 	stat(s, FREE_PCS);
+
+	return true;
 }
 
 static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
@@ -5270,9 +5297,10 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
 {
 	struct slub_percpu_sheaves *pcs;
 	struct slab_sheaf *rcu_sheaf;
-	unsigned long flags;
 
-	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
+	if (!localtry_trylock(&s->cpu_sheaves->lock))
+		goto fail;
+
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 
 	if (unlikely(!pcs->rcu_free)) {
@@ -5286,16 +5314,16 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
 			goto do_free;
 		}
 
-		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+		localtry_unlock(&s->cpu_sheaves->lock);
 
 		empty = alloc_empty_sheaf(s, GFP_NOWAIT);
 
-		if (!empty) {
-			stat(s, FREE_RCU_SHEAF_FAIL);
-			return false;
-		}
+		if (!empty)
+			goto fail;
+
+		if (!localtry_trylock(&s->cpu_sheaves->lock))
+			goto fail;
 
-		local_lock_irqsave(&s->cpu_sheaves->lock, flags);
 		pcs = this_cpu_ptr(s->cpu_sheaves);
 
 		if (unlikely(pcs->rcu_free))
@@ -5311,19 +5339,22 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
 	rcu_sheaf->objects[rcu_sheaf->size++] = obj;
 
 	if (likely(rcu_sheaf->size < s->sheaf_capacity)) {
-		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+		localtry_unlock(&s->cpu_sheaves->lock);
 		stat(s, FREE_RCU_SHEAF);
 		return true;
 	}
 
 	pcs->rcu_free = NULL;
-	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+	localtry_unlock(&s->cpu_sheaves->lock);
 
 	call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);
 
 	stat(s, FREE_RCU_SHEAF);
-
 	return true;
+
+fail:
+	stat(s, FREE_RCU_SHEAF_FAIL);
+	return false;
 }
 
 /*
@@ -5335,7 +5366,6 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 {
 	struct slub_percpu_sheaves *pcs;
 	struct slab_sheaf *main;
-	unsigned long flags;
 	unsigned int batch, i = 0;
 	bool init;
 
@@ -5358,7 +5388,9 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 	}
 
 next_batch:
-	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
+	if (!localtry_trylock(&s->cpu_sheaves->lock))
+		goto fallback;
+
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 
 	if (unlikely(pcs->main->size == s->sheaf_capacity)) {
@@ -5389,13 +5421,13 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 		}
 
 no_empty:
-		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+		localtry_unlock(&s->cpu_sheaves->lock);
 
 		/*
 		 * if we depleted all empty sheaves in the barn or there are too
 		 * many full sheaves, free the rest to slab pages
 		 */
-
+fallback:
 		__kmem_cache_free_bulk(s, size, p);
 		return;
 	}
@@ -5407,7 +5439,7 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 	memcpy(main->objects + main->size, p, batch * sizeof(void *));
 	main->size += batch;
 
-	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+	localtry_unlock(&s->cpu_sheaves->lock);
 
 	stat_add(s, FREE_PCS, batch);
 
@@ -5507,9 +5539,7 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
 	if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
 		return;
 
-	if (s->cpu_sheaves)
-		free_to_pcs(s, object);
-	else
+	if (!s->cpu_sheaves || !free_to_pcs(s, object))
 		do_slab_free(s, slab, object, object, 1, addr);
 }
 
@@ -6288,7 +6318,7 @@ static int init_percpu_sheaves(struct kmem_cache *s)
 
 		pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
 
-		local_lock_init(&pcs->lock);
+		localtry_lock_init(&pcs->lock);
 
 		nid = cpu_to_mem(cpu);
 

-- 
2.48.1



^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH RFC v2 06/10] slab: sheaf prefilling for guaranteed allocations
  2025-02-14 16:27 [PATCH RFC v2 00/10] SLUB percpu sheaves Vlastimil Babka
                   ` (4 preceding siblings ...)
  2025-02-14 16:27 ` [PATCH RFC v2 05/10] slab: switch percpu sheaves locking to localtry_lock Vlastimil Babka
@ 2025-02-14 16:27 ` Vlastimil Babka
  2025-02-23  3:54   ` Suren Baghdasaryan
  2025-02-25  8:00   ` Harry Yoo
  2025-02-14 16:27 ` [PATCH RFC v2 07/10] slab: determine barn status racily outside of lock Vlastimil Babka
                   ` (5 subsequent siblings)
  11 siblings, 2 replies; 55+ messages in thread
From: Vlastimil Babka @ 2025-02-14 16:27 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, Vlastimil Babka

Add functions for efficient guaranteed allocations e.g. in a critical
section that cannot sleep, when the exact number of allocations is not
known beforehand, but an upper limit can be calculated.

kmem_cache_prefill_sheaf() returns a sheaf containing at least given
number of objects.

kmem_cache_alloc_from_sheaf() will allocate an object from the sheaf
and is guaranteed not to fail until depleted.

kmem_cache_return_sheaf() is for giving the sheaf back to the slab
allocator after the critical section. This will also attempt to refill
it to cache's sheaf capacity for better efficiency of sheaves handling,
but it's not stricly necessary to succeed.

kmem_cache_refill_sheaf() can be used to refill a previously obtained
sheaf to requested size. If the current size is sufficient, it does
nothing. If the requested size exceeds cache's sheaf_capacity and the
sheaf's current capacity, the sheaf will be replaced with a new one,
hence the indirect pointer parameter.

kmem_cache_sheaf_size() can be used to query the current size.

The implementation supports requesting sizes that exceed cache's
sheaf_capacity, but it is not efficient - such sheaves are allocated
fresh in kmem_cache_prefill_sheaf() and flushed and freed immediately by
kmem_cache_return_sheaf(). kmem_cache_refill_sheaf() might be expecially
ineffective when replacing a sheaf with a new one of a larger capacity.
It is therefore better to size cache's sheaf_capacity accordingly.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/slab.h |  16 ++++
 mm/slub.c            | 227 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 243 insertions(+)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 0e1b25228c77140d05b5b4433c9d7923de36ec05..dd01b67982e856b1b02f4f0e6fc557726e7f02a8 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -829,6 +829,22 @@ void *kmem_cache_alloc_node_noprof(struct kmem_cache *s, gfp_t flags,
 				   int node) __assume_slab_alignment __malloc;
 #define kmem_cache_alloc_node(...)	alloc_hooks(kmem_cache_alloc_node_noprof(__VA_ARGS__))
 
+struct slab_sheaf *
+kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size);
+
+int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
+		struct slab_sheaf **sheafp, unsigned int size);
+
+void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
+				       struct slab_sheaf *sheaf);
+
+void *kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *cachep, gfp_t gfp,
+			struct slab_sheaf *sheaf) __assume_slab_alignment __malloc;
+#define kmem_cache_alloc_from_sheaf(...)	\
+			alloc_hooks(kmem_cache_alloc_from_sheaf_noprof(__VA_ARGS__))
+
+unsigned int kmem_cache_sheaf_size(struct slab_sheaf *sheaf);
+
 /*
  * These macros allow declaring a kmem_buckets * parameter alongside size, which
  * can be compiled out with CONFIG_SLAB_BUCKETS=n so that a large number of call
diff --git a/mm/slub.c b/mm/slub.c
index 3d7345e7e938d53950ed0d6abe8eb0e93cf8f5b1..c1df7cf22267f28f743404531bef921e25fac086 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -443,6 +443,8 @@ struct slab_sheaf {
 	union {
 		struct rcu_head rcu_head;
 		struct list_head barn_list;
+		/* only used for prefilled sheafs */
+		unsigned int capacity;
 	};
 	struct kmem_cache *cache;
 	unsigned int size;
@@ -2735,6 +2737,30 @@ static int barn_put_full_sheaf(struct node_barn *barn, struct slab_sheaf *sheaf,
 	return ret;
 }
 
+static struct slab_sheaf *barn_get_full_or_empty_sheaf(struct node_barn *barn)
+{
+	struct slab_sheaf *sheaf = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	if (barn->nr_full) {
+		sheaf = list_first_entry(&barn->sheaves_full, struct slab_sheaf,
+					barn_list);
+		list_del(&sheaf->barn_list);
+		barn->nr_full--;
+	} else if (barn->nr_empty) {
+		sheaf = list_first_entry(&barn->sheaves_empty,
+					 struct slab_sheaf, barn_list);
+		list_del(&sheaf->barn_list);
+		barn->nr_empty--;
+	}
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+
+	return sheaf;
+}
+
 /*
  * If a full sheaf is available, return it and put the supplied empty one to
  * barn. We ignore the limit on empty sheaves as the number of sheaves doesn't
@@ -4831,6 +4857,207 @@ void *kmem_cache_alloc_node_noprof(struct kmem_cache *s, gfp_t gfpflags, int nod
 }
 EXPORT_SYMBOL(kmem_cache_alloc_node_noprof);
 
+
+/*
+ * returns a sheaf that has least the requested size
+ * when prefilling is needed, do so with given gfp flags
+ *
+ * return NULL if sheaf allocation or prefilling failed
+ */
+struct slab_sheaf *
+kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct slab_sheaf *sheaf = NULL;
+
+	if (unlikely(size > s->sheaf_capacity)) {
+		sheaf = kzalloc(struct_size(sheaf, objects, size), gfp);
+		if (!sheaf)
+			return NULL;
+
+		sheaf->cache = s;
+		sheaf->capacity = size;
+
+		if (!__kmem_cache_alloc_bulk(s, gfp, size,
+					     &sheaf->objects[0])) {
+			kfree(sheaf);
+			return NULL;
+		}
+
+		sheaf->size = size;
+
+		return sheaf;
+	}
+
+	localtry_lock(&s->cpu_sheaves->lock);
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (pcs->spare) {
+		sheaf = pcs->spare;
+		pcs->spare = NULL;
+	}
+
+	if (!sheaf)
+		sheaf = barn_get_full_or_empty_sheaf(pcs->barn);
+
+	localtry_unlock(&s->cpu_sheaves->lock);
+
+	if (!sheaf) {
+		sheaf = alloc_empty_sheaf(s, gfp);
+	}
+
+	if (sheaf && sheaf->size < size) {
+		if (refill_sheaf(s, sheaf, gfp)) {
+			sheaf_flush(s, sheaf);
+			free_empty_sheaf(s, sheaf);
+			sheaf = NULL;
+		}
+	}
+
+	if (sheaf)
+		sheaf->capacity = s->sheaf_capacity;
+
+	return sheaf;
+}
+
+/*
+ * Use this to return a sheaf obtained by kmem_cache_prefill_sheaf()
+ * It tries to refill the sheaf back to the cache's sheaf_capacity
+ * to avoid handling partially full sheaves.
+ *
+ * If the refill fails because gfp is e.g. GFP_NOWAIT, the sheaf is
+ * instead dissolved
+ */
+void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
+			     struct slab_sheaf *sheaf)
+{
+	struct slub_percpu_sheaves *pcs;
+	bool refill = false;
+	struct node_barn *barn;
+
+	if (unlikely(sheaf->capacity != s->sheaf_capacity)) {
+		sheaf_flush(s, sheaf);
+		kfree(sheaf);
+		return;
+	}
+
+	localtry_lock(&s->cpu_sheaves->lock);
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (!pcs->spare) {
+		pcs->spare = sheaf;
+		sheaf = NULL;
+	} else if (pcs->barn->nr_full >= MAX_FULL_SHEAVES) {
+		/* racy check */
+		barn = pcs->barn;
+		refill = true;
+	}
+
+	localtry_unlock(&s->cpu_sheaves->lock);
+
+	if (!sheaf)
+		return;
+
+	/*
+	 * if the barn is full of full sheaves or we fail to refill the sheaf,
+	 * simply flush and free it
+	 */
+	if (!refill || refill_sheaf(s, sheaf, gfp)) {
+		sheaf_flush(s, sheaf);
+		free_empty_sheaf(s, sheaf);
+		return;
+	}
+
+	/* we racily determined the sheaf would fit, so now force it */
+	barn_put_full_sheaf(barn, sheaf, true);
+}
+
+/*
+ * refill a sheaf previously returned by kmem_cache_prefill_sheaf to at least
+ * the given size
+ *
+ * the sheaf might be replaced by a new one when requesting more than
+ * s->sheaf_capacity objects if such replacement is necessary, but the refill
+ * fails (with -ENOMEM), the existing sheaf is left intact
+ */
+int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
+			    struct slab_sheaf **sheafp, unsigned int size)
+{
+	struct slab_sheaf *sheaf;
+
+	/*
+	 * TODO: do we want to support *sheaf == NULL to be equivalent of
+	 * kmem_cache_prefill_sheaf() ?
+	 */
+	if (!sheafp || !(*sheafp))
+		return -EINVAL;
+
+	sheaf = *sheafp;
+	if (sheaf->size >= size)
+		return 0;
+
+	if (likely(sheaf->capacity >= size)) {
+		if (likely(sheaf->capacity == s->sheaf_capacity))
+			return refill_sheaf(s, sheaf, gfp);
+
+		if (!__kmem_cache_alloc_bulk(s, gfp, sheaf->capacity - sheaf->size,
+					     &sheaf->objects[sheaf->size])) {
+			return -ENOMEM;
+		}
+		sheaf->size = sheaf->capacity;
+
+		return 0;
+	}
+
+	/*
+	 * We had a regular sized sheaf and need an oversize one, or we had an
+	 * oversize one already but need a larger one now.
+	 * This should be a very rare path so let's not complicate it.
+	 */
+	sheaf = kmem_cache_prefill_sheaf(s, gfp, size);
+	if (!sheaf)
+		return -ENOMEM;
+
+	kmem_cache_return_sheaf(s, gfp, *sheafp);
+	*sheafp = sheaf;
+	return 0;
+}
+
+/*
+ * Allocate from a sheaf obtained by kmem_cache_prefill_sheaf()
+ *
+ * Guaranteed not to fail as many allocations as was the requested size.
+ * After the sheaf is emptied, it fails - no fallback to the slab cache itself.
+ *
+ * The gfp parameter is meant only to specify __GFP_ZERO or __GFP_ACCOUNT
+ * memcg charging is forced over limit if necessary, to avoid failure.
+ */
+void *
+kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *s, gfp_t gfp,
+				   struct slab_sheaf *sheaf)
+{
+	void *ret = NULL;
+	bool init;
+
+	if (sheaf->size == 0)
+		goto out;
+
+	ret = sheaf->objects[--sheaf->size];
+
+	init = slab_want_init_on_alloc(gfp, s);
+
+	/* add __GFP_NOFAIL to force successful memcg charging */
+	slab_post_alloc_hook(s, NULL, gfp | __GFP_NOFAIL, 1, &ret, init, s->object_size);
+out:
+	trace_kmem_cache_alloc(_RET_IP_, ret, s, gfp, NUMA_NO_NODE);
+
+	return ret;
+}
+
+unsigned int kmem_cache_sheaf_size(struct slab_sheaf *sheaf)
+{
+	return sheaf->size;
+}
 /*
  * To avoid unnecessary overhead, we pass through large allocation requests
  * directly to the page allocator. We use __GFP_COMP, because we will need to

-- 
2.48.1



^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH RFC v2 07/10] slab: determine barn status racily outside of lock
  2025-02-14 16:27 [PATCH RFC v2 00/10] SLUB percpu sheaves Vlastimil Babka
                   ` (5 preceding siblings ...)
  2025-02-14 16:27 ` [PATCH RFC v2 06/10] slab: sheaf prefilling for guaranteed allocations Vlastimil Babka
@ 2025-02-14 16:27 ` Vlastimil Babka
  2025-02-23  4:00   ` Suren Baghdasaryan
  2025-02-25  8:54   ` Harry Yoo
  2025-02-14 16:27 ` [PATCH RFC v2 08/10] tools: Add testing support for changes to rcu and slab for sheaves Vlastimil Babka
                   ` (4 subsequent siblings)
  11 siblings, 2 replies; 55+ messages in thread
From: Vlastimil Babka @ 2025-02-14 16:27 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, Vlastimil Babka

The possibility of many barn operations is determined by the current
number of full or empty sheaves. Taking the barn->lock just to find out
that e.g. there are no empty sheaves results in unnecessary overhead and
lock contention. Thus perform these checks outside of the lock with a
data_race() annotated variable read and fail quickly without taking the
lock.

Checks for sheaf availability that racily succeed have to be obviously
repeated under the lock for correctness, but we can skip repeating
checks if there are too many sheaves on the given list as the limits
don't need to be strict.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slub.c | 57 ++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 34 insertions(+), 23 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index c1df7cf22267f28f743404531bef921e25fac086..72e6437f1d74bfacbb1cd7642af42929c48cc66a 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2685,9 +2685,12 @@ static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn)
 	struct slab_sheaf *empty = NULL;
 	unsigned long flags;
 
+	if (!data_race(barn->nr_empty))
+		return NULL;
+
 	spin_lock_irqsave(&barn->lock, flags);
 
-	if (barn->nr_empty) {
+	if (likely(barn->nr_empty)) {
 		empty = list_first_entry(&barn->sheaves_empty,
 					 struct slab_sheaf, barn_list);
 		list_del(&empty->barn_list);
@@ -2703,38 +2706,36 @@ static int barn_put_empty_sheaf(struct node_barn *barn,
 				struct slab_sheaf *sheaf, bool ignore_limit)
 {
 	unsigned long flags;
-	int ret = 0;
+
+	/* we don't repeat the check under barn->lock as it's not critical */
+	if (!ignore_limit && data_race(barn->nr_empty) >= MAX_EMPTY_SHEAVES)
+		return -E2BIG;
 
 	spin_lock_irqsave(&barn->lock, flags);
 
-	if (!ignore_limit && barn->nr_empty >= MAX_EMPTY_SHEAVES) {
-		ret = -E2BIG;
-	} else {
-		list_add(&sheaf->barn_list, &barn->sheaves_empty);
-		barn->nr_empty++;
-	}
+	list_add(&sheaf->barn_list, &barn->sheaves_empty);
+	barn->nr_empty++;
 
 	spin_unlock_irqrestore(&barn->lock, flags);
-	return ret;
+	return 0;
 }
 
 static int barn_put_full_sheaf(struct node_barn *barn, struct slab_sheaf *sheaf,
 			       bool ignore_limit)
 {
 	unsigned long flags;
-	int ret = 0;
+
+	/* we don't repeat the check under barn->lock as it's not critical */
+	if (!ignore_limit && data_race(barn->nr_full) >= MAX_FULL_SHEAVES)
+		return -E2BIG;
 
 	spin_lock_irqsave(&barn->lock, flags);
 
-	if (!ignore_limit && barn->nr_full >= MAX_FULL_SHEAVES) {
-		ret = -E2BIG;
-	} else {
-		list_add(&sheaf->barn_list, &barn->sheaves_full);
-		barn->nr_full++;
-	}
+	list_add(&sheaf->barn_list, &barn->sheaves_full);
+	barn->nr_full++;
 
 	spin_unlock_irqrestore(&barn->lock, flags);
-	return ret;
+	return 0;
 }
 
 static struct slab_sheaf *barn_get_full_or_empty_sheaf(struct node_barn *barn)
@@ -2742,6 +2743,9 @@ static struct slab_sheaf *barn_get_full_or_empty_sheaf(struct node_barn *barn)
 	struct slab_sheaf *sheaf = NULL;
 	unsigned long flags;
 
+	if (!data_race(barn->nr_full) && !data_race(barn->nr_empty))
+		return NULL;
+
 	spin_lock_irqsave(&barn->lock, flags);
 
 	if (barn->nr_full) {
@@ -2772,9 +2776,12 @@ barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty)
 	struct slab_sheaf *full = NULL;
 	unsigned long flags;
 
+	if (!data_race(barn->nr_full))
+		return NULL;
+
 	spin_lock_irqsave(&barn->lock, flags);
 
-	if (barn->nr_full) {
+	if (likely(barn->nr_full)) {
 		full = list_first_entry(&barn->sheaves_full, struct slab_sheaf,
 					barn_list);
 		list_del(&full->barn_list);
@@ -2797,19 +2804,23 @@ barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
 	struct slab_sheaf *empty;
 	unsigned long flags;
 
+	/* we don't repeat this check under barn->lock as it's not critical */
+	if (data_race(barn->nr_full) >= MAX_FULL_SHEAVES)
+		return ERR_PTR(-E2BIG);
+	if (!data_race(barn->nr_empty))
+		return ERR_PTR(-ENOMEM);
+
 	spin_lock_irqsave(&barn->lock, flags);
 
-	if (barn->nr_full >= MAX_FULL_SHEAVES) {
-		empty = ERR_PTR(-E2BIG);
-	} else if (!barn->nr_empty) {
-		empty = ERR_PTR(-ENOMEM);
-	} else {
+	if (likely(barn->nr_empty)) {
 		empty = list_first_entry(&barn->sheaves_empty, struct slab_sheaf,
 					 barn_list);
 		list_del(&empty->barn_list);
 		list_add(&full->barn_list, &barn->sheaves_full);
 		barn->nr_empty--;
 		barn->nr_full++;
+	} else {
+		empty = ERR_PTR(-ENOMEM);
 	}
 
 	spin_unlock_irqrestore(&barn->lock, flags);

-- 
2.48.1



^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH RFC v2 08/10] tools: Add testing support for changes to rcu and slab for sheaves
  2025-02-14 16:27 [PATCH RFC v2 00/10] SLUB percpu sheaves Vlastimil Babka
                   ` (6 preceding siblings ...)
  2025-02-14 16:27 ` [PATCH RFC v2 07/10] slab: determine barn status racily outside of lock Vlastimil Babka
@ 2025-02-14 16:27 ` Vlastimil Babka
  2025-02-23  4:24   ` Suren Baghdasaryan
  2025-02-14 16:27 ` [PATCH RFC v2 09/10] tools: Add sheafs support to testing infrastructure Vlastimil Babka
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 55+ messages in thread
From: Vlastimil Babka @ 2025-02-14 16:27 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Make testing work for the slab and rcu changes that have come in with
the sheaves work.

This only works with one kmem_cache, and only the first one used.
Subsequent setting of keme_cache will not update the active kmem_cache
and will be silently dropped because there are other tests which happen
after the kmem_cache of interest is set.

The saved active kmem_cache is used in the rcu callback, which passes
the object to be freed.

The rcu call takes the rcu_head, which is passed in as the field in the
struct (in this case rcu in the maple tree node), which is calculated by
pointer math.  The offset of which is saved (in a global variable) for
restoring the node pointer on the callback after the rcu grace period
expires.

Don't use any of this outside of testing, please.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 tools/include/linux/slab.h            | 41 ++++++++++++++++++++++++++++++++---
 tools/testing/shared/linux.c          | 24 ++++++++++++++++----
 tools/testing/shared/linux/rcupdate.h | 22 +++++++++++++++++++
 3 files changed, 80 insertions(+), 7 deletions(-)

diff --git a/tools/include/linux/slab.h b/tools/include/linux/slab.h
index 51b25e9c4ec7b66bdf4c68cc1353c6faf1ca7bb8..a475364cfd9fcdb10db252aab18ea3a620326b6b 100644
--- a/tools/include/linux/slab.h
+++ b/tools/include/linux/slab.h
@@ -22,6 +22,12 @@ enum slab_state {
 	FULL
 };
 
+struct kmem_cache_args {
+	unsigned int align;
+	unsigned int sheaf_capacity;
+	void (*ctor)(void *);
+};
+
 static inline void *kzalloc(size_t size, gfp_t gfp)
 {
 	return kmalloc(size, gfp | __GFP_ZERO);
@@ -36,9 +42,38 @@ static inline void *kmem_cache_alloc(struct kmem_cache *cachep, int flags)
 }
 void kmem_cache_free(struct kmem_cache *cachep, void *objp);
 
-struct kmem_cache *kmem_cache_create(const char *name, unsigned int size,
-			unsigned int align, unsigned int flags,
-			void (*ctor)(void *));
+
+struct kmem_cache *
+__kmem_cache_create_args(const char *name, unsigned int size,
+		struct kmem_cache_args *args, unsigned int flags);
+
+/* If NULL is passed for @args, use this variant with default arguments. */
+static inline struct kmem_cache *
+__kmem_cache_default_args(const char *name, unsigned int size,
+		struct kmem_cache_args *args, unsigned int flags)
+{
+	struct kmem_cache_args kmem_default_args = {};
+
+	return __kmem_cache_create_args(name, size, &kmem_default_args, flags);
+}
+
+static inline struct kmem_cache *
+__kmem_cache_create(const char *name, unsigned int size, unsigned int align,
+		unsigned int flags, void (*ctor)(void *))
+{
+	struct kmem_cache_args kmem_args = {
+		.align	= align,
+		.ctor	= ctor,
+	};
+
+	return __kmem_cache_create_args(name, size, &kmem_args, flags);
+}
+
+#define kmem_cache_create(__name, __object_size, __args, ...)           \
+	_Generic((__args),                                              \
+		struct kmem_cache_args *: __kmem_cache_create_args,	\
+		void *: __kmem_cache_default_args,			\
+		default: __kmem_cache_create)(__name, __object_size, __args, __VA_ARGS__)
 
 void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t size, void **list);
 int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
diff --git a/tools/testing/shared/linux.c b/tools/testing/shared/linux.c
index 66dbb362385f3c3d923233448cc591adfe6dc9e7..9f5fd722f27f1d3877be8927be30409cd74ab3c3 100644
--- a/tools/testing/shared/linux.c
+++ b/tools/testing/shared/linux.c
@@ -20,6 +20,7 @@ struct kmem_cache {
 	pthread_mutex_t lock;
 	unsigned int size;
 	unsigned int align;
+	unsigned int sheaf_capacity;
 	int nr_objs;
 	void *objs;
 	void (*ctor)(void *);
@@ -31,6 +32,8 @@ struct kmem_cache {
 	void *private;
 };
 
+static struct kmem_cache *kmem_active = NULL;
+
 void kmem_cache_set_callback(struct kmem_cache *cachep, void (*callback)(void *))
 {
 	cachep->callback = callback;
@@ -147,6 +150,14 @@ void kmem_cache_free(struct kmem_cache *cachep, void *objp)
 	pthread_mutex_unlock(&cachep->lock);
 }
 
+void kmem_cache_free_active(void *objp)
+{
+	if (!kmem_active)
+		printf("WARNING: No active kmem_cache\n");
+
+	kmem_cache_free(kmem_active, objp);
+}
+
 void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t size, void **list)
 {
 	if (kmalloc_verbose)
@@ -234,23 +245,28 @@ int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
 }
 
 struct kmem_cache *
-kmem_cache_create(const char *name, unsigned int size, unsigned int align,
-		unsigned int flags, void (*ctor)(void *))
+__kmem_cache_create_args(const char *name, unsigned int size,
+			  struct kmem_cache_args *args,
+			  unsigned int flags)
 {
 	struct kmem_cache *ret = malloc(sizeof(*ret));
 
 	pthread_mutex_init(&ret->lock, NULL);
 	ret->size = size;
-	ret->align = align;
+	ret->align = args->align;
+	ret->sheaf_capacity = args->sheaf_capacity;
 	ret->nr_objs = 0;
 	ret->nr_allocated = 0;
 	ret->nr_tallocated = 0;
 	ret->objs = NULL;
-	ret->ctor = ctor;
+	ret->ctor = args->ctor;
 	ret->non_kernel = 0;
 	ret->exec_callback = false;
 	ret->callback = NULL;
 	ret->private = NULL;
+	if (!kmem_active)
+		kmem_active = ret;
+
 	return ret;
 }
 
diff --git a/tools/testing/shared/linux/rcupdate.h b/tools/testing/shared/linux/rcupdate.h
index fed468fb0c78db6f33fb1900c7110ab5f3c19c65..c95e2f0bbd93798e544d7d34e0823ed68414f924 100644
--- a/tools/testing/shared/linux/rcupdate.h
+++ b/tools/testing/shared/linux/rcupdate.h
@@ -9,4 +9,26 @@
 #define rcu_dereference_check(p, cond) rcu_dereference(p)
 #define RCU_INIT_POINTER(p, v)	do { (p) = (v); } while (0)
 
+void kmem_cache_free_active(void *objp);
+static unsigned long kfree_cb_offset = 0;
+
+static inline void kfree_rcu_cb(struct rcu_head *head)
+{
+	void *objp = (void *) ((unsigned long)head - kfree_cb_offset);
+
+	kmem_cache_free_active(objp);
+}
+
+#ifndef offsetof
+#define offsetof(TYPE, MEMBER)	__builtin_offsetof(TYPE, MEMBER)
+#endif
+
+#define kfree_rcu(ptr, rhv)						\
+do {									\
+	if (!kfree_cb_offset)						\
+		kfree_cb_offset = offsetof(typeof(*(ptr)), rhv);	\
+									\
+	call_rcu(&ptr->rhv, kfree_rcu_cb);				\
+} while (0)
+
 #endif

-- 
2.48.1



^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH RFC v2 09/10] tools: Add sheafs support to testing infrastructure
  2025-02-14 16:27 [PATCH RFC v2 00/10] SLUB percpu sheaves Vlastimil Babka
                   ` (7 preceding siblings ...)
  2025-02-14 16:27 ` [PATCH RFC v2 08/10] tools: Add testing support for changes to rcu and slab for sheaves Vlastimil Babka
@ 2025-02-14 16:27 ` Vlastimil Babka
  2025-02-14 16:27 ` [PATCH RFC v2 10/10] maple_tree: use percpu sheaves for maple_node_cache Vlastimil Babka
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 55+ messages in thread
From: Vlastimil Babka @ 2025-02-14 16:27 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Allocate a sheaf and fill it to the count amount.  Does not fill to the
sheaf limit to detect incorrect allocation requests.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 tools/include/linux/slab.h   | 24 +++++++++++++
 tools/testing/shared/linux.c | 84 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 108 insertions(+)

diff --git a/tools/include/linux/slab.h b/tools/include/linux/slab.h
index a475364cfd9fcdb10db252aab18ea3a620326b6b..0b6b42c9921fc402b4f3d4f681a95c9067d128db 100644
--- a/tools/include/linux/slab.h
+++ b/tools/include/linux/slab.h
@@ -22,6 +22,13 @@ enum slab_state {
 	FULL
 };
 
+struct slab_sheaf {
+	struct kmem_cache *cache;
+	unsigned int size;
+	unsigned int capacity;
+	void *objects[];
+};
+
 struct kmem_cache_args {
 	unsigned int align;
 	unsigned int sheaf_capacity;
@@ -79,4 +86,21 @@ void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t size, void **list);
 int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
 			  void **list);
 
+struct slab_sheaf *
+kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size);
+
+void *
+kmem_cache_alloc_from_sheaf(struct kmem_cache *s, gfp_t gfp,
+		struct slab_sheaf *sheaf);
+
+void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
+		struct slab_sheaf *sheaf);
+int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
+		struct slab_sheaf **sheafp, unsigned int size);
+
+static inline unsigned int kmem_cache_sheaf_size(struct slab_sheaf *sheaf)
+{
+	return sheaf->size;
+}
+
 #endif		/* _TOOLS_SLAB_H */
diff --git a/tools/testing/shared/linux.c b/tools/testing/shared/linux.c
index 9f5fd722f27f1d3877be8927be30409cd74ab3c3..a61c755c3c87e80036a5173115e955bfe7d5a80c 100644
--- a/tools/testing/shared/linux.c
+++ b/tools/testing/shared/linux.c
@@ -181,6 +181,12 @@ int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
 	if (kmalloc_verbose)
 		pr_debug("Bulk alloc %lu\n", size);
 
+	if (cachep->exec_callback) {
+		if (cachep->callback)
+			cachep->callback(cachep->private);
+		cachep->exec_callback = false;
+	}
+
 	pthread_mutex_lock(&cachep->lock);
 	if (cachep->nr_objs >= size) {
 		struct radix_tree_node *node;
@@ -270,6 +276,84 @@ __kmem_cache_create_args(const char *name, unsigned int size,
 	return ret;
 }
 
+struct slab_sheaf *
+kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
+{
+	struct slab_sheaf *sheaf;
+	unsigned int capacity;
+
+	if (size > s->sheaf_capacity)
+		capacity = size;
+	else
+		capacity = s->sheaf_capacity;
+
+	sheaf = malloc(sizeof(*sheaf) + sizeof(void *) * s->sheaf_capacity * capacity);
+	if (!sheaf) {
+		return NULL;
+	}
+
+	memset(sheaf, 0, size);
+	sheaf->cache = s;
+	sheaf->capacity = capacity;
+	sheaf->size = kmem_cache_alloc_bulk(s, gfp, size, sheaf->objects);
+	if (!sheaf->size) {
+		free(sheaf);
+		return NULL;
+	}
+
+	return sheaf;
+}
+
+int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
+		 struct slab_sheaf **sheafp, unsigned int size)
+{
+	struct slab_sheaf *sheaf = *sheafp;
+	int refill;
+
+	if (sheaf->size >= size)
+		return 0;
+
+	if (size > sheaf->capacity) {
+		sheaf = kmem_cache_prefill_sheaf(s, gfp, size);
+		if (!sheaf)
+			return -ENOMEM;
+
+		kmem_cache_return_sheaf(s, gfp, *sheafp);
+		*sheafp = sheaf;
+		return 0;
+	}
+
+	refill = kmem_cache_alloc_bulk(s, gfp, size - sheaf->size,
+				       &sheaf->objects[sheaf->size]);
+	if (!refill)
+		return -ENOMEM;
+
+	sheaf->size += refill;
+	return 0;
+}
+
+void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
+		 struct slab_sheaf *sheaf)
+{
+	if (sheaf->size) {
+		//s->non_kernel += sheaf->size;
+		kmem_cache_free_bulk(s, sheaf->size, &sheaf->objects[0]);
+	}
+	free(sheaf);
+}
+
+void *
+kmem_cache_alloc_from_sheaf(struct kmem_cache *s, gfp_t gfp,
+		struct slab_sheaf *sheaf)
+{
+	if (sheaf->size == 0) {
+		printf("Nothing left in sheaf!\n");
+		return NULL;
+	}
+
+	return sheaf->objects[--sheaf->size];
+}
+
 /*
  * Test the test infrastructure for kem_cache_alloc/free and bulk counterparts.
  */

-- 
2.48.1



^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH RFC v2 10/10] maple_tree: use percpu sheaves for maple_node_cache
  2025-02-14 16:27 [PATCH RFC v2 00/10] SLUB percpu sheaves Vlastimil Babka
                   ` (8 preceding siblings ...)
  2025-02-14 16:27 ` [PATCH RFC v2 09/10] tools: Add sheafs support to testing infrastructure Vlastimil Babka
@ 2025-02-14 16:27 ` Vlastimil Babka
  2025-02-23  4:27   ` Suren Baghdasaryan
  2025-02-14 18:28 ` [PATCH RFC v2 00/10] SLUB percpu sheaves Christoph Lameter (Ampere)
  2025-02-23  0:19 ` Kent Overstreet
  11 siblings, 1 reply; 55+ messages in thread
From: Vlastimil Babka @ 2025-02-14 16:27 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, Vlastimil Babka

Setup the maple_node_cache with percpu sheaves of size 32 to hopefully
improve its performance. Change the single node rcu freeing in
ma_free_rcu() to use kfree_rcu() instead of the custom callback, which
allows the rcu_free sheaf batching to be used. Note there are other
users of mt_free_rcu() where larger parts of maple tree are submitted to
call_rcu() as a whole, and that cannot use the rcu_free sheaf, but it's
still possible for maple nodes freed this way to be reused via the barn,
even if only some cpus are allowed to process rcu callbacks.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 lib/maple_tree.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index f7153ade1be5f16423f0ca073846a7f3dfa60523..56e7a00f6f0941bff163091c999a873e4273f071 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -208,7 +208,7 @@ static void mt_free_rcu(struct rcu_head *head)
 static void ma_free_rcu(struct maple_node *node)
 {
 	WARN_ON(node->parent != ma_parent_ptr(node));
-	call_rcu(&node->rcu, mt_free_rcu);
+	kfree_rcu(node, rcu);
 }
 
 static void mas_set_height(struct ma_state *mas)
@@ -6258,9 +6258,14 @@ bool mas_nomem(struct ma_state *mas, gfp_t gfp)
 
 void __init maple_tree_init(void)
 {
+	struct kmem_cache_args args = {
+		.align  = sizeof(struct maple_node),
+		.sheaf_capacity = 32,
+	};
+
 	maple_node_cache = kmem_cache_create("maple_node",
-			sizeof(struct maple_node), sizeof(struct maple_node),
-			SLAB_PANIC, NULL);
+			sizeof(struct maple_node), &args,
+			SLAB_PANIC);
 }
 
 /**

-- 
2.48.1



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 00/10] SLUB percpu sheaves
  2025-02-14 16:27 [PATCH RFC v2 00/10] SLUB percpu sheaves Vlastimil Babka
                   ` (9 preceding siblings ...)
  2025-02-14 16:27 ` [PATCH RFC v2 10/10] maple_tree: use percpu sheaves for maple_node_cache Vlastimil Babka
@ 2025-02-14 18:28 ` Christoph Lameter (Ampere)
  2025-02-23  0:19 ` Kent Overstreet
  11 siblings, 0 replies; 55+ messages in thread
From: Christoph Lameter (Ampere) @ 2025-02-14 18:28 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, David Rientjes,
	Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, Sebastian Andrzej Siewior,
	Alexei Starovoitov

On Fri, 14 Feb 2025, Vlastimil Babka wrote:

> - Cheaper fast paths. For allocations, instead of local double cmpxchg,
>   after Patch 5 it's preempt_disable() and no atomic operations. Same for
>   freeing, which is normally a local double cmpxchg only for a short
>   term allocations (so the same slab is still active on the same cpu when
>   freeing the object) and a more costly locked double cmpxchg otherwise.
>   The downside is the lack of NUMA locality guarantees for the allocated
>   objects.

The local double cmpxchg is not an atomic instruction. For that it would
need a lock prefix.

The local cmpxchg is atomic vs an interrupt because the interrupt can only
occur between instructions. That is true for any processor instruction.

We use the fact that the cmpxchg does a RMV in one unbreakable
instruction to ensure that interrupts cannot do evil things to the fast path.



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 03/10] locking/local_lock: Introduce localtry_lock_t
  2025-02-14 16:27 ` [PATCH RFC v2 03/10] locking/local_lock: Introduce localtry_lock_t Vlastimil Babka
@ 2025-02-17 14:19   ` Sebastian Andrzej Siewior
  2025-02-17 14:35     ` Vlastimil Babka
  2025-02-26 17:00   ` Davidlohr Bueso
  1 sibling, 1 reply; 55+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-17 14:19 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki,
	linux-mm, linux-kernel, rcu, maple-tree, Alexei Starovoitov

On 2025-02-14 17:27:39 [+0100], Vlastimil Babka wrote:
> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> 
> In !PREEMPT_RT local_lock_irqsave() disables interrupts to protect
> critical section, but it doesn't prevent NMI, so the fully reentrant
> code cannot use local_lock_irqsave() for exclusive access.
> 
> Introduce localtry_lock_t and localtry_lock_irqsave() that
> disables interrupts and sets acquired=1, so localtry_lock_irqsave()
> from NMI attempting to acquire the same lock will return false.
> 
> In PREEMPT_RT local_lock_irqsave() maps to preemptible spin_lock().
> Map localtry_lock_irqsave() to preemptible spin_trylock().
> When in hard IRQ or NMI return false right away, since
> spin_trylock() is not safe due to PI issues.

spin_trylock() is not safe due to explicit locking in the underneath
rt_spin_trylock() implementation. Removing this explicit locking and
attempting only "trylock" is undesired due to PI implications.

> Note there is no need to use local_inc for acquired variable,
> since it's a percpu variable with strict nesting scopes.
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Other than that, thank you two ;)

Sebastian


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 03/10] locking/local_lock: Introduce localtry_lock_t
  2025-02-17 14:19   ` Sebastian Andrzej Siewior
@ 2025-02-17 14:35     ` Vlastimil Babka
  2025-02-17 15:07       ` Sebastian Andrzej Siewior
  2025-02-18 18:41       ` Alexei Starovoitov
  0 siblings, 2 replies; 55+ messages in thread
From: Vlastimil Babka @ 2025-02-17 14:35 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki,
	linux-mm, linux-kernel, rcu, maple-tree, Alexei Starovoitov

On 2/17/25 15:19, Sebastian Andrzej Siewior wrote:
> On 2025-02-14 17:27:39 [+0100], Vlastimil Babka wrote:
>> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>> 
>> In !PREEMPT_RT local_lock_irqsave() disables interrupts to protect
>> critical section, but it doesn't prevent NMI, so the fully reentrant
>> code cannot use local_lock_irqsave() for exclusive access.
>> 
>> Introduce localtry_lock_t and localtry_lock_irqsave() that
>> disables interrupts and sets acquired=1, so localtry_lock_irqsave()
>> from NMI attempting to acquire the same lock will return false.
>> 
>> In PREEMPT_RT local_lock_irqsave() maps to preemptible spin_lock().
>> Map localtry_lock_irqsave() to preemptible spin_trylock().
>> When in hard IRQ or NMI return false right away, since
>> spin_trylock() is not safe due to PI issues.
> 
> spin_trylock() is not safe due to explicit locking in the underneath
> rt_spin_trylock() implementation. Removing this explicit locking and
> attempting only "trylock" is undesired due to PI implications.

Just to be sure, you're suggesting how to reword that sentence in the
changelog to make it more precise right?
Alexei will you incorporate that in your version?

>> Note there is no need to use local_inc for acquired variable,
>> since it's a percpu variable with strict nesting scopes.
>> 
>> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Other than that, thank you two ;)

Thank you too :)

Do you agree with my fixups and addition here?
https://lore.kernel.org/all/efc30cf9-8351-4889-8245-cc4a6893ebf4@suse.cz/

> Sebastian



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 03/10] locking/local_lock: Introduce localtry_lock_t
  2025-02-17 14:35     ` Vlastimil Babka
@ 2025-02-17 15:07       ` Sebastian Andrzej Siewior
  2025-02-18 18:41       ` Alexei Starovoitov
  1 sibling, 0 replies; 55+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-17 15:07 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki,
	linux-mm, linux-kernel, rcu, maple-tree, Alexei Starovoitov

On 2025-02-17 15:35:11 [+0100], Vlastimil Babka wrote:
> > spin_trylock() is not safe due to explicit locking in the underneath
> > rt_spin_trylock() implementation. Removing this explicit locking and
> > attempting only "trylock" is undesired due to PI implications.
> 
> Just to be sure, you're suggesting how to reword that sentence in the
> changelog to make it more precise right?

Yes, just a reword. Everything else is fine by me. It just feels odd ack
my own patch.

> Alexei will you incorporate that in your version?
> 
> >> Note there is no need to use local_inc for acquired variable,
> >> since it's a percpu variable with strict nesting scopes.
> >> 
> >> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> >> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> >> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> > 
> > Other than that, thank you two ;)
> 
> Thank you too :)
> 
> Do you agree with my fixups and addition here?
> https://lore.kernel.org/all/efc30cf9-8351-4889-8245-cc4a6893ebf4@suse.cz/

Yes, looks good.

Sebastian


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 03/10] locking/local_lock: Introduce localtry_lock_t
  2025-02-17 14:35     ` Vlastimil Babka
  2025-02-17 15:07       ` Sebastian Andrzej Siewior
@ 2025-02-18 18:41       ` Alexei Starovoitov
  1 sibling, 0 replies; 55+ messages in thread
From: Alexei Starovoitov @ 2025-02-18 18:41 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Sebastian Andrzej Siewior, Suren Baghdasaryan, Liam R. Howlett,
	Christoph Lameter, David Rientjes, Roman Gushchin, Hyeonggon Yoo,
	Uladzislau Rezki, linux-mm, LKML, rcu, maple-tree,
	Alexei Starovoitov

On Mon, Feb 17, 2025 at 6:35 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 2/17/25 15:19, Sebastian Andrzej Siewior wrote:
> > On 2025-02-14 17:27:39 [+0100], Vlastimil Babka wrote:
> >> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> >>
> >> In !PREEMPT_RT local_lock_irqsave() disables interrupts to protect
> >> critical section, but it doesn't prevent NMI, so the fully reentrant
> >> code cannot use local_lock_irqsave() for exclusive access.
> >>
> >> Introduce localtry_lock_t and localtry_lock_irqsave() that
> >> disables interrupts and sets acquired=1, so localtry_lock_irqsave()
> >> from NMI attempting to acquire the same lock will return false.
> >>
> >> In PREEMPT_RT local_lock_irqsave() maps to preemptible spin_lock().
> >> Map localtry_lock_irqsave() to preemptible spin_trylock().
> >> When in hard IRQ or NMI return false right away, since
> >> spin_trylock() is not safe due to PI issues.
> >
> > spin_trylock() is not safe due to explicit locking in the underneath
> > rt_spin_trylock() implementation. Removing this explicit locking and
> > attempting only "trylock" is undesired due to PI implications.

Makes sense.

> Just to be sure, you're suggesting how to reword that sentence in the
> changelog to make it more precise right?
> Alexei will you incorporate that in your version?

Sure. Let's squash patches 3 and 4 and add above
commit log clarification.
Whoever respins first can do it.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 01/10] slab: add opt-in caching layer of percpu sheaves
  2025-02-14 16:27 ` [PATCH RFC v2 01/10] slab: add opt-in caching layer of " Vlastimil Babka
@ 2025-02-22 22:46   ` Suren Baghdasaryan
  2025-02-22 22:56     ` Suren Baghdasaryan
  2025-03-12 14:57     ` Vlastimil Babka
  2025-02-24  8:04   ` Harry Yoo
  1 sibling, 2 replies; 55+ messages in thread
From: Suren Baghdasaryan @ 2025-02-22 22:46 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Fri, Feb 14, 2025 at 8:27 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> Specifying a non-zero value for a new struct kmem_cache_args field
> sheaf_capacity will setup a caching layer of percpu arrays called
> sheaves of given capacity for the created cache.
>
> Allocations from the cache will allocate via the percpu sheaves (main or
> spare) as long as they have no NUMA node preference. Frees will also
> refill one of the sheaves.
>
> When both percpu sheaves are found empty during an allocation, an empty
> sheaf may be replaced with a full one from the per-node barn. If none
> are available and the allocation is allowed to block, an empty sheaf is
> refilled from slab(s) by an internal bulk alloc operation. When both
> percpu sheaves are full during freeing, the barn can replace a full one
> with an empty one, unless over a full sheaves limit. In that case a
> sheaf is flushed to slab(s) by an internal bulk free operation. Flushing
> sheaves and barns is also wired to the existing cpu flushing and cache
> shrinking operations.
>
> The sheaves do not distinguish NUMA locality of the cached objects. If
> an allocation is requested with kmem_cache_alloc_node() with a specific
> node (not NUMA_NO_NODE), sheaves are bypassed.
>
> The bulk operations exposed to slab users also try to utilize the
> sheaves as long as the necessary (full or empty) sheaves are available
> on the cpu or in the barn. Once depleted, they will fallback to bulk
> alloc/free to slabs directly to avoid double copying.
>
> Sysfs stat counters alloc_cpu_sheaf and free_cpu_sheaf count objects
> allocated or freed using the sheaves. Counters sheaf_refill,
> sheaf_flush_main and sheaf_flush_other count objects filled or flushed
> from or to slab pages, and can be used to assess how effective the
> caching is. The refill and flush operations will also count towards the
> usual alloc_fastpath/slowpath, free_fastpath/slowpath and other
> counters.
>
> Access to the percpu sheaves is protected by local_lock_irqsave()
> operations, each per-NUMA-node barn has a spin_lock.
>
> A current limitation is that when slub_debug is enabled for a cache with
> percpu sheaves, the objects in the array are considered as allocated from
> the slub_debug perspective, and the alloc/free debugging hooks occur
> when moving the objects between the array and slab pages. This means
> that e.g. an use-after-free that occurs for an object cached in the
> array is undetected. Collected alloc/free stacktraces might also be less
> useful. This limitation could be changed in the future.
>
> On the other hand, KASAN, kmemcg and other hooks are executed on actual
> allocations and frees by kmem_cache users even if those use the array,
> so their debugging or accounting accuracy should be unaffected.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Only one possible issue in __pcs_flush_all_cpu(), all other comments
are nits and suggestions.

> ---
>  include/linux/slab.h |  34 ++
>  mm/slab.h            |   2 +
>  mm/slab_common.c     |   5 +-
>  mm/slub.c            | 982 ++++++++++++++++++++++++++++++++++++++++++++++++---
>  4 files changed, 973 insertions(+), 50 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 7686054dd494cc65def7f58748718e03eb78e481..0e1b25228c77140d05b5b4433c9d7923de36ec05 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -332,6 +332,40 @@ struct kmem_cache_args {
>          * %NULL means no constructor.
>          */
>         void (*ctor)(void *);
> +       /**
> +        * @sheaf_capacity: Enable sheaves of given capacity for the cache.
> +        *
> +        * With a non-zero value, allocations from the cache go through caching
> +        * arrays called sheaves. Each cpu has a main sheaf that's always
> +        * present, and a spare sheaf thay may be not present. When both become
> +        * empty, there's an attempt to replace an empty sheaf with a full sheaf
> +        * from the per-node barn.
> +        *
> +        * When no full sheaf is available, and gfp flags allow blocking, a
> +        * sheaf is allocated and filled from slab(s) using bulk allocation.
> +        * Otherwise the allocation falls back to the normal operation
> +        * allocating a single object from a slab.
> +        *
> +        * Analogically when freeing and both percpu sheaves are full, the barn
> +        * may replace it with an empty sheaf, unless it's over capacity. In
> +        * that case a sheaf is bulk freed to slab pages.
> +        *
> +        * The sheaves does not distinguish NUMA placement of objects, so
> +        * allocations via kmem_cache_alloc_node() with a node specified other
> +        * than NUMA_NO_NODE will bypass them.
> +        *
> +        * Bulk allocation and free operations also try to use the cpu sheaves
> +        * and barn, but fallback to using slab pages directly.
> +        *
> +        * Limitations: when slub_debug is enabled for the cache, all relevant
> +        * actions (i.e. poisoning, obtaining stacktraces) and checks happen
> +        * when objects move between sheaves and slab pages, which may result in
> +        * e.g. not detecting a use-after-free while the object is in the array
> +        * cache, and the stacktraces may be less useful.

I would also love to see a short comparison of sheaves (when objects
are freed using kfree_rcu()) vs SLAB_TYPESAFE_BY_RCU. I think both
mechanisms rcu-free objects in bulk but sheaves would not reuse an
object before RCU grace period is passed. Is that right?

> +        *
> +        * %0 means no sheaves will be created
> +        */
> +       unsigned int sheaf_capacity;
>  };
>
>  struct kmem_cache *__kmem_cache_create_args(const char *name,
> diff --git a/mm/slab.h b/mm/slab.h
> index 2f01c7317988ce036f0b22807403226a59f0f708..8daaec53b6ecfc44171191d421adb12e5cba2c58 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -259,6 +259,7 @@ struct kmem_cache {
>  #ifndef CONFIG_SLUB_TINY
>         struct kmem_cache_cpu __percpu *cpu_slab;
>  #endif
> +       struct slub_percpu_sheaves __percpu *cpu_sheaves;
>         /* Used for retrieving partial slabs, etc. */
>         slab_flags_t flags;
>         unsigned long min_partial;
> @@ -272,6 +273,7 @@ struct kmem_cache {
>         /* Number of per cpu partial slabs to keep around */
>         unsigned int cpu_partial_slabs;
>  #endif
> +       unsigned int sheaf_capacity;
>         struct kmem_cache_order_objects oo;
>
>         /* Allocation and freeing of slabs */
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 46d0a4cd33b5982fd79c307d572f231fdea9514a..ceeefb287899a82f30ad79b403556001c1860311 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -163,6 +163,9 @@ int slab_unmergeable(struct kmem_cache *s)
>                 return 1;
>  #endif
>
> +       if (s->cpu_sheaves)
> +               return 1;
> +
>         /*
>          * We may have set a slab to be unmergeable during bootstrap.
>          */
> @@ -328,7 +331,7 @@ struct kmem_cache *__kmem_cache_create_args(const char *name,
>                     object_size - args->usersize < args->useroffset))
>                 args->usersize = args->useroffset = 0;
>
> -       if (!args->usersize)
> +       if (!args->usersize && !args->sheaf_capacity)
>                 s = __kmem_cache_alias(name, object_size, args->align, flags,
>                                        args->ctor);
>         if (s)
> diff --git a/mm/slub.c b/mm/slub.c
> index e8273f28656936c05d015c53923f8fe69cd161b2..c06734912972b799f537359f7fe6a750918ffe9e 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -346,8 +346,10 @@ static inline void debugfs_slab_add(struct kmem_cache *s) { }
>  #endif
>
>  enum stat_item {
> +       ALLOC_PCS,              /* Allocation from percpu sheaf */
>         ALLOC_FASTPATH,         /* Allocation from cpu slab */
>         ALLOC_SLOWPATH,         /* Allocation by getting a new cpu slab */
> +       FREE_PCS,               /* Free to percpu sheaf */
>         FREE_FASTPATH,          /* Free to cpu slab */
>         FREE_SLOWPATH,          /* Freeing not to cpu slab */
>         FREE_FROZEN,            /* Freeing to frozen slab */
> @@ -372,6 +374,12 @@ enum stat_item {
>         CPU_PARTIAL_FREE,       /* Refill cpu partial on free */
>         CPU_PARTIAL_NODE,       /* Refill cpu partial from node partial */
>         CPU_PARTIAL_DRAIN,      /* Drain cpu partial to node partial */
> +       SHEAF_FLUSH_MAIN,       /* Objects flushed from main percpu sheaf */
> +       SHEAF_FLUSH_OTHER,      /* Objects flushed from other sheaves */
> +       SHEAF_REFILL,           /* Objects refilled to a sheaf */
> +       SHEAF_SWAP,             /* Swapping main and spare sheaf */
> +       SHEAF_ALLOC,            /* Allocation of an empty sheaf */
> +       SHEAF_FREE,             /* Freeing of an empty sheaf */
>         NR_SLUB_STAT_ITEMS
>  };
>
> @@ -418,6 +426,35 @@ void stat_add(const struct kmem_cache *s, enum stat_item si, int v)
>  #endif
>  }
>
> +#define MAX_FULL_SHEAVES       10
> +#define MAX_EMPTY_SHEAVES      10
> +
> +struct node_barn {
> +       spinlock_t lock;
> +       struct list_head sheaves_full;
> +       struct list_head sheaves_empty;
> +       unsigned int nr_full;
> +       unsigned int nr_empty;
> +};
> +
> +struct slab_sheaf {
> +       union {
> +               struct rcu_head rcu_head;
> +               struct list_head barn_list;
> +       };
> +       struct kmem_cache *cache;
> +       unsigned int size;
> +       void *objects[];
> +};
> +
> +struct slub_percpu_sheaves {
> +       local_lock_t lock;
> +       struct slab_sheaf *main; /* never NULL when unlocked */
> +       struct slab_sheaf *spare; /* empty or full, may be NULL */
> +       struct slab_sheaf *rcu_free;

Would be nice to have a short comment for rcu_free as well. I could
guess what main and spare are but for rcu_free had to look further.

> +       struct node_barn *barn;
> +};
> +
>  /*
>   * The slab lists for all objects.
>   */
> @@ -430,6 +467,7 @@ struct kmem_cache_node {
>         atomic_long_t total_objects;
>         struct list_head full;
>  #endif
> +       struct node_barn *barn;
>  };
>
>  static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
> @@ -453,12 +491,19 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
>   */
>  static nodemask_t slab_nodes;
>
> -#ifndef CONFIG_SLUB_TINY
>  /*
>   * Workqueue used for flush_cpu_slab().
>   */
>  static struct workqueue_struct *flushwq;
> -#endif
> +
> +struct slub_flush_work {
> +       struct work_struct work;
> +       struct kmem_cache *s;
> +       bool skip;
> +};
> +
> +static DEFINE_MUTEX(flush_lock);
> +static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
>
>  /********************************************************************
>   *                     Core slab cache functions
> @@ -2410,6 +2455,349 @@ static void *setup_object(struct kmem_cache *s, void *object)
>         return object;
>  }
>
> +static struct slab_sheaf *alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp)
> +{
> +       struct slab_sheaf *sheaf = kzalloc(struct_size(sheaf, objects,
> +                                       s->sheaf_capacity), gfp);
> +
> +       if (unlikely(!sheaf))
> +               return NULL;
> +
> +       sheaf->cache = s;
> +
> +       stat(s, SHEAF_ALLOC);
> +
> +       return sheaf;
> +}
> +
> +static void free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf)
> +{
> +       kfree(sheaf);
> +
> +       stat(s, SHEAF_FREE);
> +}
> +
> +static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
> +                                  size_t size, void **p);
> +
> +
> +static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
> +                        gfp_t gfp)
> +{
> +       int to_fill = s->sheaf_capacity - sheaf->size;
> +       int filled;
> +
> +       if (!to_fill)
> +               return 0;
> +
> +       filled = __kmem_cache_alloc_bulk(s, gfp, to_fill,
> +                                        &sheaf->objects[sheaf->size]);
> +
> +       if (!filled)
> +               return -ENOMEM;
> +
> +       sheaf->size = s->sheaf_capacity;

nit: __kmem_cache_alloc_bulk() either allocates requested number of
objects or returns 0, so the current code is fine but if at some point
the implementation changes so that it can return smaller number of
objects than requested (filled < to_fill) then the above assignment
will become invalid. I think a safer thing here would be to just:

       sheaf->size += filled;

which also makes logical sense. Alternatively you could add
VM_BUG_ON(filled != to_fill) but the increment I think would be
better.

> +
> +       stat_add(s, SHEAF_REFILL, filled);
> +
> +       return 0;
> +}
> +
> +
> +static struct slab_sheaf *alloc_full_sheaf(struct kmem_cache *s, gfp_t gfp)
> +{
> +       struct slab_sheaf *sheaf = alloc_empty_sheaf(s, gfp);
> +
> +       if (!sheaf)
> +               return NULL;
> +
> +       if (refill_sheaf(s, sheaf, gfp)) {
> +               free_empty_sheaf(s, sheaf);
> +               return NULL;
> +       }
> +
> +       return sheaf;
> +}
> +
> +/*
> + * Maximum number of objects freed during a single flush of main pcs sheaf.
> + * Translates directly to an on-stack array size.
> + */
> +#define PCS_BATCH_MAX  32U
> +
.> +static void __kmem_cache_free_bulk(struct kmem_cache *s, size_t
size, void **p);
> +

A comment clarifying why you are freeing in PCS_BATCH_MAX batches here
would be helpful. My understanding is that you do that to free objects
outside of the cpu_sheaves->lock, so you isolate a batch, release the
lock and then free the batch.

> +static void sheaf_flush_main(struct kmem_cache *s)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +       unsigned int batch, remaining;
> +       void *objects[PCS_BATCH_MAX];
> +       struct slab_sheaf *sheaf;
> +       unsigned long flags;
> +
> +next_batch:
> +       local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> +       pcs = this_cpu_ptr(s->cpu_sheaves);
> +       sheaf = pcs->main;
> +
> +       batch = min(PCS_BATCH_MAX, sheaf->size);
> +
> +       sheaf->size -= batch;
> +       memcpy(objects, sheaf->objects + sheaf->size, batch * sizeof(void *));
> +
> +       remaining = sheaf->size;
> +
> +       local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +
> +       __kmem_cache_free_bulk(s, batch, &objects[0]);
> +
> +       stat_add(s, SHEAF_FLUSH_MAIN, batch);
> +
> +       if (remaining)
> +               goto next_batch;
> +}
> +

This function seems to be used against either isolated sheaves or in
slub_cpu_dead() --> __pcs_flush_all_cpu() path where we hold
slab_mutex and I think that guarantees that the sheaf is unused. Maybe
a short comment clarifying this requirement or rename the function to
reflect that? Something like flush_unused_sheaf()?

> +static void sheaf_flush(struct kmem_cache *s, struct slab_sheaf *sheaf)
> +{
> +       if (!sheaf->size)
> +               return;
> +
> +       stat_add(s, SHEAF_FLUSH_OTHER, sheaf->size);
> +
> +       __kmem_cache_free_bulk(s, sheaf->size, &sheaf->objects[0]);
> +
> +       sheaf->size = 0;
> +}
> +
> +/*
> + * Caller needs to make sure migration is disabled in order to fully flush
> + * single cpu's sheaves
> + *
> + * flushing operations are rare so let's keep it simple and flush to slabs
> + * directly, skipping the barn
> + */
> +static void pcs_flush_all(struct kmem_cache *s)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +       struct slab_sheaf *spare, *rcu_free;
> +       unsigned long flags;
> +
> +       local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> +       pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +       spare = pcs->spare;
> +       pcs->spare = NULL;
> +
> +       rcu_free = pcs->rcu_free;
> +       pcs->rcu_free = NULL;
> +
> +       local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +
> +       if (spare) {
> +               sheaf_flush(s, spare);
> +               free_empty_sheaf(s, spare);
> +       }
> +
> +       // TODO: handle rcu_free
> +       BUG_ON(rcu_free);
> +
> +       sheaf_flush_main(s);
> +}
> +
> +static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +
> +       pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> +
> +       if (pcs->spare) {
> +               sheaf_flush(s, pcs->spare);
> +               free_empty_sheaf(s, pcs->spare);
> +               pcs->spare = NULL;
> +       }
> +
> +       // TODO: handle rcu_free
> +       BUG_ON(pcs->rcu_free);
> +
> +       sheaf_flush_main(s);

Hmm. sheaf_flush_main() always flushes for this_cpu only, so IIUC this
call will not necessarily flush the main sheaf for the cpu passed to
__pcs_flush_all_cpu().

> +}
> +
> +static void pcs_destroy(struct kmem_cache *s)
> +{
> +       int cpu;
> +
> +       for_each_possible_cpu(cpu) {
> +               struct slub_percpu_sheaves *pcs;
> +
> +               pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> +
> +               /* can happen when unwinding failed create */
> +               if (!pcs->main)
> +                       continue;
> +
> +               WARN_ON(pcs->spare);
> +               WARN_ON(pcs->rcu_free);
> +
> +               if (!WARN_ON(pcs->main->size)) {
> +                       free_empty_sheaf(s, pcs->main);
> +                       pcs->main = NULL;
> +               }
> +       }
> +
> +       free_percpu(s->cpu_sheaves);
> +       s->cpu_sheaves = NULL;
> +}
> +
> +static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn)
> +{
> +       struct slab_sheaf *empty = NULL;
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&barn->lock, flags);
> +
> +       if (barn->nr_empty) {
> +               empty = list_first_entry(&barn->sheaves_empty,
> +                                        struct slab_sheaf, barn_list);
> +               list_del(&empty->barn_list);
> +               barn->nr_empty--;
> +       }
> +
> +       spin_unlock_irqrestore(&barn->lock, flags);
> +
> +       return empty;
> +}
> +
> +static int barn_put_empty_sheaf(struct node_barn *barn,
> +                               struct slab_sheaf *sheaf, bool ignore_limit)
> +{
> +       unsigned long flags;
> +       int ret = 0;
> +
> +       spin_lock_irqsave(&barn->lock, flags);
> +
> +       if (!ignore_limit && barn->nr_empty >= MAX_EMPTY_SHEAVES) {
> +               ret = -E2BIG;
> +       } else {
> +               list_add(&sheaf->barn_list, &barn->sheaves_empty);
> +               barn->nr_empty++;
> +       }
> +
> +       spin_unlock_irqrestore(&barn->lock, flags);
> +       return ret;
> +}
> +
> +static int barn_put_full_sheaf(struct node_barn *barn, struct slab_sheaf *sheaf,
> +                              bool ignore_limit)
> +{
> +       unsigned long flags;
> +       int ret = 0;
> +
> +       spin_lock_irqsave(&barn->lock, flags);
> +
> +       if (!ignore_limit && barn->nr_full >= MAX_FULL_SHEAVES) {
> +               ret = -E2BIG;
> +       } else {
> +               list_add(&sheaf->barn_list, &barn->sheaves_full);
> +               barn->nr_full++;
> +       }
> +
> +       spin_unlock_irqrestore(&barn->lock, flags);
> +       return ret;
> +}
> +
> +/*
> + * If a full sheaf is available, return it and put the supplied empty one to
> + * barn. We ignore the limit on empty sheaves as the number of sheaves doesn't
> + * change.
> + */
> +static struct slab_sheaf *
> +barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty)
> +{
> +       struct slab_sheaf *full = NULL;
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&barn->lock, flags);
> +
> +       if (barn->nr_full) {
> +               full = list_first_entry(&barn->sheaves_full, struct slab_sheaf,
> +                                       barn_list);
> +               list_del(&full->barn_list);
> +               list_add(&empty->barn_list, &barn->sheaves_empty);
> +               barn->nr_full--;
> +               barn->nr_empty++;
> +       }
> +
> +       spin_unlock_irqrestore(&barn->lock, flags);
> +
> +       return full;
> +}
> +/*
> + * If a empty sheaf is available, return it and put the supplied full one to
> + * barn. But if there are too many full sheaves, reject this with -E2BIG.
> + */
> +static struct slab_sheaf *
> +barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
> +{
> +       struct slab_sheaf *empty;
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&barn->lock, flags);
> +
> +       if (barn->nr_full >= MAX_FULL_SHEAVES) {
> +               empty = ERR_PTR(-E2BIG);
> +       } else if (!barn->nr_empty) {
> +               empty = ERR_PTR(-ENOMEM);
> +       } else {
> +               empty = list_first_entry(&barn->sheaves_empty, struct slab_sheaf,
> +                                        barn_list);
> +               list_del(&empty->barn_list);
> +               list_add(&full->barn_list, &barn->sheaves_full);
> +               barn->nr_empty--;
> +               barn->nr_full++;
> +       }
> +
> +       spin_unlock_irqrestore(&barn->lock, flags);
> +
> +       return empty;
> +}
> +
> +static void barn_init(struct node_barn *barn)
> +{
> +       spin_lock_init(&barn->lock);
> +       INIT_LIST_HEAD(&barn->sheaves_full);
> +       INIT_LIST_HEAD(&barn->sheaves_empty);
> +       barn->nr_full = 0;
> +       barn->nr_empty = 0;
> +}
> +
> +static void barn_shrink(struct kmem_cache *s, struct node_barn *barn)
> +{
> +       struct list_head empty_list;
> +       struct list_head full_list;
> +       struct slab_sheaf *sheaf, *sheaf2;
> +       unsigned long flags;
> +
> +       INIT_LIST_HEAD(&empty_list);
> +       INIT_LIST_HEAD(&full_list);
> +
> +       spin_lock_irqsave(&barn->lock, flags);
> +
> +       list_splice_init(&barn->sheaves_full, &full_list);
> +       barn->nr_full = 0;
> +       list_splice_init(&barn->sheaves_empty, &empty_list);
> +       barn->nr_empty = 0;
> +
> +       spin_unlock_irqrestore(&barn->lock, flags);
> +
> +       list_for_each_entry_safe(sheaf, sheaf2, &full_list, barn_list) {
> +               sheaf_flush(s, sheaf);
> +               list_move(&sheaf->barn_list, &empty_list);
> +       }
> +
> +       list_for_each_entry_safe(sheaf, sheaf2, &empty_list, barn_list)
> +               free_empty_sheaf(s, sheaf);
> +}
> +
>  /*
>   * Slab allocation and freeing
>   */
> @@ -3280,11 +3668,42 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
>         put_partials_cpu(s, c);
>  }
>
> -struct slub_flush_work {
> -       struct work_struct work;
> -       struct kmem_cache *s;
> -       bool skip;
> -};
> +static inline void flush_this_cpu_slab(struct kmem_cache *s)
> +{
> +       struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
> +
> +       if (c->slab)
> +               flush_slab(s, c);
> +
> +       put_partials(s);
> +}
> +
> +static bool has_cpu_slab(int cpu, struct kmem_cache *s)
> +{
> +       struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
> +
> +       return c->slab || slub_percpu_partial(c);
> +}
> +
> +#else /* CONFIG_SLUB_TINY */
> +static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu) { }
> +static inline bool has_cpu_slab(int cpu, struct kmem_cache *s) { return false; }
> +static inline void flush_this_cpu_slab(struct kmem_cache *s) { }
> +#endif /* CONFIG_SLUB_TINY */
> +
> +static bool has_pcs_used(int cpu, struct kmem_cache *s)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +
> +       if (!s->cpu_sheaves)
> +               return false;
> +
> +       pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> +
> +       return (pcs->spare || pcs->rcu_free || pcs->main->size);
> +}
> +
> +static void pcs_flush_all(struct kmem_cache *s);
>
>  /*
>   * Flush cpu slab.
> @@ -3294,30 +3713,18 @@ struct slub_flush_work {
>  static void flush_cpu_slab(struct work_struct *w)
>  {
>         struct kmem_cache *s;
> -       struct kmem_cache_cpu *c;
>         struct slub_flush_work *sfw;
>
>         sfw = container_of(w, struct slub_flush_work, work);
>
>         s = sfw->s;
> -       c = this_cpu_ptr(s->cpu_slab);
>
> -       if (c->slab)
> -               flush_slab(s, c);
> -
> -       put_partials(s);
> -}
> -
> -static bool has_cpu_slab(int cpu, struct kmem_cache *s)
> -{
> -       struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
> +       if (s->cpu_sheaves)
> +               pcs_flush_all(s);
>
> -       return c->slab || slub_percpu_partial(c);
> +       flush_this_cpu_slab(s);
>  }
>
> -static DEFINE_MUTEX(flush_lock);
> -static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
> -
>  static void flush_all_cpus_locked(struct kmem_cache *s)
>  {
>         struct slub_flush_work *sfw;
> @@ -3328,7 +3735,7 @@ static void flush_all_cpus_locked(struct kmem_cache *s)
>
>         for_each_online_cpu(cpu) {
>                 sfw = &per_cpu(slub_flush, cpu);
> -               if (!has_cpu_slab(cpu, s)) {
> +               if (!has_cpu_slab(cpu, s) && !has_pcs_used(cpu, s)) {
>                         sfw->skip = true;
>                         continue;
>                 }
> @@ -3364,19 +3771,14 @@ static int slub_cpu_dead(unsigned int cpu)
>         struct kmem_cache *s;
>
>         mutex_lock(&slab_mutex);
> -       list_for_each_entry(s, &slab_caches, list)
> +       list_for_each_entry(s, &slab_caches, list) {
>                 __flush_cpu_slab(s, cpu);
> +               __pcs_flush_all_cpu(s, cpu);
> +       }
>         mutex_unlock(&slab_mutex);
>         return 0;
>  }
>
> -#else /* CONFIG_SLUB_TINY */
> -static inline void flush_all_cpus_locked(struct kmem_cache *s) { }
> -static inline void flush_all(struct kmem_cache *s) { }
> -static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu) { }
> -static inline int slub_cpu_dead(unsigned int cpu) { return 0; }
> -#endif /* CONFIG_SLUB_TINY */
> -
>  /*
>   * Check if the objects in a per cpu structure fit numa
>   * locality expectations.
> @@ -4126,6 +4528,173 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
>         return memcg_slab_post_alloc_hook(s, lru, flags, size, p);
>  }
>
> +static __fastpath_inline
> +void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +       unsigned long flags;
> +       void *object;
> +
> +       local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> +       pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +       if (unlikely(pcs->main->size == 0)) {
> +
> +               struct slab_sheaf *empty = NULL;
> +               struct slab_sheaf *full;
> +               bool can_alloc;
> +
> +               if (pcs->spare && pcs->spare->size > 0) {
> +                       stat(s, SHEAF_SWAP);
> +                       swap(pcs->main, pcs->spare);
> +                       goto do_alloc;
> +               }
> +
> +               full = barn_replace_empty_sheaf(pcs->barn, pcs->main);
> +
> +               if (full) {
> +                       pcs->main = full;
> +                       goto do_alloc;
> +               }
> +
> +               can_alloc = gfpflags_allow_blocking(gfp);
> +
> +               if (can_alloc) {
> +                       if (pcs->spare) {
> +                               empty = pcs->spare;
> +                               pcs->spare = NULL;
> +                       } else {
> +                               empty = barn_get_empty_sheaf(pcs->barn);
> +                       }
> +               }
> +
> +               local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +
> +               if (!can_alloc)
> +                       return NULL;
> +
> +               if (empty) {
> +                       if (!refill_sheaf(s, empty, gfp)) {
> +                               full = empty;
> +                       } else {
> +                               /*
> +                                * we must be very low on memory so don't bother
> +                                * with the barn
> +                                */
> +                               free_empty_sheaf(s, empty);
> +                       }
> +               } else {
> +                       full = alloc_full_sheaf(s, gfp);
> +               }
> +
> +               if (!full)
> +                       return NULL;
> +
> +               local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> +               pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +               /*
> +                * If we are returning empty sheaf, we either got it from the
> +                * barn or had to allocate one. If we are returning a full
> +                * sheaf, it's due to racing or being migrated to a different
> +                * cpu. Breaching the barn's sheaf limits should be thus rare
> +                * enough so just ignore them to simplify the recovery.
> +                */
> +
> +               if (pcs->main->size == 0) {
> +                       barn_put_empty_sheaf(pcs->barn, pcs->main, true);
> +                       pcs->main = full;
> +                       goto do_alloc;
> +               }
> +
> +               if (!pcs->spare) {
> +                       pcs->spare = full;
> +                       goto do_alloc;
> +               }
> +
> +               if (pcs->spare->size == 0) {
> +                       barn_put_empty_sheaf(pcs->barn, pcs->spare, true);
> +                       pcs->spare = full;
> +                       goto do_alloc;
> +               }
> +
> +               barn_put_full_sheaf(pcs->barn, full, true);
> +       }
> +
> +do_alloc:
> +       object = pcs->main->objects[--pcs->main->size];
> +
> +       local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +
> +       stat(s, ALLOC_PCS);
> +
> +       return object;
> +}
> +
> +static __fastpath_inline
> +unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +       struct slab_sheaf *main;
> +       unsigned long flags;
> +       unsigned int allocated = 0;
> +       unsigned int batch;
> +
> +next_batch:
> +       local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> +       pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +       if (unlikely(pcs->main->size == 0)) {
> +
> +               struct slab_sheaf *full;
> +
> +               if (pcs->spare && pcs->spare->size > 0) {
> +                       stat(s, SHEAF_SWAP);
> +                       swap(pcs->main, pcs->spare);
> +                       goto do_alloc;
> +               }
> +
> +               full = barn_replace_empty_sheaf(pcs->barn, pcs->main);
> +
> +               if (full) {
> +                       pcs->main = full;
> +                       goto do_alloc;
> +               }
> +
> +               local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +
> +               /*
> +                * Once full sheaves in barn are depleted, let the bulk
> +                * allocation continue from slab pages, otherwise we would just
> +                * be copying arrays of pointers twice.
> +                */
> +               return allocated;
> +       }
> +
> +do_alloc:
> +
> +       main = pcs->main;
> +       batch = min(size, main->size);
> +
> +       main->size -= batch;
> +       memcpy(p, main->objects + main->size, batch * sizeof(void *));
> +
> +       local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +
> +       stat_add(s, ALLOC_PCS, batch);
> +
> +       allocated += batch;
> +
> +       if (batch < size) {
> +               p += batch;
> +               size -= batch;
> +               goto next_batch;
> +       }
> +
> +       return allocated;
> +}
> +
> +
>  /*
>   * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
>   * have the fastpath folded into their functions. So no function call
> @@ -4150,7 +4719,11 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
>         if (unlikely(object))
>                 goto out;
>
> -       object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
> +       if (s->cpu_sheaves && (node == NUMA_NO_NODE))
> +               object = alloc_from_pcs(s, gfpflags);
> +
> +       if (!object)
> +               object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
>
>         maybe_wipe_obj_freeptr(s, object);
>         init = slab_want_init_on_alloc(gfpflags, s);
> @@ -4521,6 +5094,196 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>         discard_slab(s, slab);
>  }
>
> +/*
> + * Free an object to the percpu sheaves.
> + * The object is expected to have passed slab_free_hook() already.
> + */
> +static __fastpath_inline
> +void free_to_pcs(struct kmem_cache *s, void *object)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +       unsigned long flags;
> +
> +restart:
> +       local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> +       pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +       if (unlikely(pcs->main->size == s->sheaf_capacity)) {
> +
> +               struct slab_sheaf *empty;
> +
> +               if (!pcs->spare) {
> +                       empty = barn_get_empty_sheaf(pcs->barn);
> +                       if (empty) {
> +                               pcs->spare = pcs->main;
> +                               pcs->main = empty;
> +                               goto do_free;
> +                       }
> +                       goto alloc_empty;
> +               }
> +
> +               if (pcs->spare->size < s->sheaf_capacity) {
> +                       stat(s, SHEAF_SWAP);
> +                       swap(pcs->main, pcs->spare);
> +                       goto do_free;
> +               }
> +
> +               empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
> +
> +               if (!IS_ERR(empty)) {
> +                       pcs->main = empty;
> +                       goto do_free;
> +               }
> +
> +               if (PTR_ERR(empty) == -E2BIG) {
> +                       /* Since we got here, spare exists and is full */
> +                       struct slab_sheaf *to_flush = pcs->spare;
> +
> +                       pcs->spare = NULL;
> +                       local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +
> +                       sheaf_flush(s, to_flush);
> +                       empty = to_flush;
> +                       goto got_empty;
> +               }
> +
> +alloc_empty:
> +               local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +
> +               empty = alloc_empty_sheaf(s, GFP_NOWAIT);
> +
> +               if (!empty) {
> +                       sheaf_flush_main(s);
> +                       goto restart;
> +               }
> +
> +got_empty:
> +               local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> +               pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +               /*
> +                * if we put any sheaf to barn here, it's because we raced or
> +                * have been migrated to a different cpu, which should be rare
> +                * enough so just ignore the barn's limits to simplify
> +                */
> +               if (unlikely(pcs->main->size < s->sheaf_capacity)) {
> +                       if (!pcs->spare)
> +                               pcs->spare = empty;
> +                       else
> +                               barn_put_empty_sheaf(pcs->barn, empty, true);
> +                       goto do_free;
> +               }
> +
> +               if (!pcs->spare) {
> +                       pcs->spare = pcs->main;
> +                       pcs->main = empty;
> +                       goto do_free;
> +               }
> +
> +               barn_put_full_sheaf(pcs->barn, pcs->main, true);
> +               pcs->main = empty;

I find the program flow in this function quite complex and hard to
follow. I think refactoring the above block starting from "pcs =
this_cpu_ptr(s->cpu_sheaves)" would somewhat simplify it. That
eliminates the need for the "got_empty" label and makes the
locking/unlocking sequence of s->cpu_sheaves->lock a bit more clear.

> +       }
> +
> +do_free:
> +       pcs->main->objects[pcs->main->size++] = object;
> +
> +       local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +
> +       stat(s, FREE_PCS);
> +}
> +
> +/*
> + * Bulk free objects to the percpu sheaves.
> + * Unlike free_to_pcs() this includes the calls to all necessary hooks
> + * and the fallback to freeing to slab pages.
> + */
> +static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +       struct slab_sheaf *main;
> +       unsigned long flags;
> +       unsigned int batch, i = 0;
> +       bool init;
> +
> +       init = slab_want_init_on_free(s);
> +
> +       while (i < size) {
> +               struct slab *slab = virt_to_slab(p[i]);
> +
> +               memcg_slab_free_hook(s, slab, p + i, 1);
> +               alloc_tagging_slab_free_hook(s, slab, p + i, 1);
> +
> +               if (unlikely(!slab_free_hook(s, p[i], init, false))) {
> +                       p[i] = p[--size];
> +                       if (!size)
> +                               return;
> +                       continue;
> +               }
> +
> +               i++;
> +       }
> +
> +next_batch:
> +       local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> +       pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +       if (unlikely(pcs->main->size == s->sheaf_capacity)) {
> +
> +               struct slab_sheaf *empty;
> +
> +               if (!pcs->spare) {
> +                       empty = barn_get_empty_sheaf(pcs->barn);
> +                       if (empty) {
> +                               pcs->spare = pcs->main;
> +                               pcs->main = empty;
> +                               goto do_free;
> +                       }
> +                       goto no_empty;
> +               }
> +
> +               if (pcs->spare->size < s->sheaf_capacity) {
> +                       stat(s, SHEAF_SWAP);
> +                       swap(pcs->main, pcs->spare);
> +                       goto do_free;
> +               }
> +
> +               empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
> +
> +               if (!IS_ERR(empty)) {
> +                       pcs->main = empty;
> +                       goto do_free;
> +               }
> +
> +no_empty:
> +               local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +
> +               /*
> +                * if we depleted all empty sheaves in the barn or there are too
> +                * many full sheaves, free the rest to slab pages
> +                */
> +
> +               __kmem_cache_free_bulk(s, size, p);
> +               return;
> +       }
> +
> +do_free:
> +       main = pcs->main;
> +       batch = min(size, s->sheaf_capacity - main->size);
> +
> +       memcpy(main->objects + main->size, p, batch * sizeof(void *));
> +       main->size += batch;
> +
> +       local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +
> +       stat_add(s, FREE_PCS, batch);
> +
> +       if (batch < size) {
> +               p += batch;
> +               size -= batch;
> +               goto next_batch;
> +       }
> +}
> +
>  #ifndef CONFIG_SLUB_TINY
>  /*
>   * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
> @@ -4607,7 +5370,12 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
>         memcg_slab_free_hook(s, slab, &object, 1);
>         alloc_tagging_slab_free_hook(s, slab, &object, 1);
>
> -       if (likely(slab_free_hook(s, object, slab_want_init_on_free(s), false)))
> +       if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
> +               return;
> +
> +       if (s->cpu_sheaves)
> +               free_to_pcs(s, object);
> +       else
>                 do_slab_free(s, slab, object, object, 1, addr);
>  }
>
> @@ -5033,6 +5801,15 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
>         if (!size)
>                 return;
>
> +       /*
> +        * freeing to sheaves is so incompatible with the detached freelist so
> +        * once we go that way, we have to do everything differently
> +        */
> +       if (s && s->cpu_sheaves) {
> +               free_to_pcs_bulk(s, size, p);
> +               return;
> +       }
> +
>         do {
>                 struct detached_freelist df;
>
> @@ -5151,7 +5928,7 @@ static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
>  int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
>                                  void **p)
>  {
> -       int i;
> +       unsigned int i = 0;
>
>         if (!size)
>                 return 0;
> @@ -5160,9 +5937,21 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
>         if (unlikely(!s))
>                 return 0;
>
> -       i = __kmem_cache_alloc_bulk(s, flags, size, p);
> -       if (unlikely(i == 0))
> -               return 0;
> +       if (s->cpu_sheaves)
> +               i = alloc_from_pcs_bulk(s, size, p);
> +
> +       if (i < size) {
> +               unsigned int j = __kmem_cache_alloc_bulk(s, flags, size - i, p + i);
> +               /*
> +                * If we ran out of memory, don't bother with freeing back to
> +                * the percpu sheaves, we have bigger problems.
> +                */
> +               if (unlikely(j == 0)) {
> +                       if (i > 0)
> +                               __kmem_cache_free_bulk(s, i, p);
> +                       return 0;
> +               }
> +       }
>
>         /*
>          * memcg and kmem_cache debug support and memory initialization.
> @@ -5172,11 +5961,11 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
>                     slab_want_init_on_alloc(flags, s), s->object_size))) {
>                 return 0;
>         }
> -       return i;
> +
> +       return size;
>  }
>  EXPORT_SYMBOL(kmem_cache_alloc_bulk_noprof);
>
> -
>  /*
>   * Object placement in a slab is made very easy because we always start at
>   * offset 0. If we tune the size of the object to the alignment then we can
> @@ -5309,8 +6098,8 @@ static inline int calculate_order(unsigned int size)
>         return -ENOSYS;
>  }
>
> -static void
> -init_kmem_cache_node(struct kmem_cache_node *n)
> +static bool
> +init_kmem_cache_node(struct kmem_cache_node *n, struct node_barn *barn)
>  {
>         n->nr_partial = 0;
>         spin_lock_init(&n->list_lock);
> @@ -5320,6 +6109,11 @@ init_kmem_cache_node(struct kmem_cache_node *n)
>         atomic_long_set(&n->total_objects, 0);
>         INIT_LIST_HEAD(&n->full);
>  #endif
> +       n->barn = barn;
> +       if (barn)
> +               barn_init(barn);
> +
> +       return true;
>  }
>
>  #ifndef CONFIG_SLUB_TINY
> @@ -5350,6 +6144,30 @@ static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
>  }
>  #endif /* CONFIG_SLUB_TINY */
>
> +static int init_percpu_sheaves(struct kmem_cache *s)
> +{
> +       int cpu;
> +
> +       for_each_possible_cpu(cpu) {
> +               struct slub_percpu_sheaves *pcs;
> +               int nid;
> +
> +               pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> +
> +               local_lock_init(&pcs->lock);
> +
> +               nid = cpu_to_mem(cpu);
> +
> +               pcs->barn = get_node(s, nid)->barn;
> +               pcs->main = alloc_empty_sheaf(s, GFP_KERNEL);
> +
> +               if (!pcs->main)
> +                       return -ENOMEM;
> +       }
> +
> +       return 0;
> +}
> +
>  static struct kmem_cache *kmem_cache_node;
>
>  /*
> @@ -5385,7 +6203,7 @@ static void early_kmem_cache_node_alloc(int node)
>         slab->freelist = get_freepointer(kmem_cache_node, n);
>         slab->inuse = 1;
>         kmem_cache_node->node[node] = n;
> -       init_kmem_cache_node(n);
> +       init_kmem_cache_node(n, NULL);
>         inc_slabs_node(kmem_cache_node, node, slab->objects);
>
>         /*
> @@ -5401,6 +6219,13 @@ static void free_kmem_cache_nodes(struct kmem_cache *s)
>         struct kmem_cache_node *n;
>
>         for_each_kmem_cache_node(s, node, n) {
> +               if (n->barn) {
> +                       WARN_ON(n->barn->nr_full);
> +                       WARN_ON(n->barn->nr_empty);
> +                       kfree(n->barn);
> +                       n->barn = NULL;
> +               }
> +
>                 s->node[node] = NULL;
>                 kmem_cache_free(kmem_cache_node, n);
>         }
> @@ -5409,6 +6234,8 @@ static void free_kmem_cache_nodes(struct kmem_cache *s)
>  void __kmem_cache_release(struct kmem_cache *s)
>  {
>         cache_random_seq_destroy(s);
> +       if (s->cpu_sheaves)
> +               pcs_destroy(s);
>  #ifndef CONFIG_SLUB_TINY
>         free_percpu(s->cpu_slab);
>  #endif
> @@ -5421,20 +6248,27 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
>
>         for_each_node_mask(node, slab_nodes) {
>                 struct kmem_cache_node *n;
> +               struct node_barn *barn = NULL;
>
>                 if (slab_state == DOWN) {
>                         early_kmem_cache_node_alloc(node);
>                         continue;
>                 }
> +
> +               if (s->cpu_sheaves) {
> +                       barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, node);
> +
> +                       if (!barn)
> +                               return 0;
> +               }
> +
>                 n = kmem_cache_alloc_node(kmem_cache_node,
>                                                 GFP_KERNEL, node);
> -
> -               if (!n) {
> -                       free_kmem_cache_nodes(s);
> +               if (!n)
>                         return 0;
> -               }
>
> -               init_kmem_cache_node(n);
> +               init_kmem_cache_node(n, barn);
> +
>                 s->node[node] = n;
>         }
>         return 1;
> @@ -5690,6 +6524,8 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
>         flush_all_cpus_locked(s);
>         /* Attempt to free all objects */
>         for_each_kmem_cache_node(s, node, n) {
> +               if (n->barn)
> +                       barn_shrink(s, n->barn);
>                 free_partial(s, n);
>                 if (n->nr_partial || node_nr_slabs(n))
>                         return 1;
> @@ -5893,6 +6729,9 @@ static int __kmem_cache_do_shrink(struct kmem_cache *s)
>                 for (i = 0; i < SHRINK_PROMOTE_MAX; i++)
>                         INIT_LIST_HEAD(promote + i);
>
> +               if (n->barn)
> +                       barn_shrink(s, n->barn);
> +
>                 spin_lock_irqsave(&n->list_lock, flags);
>
>                 /*
> @@ -6005,12 +6844,24 @@ static int slab_mem_going_online_callback(void *arg)
>          */
>         mutex_lock(&slab_mutex);
>         list_for_each_entry(s, &slab_caches, list) {
> +               struct node_barn *barn = NULL;
> +
>                 /*
>                  * The structure may already exist if the node was previously
>                  * onlined and offlined.
>                  */
>                 if (get_node(s, nid))
>                         continue;
> +
> +               if (s->cpu_sheaves) {
> +                       barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, nid);
> +
> +                       if (!barn) {
> +                               ret = -ENOMEM;
> +                               goto out;
> +                       }
> +               }
> +
>                 /*
>                  * XXX: kmem_cache_alloc_node will fallback to other nodes
>                  *      since memory is not yet available from the node that
> @@ -6021,7 +6872,9 @@ static int slab_mem_going_online_callback(void *arg)
>                         ret = -ENOMEM;
>                         goto out;
>                 }
> -               init_kmem_cache_node(n);
> +
> +               init_kmem_cache_node(n, barn);
> +
>                 s->node[nid] = n;
>         }
>         /*
> @@ -6240,6 +7093,16 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
>
>         set_cpu_partial(s);
>
> +       if (args->sheaf_capacity) {
> +               s->cpu_sheaves = alloc_percpu(struct slub_percpu_sheaves);
> +               if (!s->cpu_sheaves) {
> +                       err = -ENOMEM;
> +                       goto out;
> +               }
> +               // TODO: increase capacity to grow slab_sheaf up to next kmalloc size?
> +               s->sheaf_capacity = args->sheaf_capacity;
> +       }
> +
>  #ifdef CONFIG_NUMA
>         s->remote_node_defrag_ratio = 1000;
>  #endif
> @@ -6256,6 +7119,12 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
>         if (!alloc_kmem_cache_cpus(s))
>                 goto out;
>
> +       if (s->cpu_sheaves) {
> +               err = init_percpu_sheaves(s);
> +               if (err)
> +                       goto out;
> +       }
> +
>         err = 0;
>
>         /* Mutex is not taken during early boot */
> @@ -6277,7 +7146,6 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
>                 __kmem_cache_release(s);
>         return err;
>  }
> -
>  #ifdef SLAB_SUPPORTS_SYSFS
>  static int count_inuse(struct slab *slab)
>  {
> @@ -7055,8 +7923,10 @@ static ssize_t text##_store(struct kmem_cache *s,                \
>  }                                                              \
>  SLAB_ATTR(text);                                               \
>
> +STAT_ATTR(ALLOC_PCS, alloc_cpu_sheaf);
>  STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
>  STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
> +STAT_ATTR(FREE_PCS, free_cpu_sheaf);
>  STAT_ATTR(FREE_FASTPATH, free_fastpath);
>  STAT_ATTR(FREE_SLOWPATH, free_slowpath);
>  STAT_ATTR(FREE_FROZEN, free_frozen);
> @@ -7081,6 +7951,12 @@ STAT_ATTR(CPU_PARTIAL_ALLOC, cpu_partial_alloc);
>  STAT_ATTR(CPU_PARTIAL_FREE, cpu_partial_free);
>  STAT_ATTR(CPU_PARTIAL_NODE, cpu_partial_node);
>  STAT_ATTR(CPU_PARTIAL_DRAIN, cpu_partial_drain);
> +STAT_ATTR(SHEAF_FLUSH_MAIN, sheaf_flush_main);
> +STAT_ATTR(SHEAF_FLUSH_OTHER, sheaf_flush_other);
> +STAT_ATTR(SHEAF_REFILL, sheaf_refill);
> +STAT_ATTR(SHEAF_SWAP, sheaf_swap);
> +STAT_ATTR(SHEAF_ALLOC, sheaf_alloc);
> +STAT_ATTR(SHEAF_FREE, sheaf_free);
>  #endif /* CONFIG_SLUB_STATS */
>
>  #ifdef CONFIG_KFENCE
> @@ -7142,8 +8018,10 @@ static struct attribute *slab_attrs[] = {
>         &remote_node_defrag_ratio_attr.attr,
>  #endif
>  #ifdef CONFIG_SLUB_STATS
> +       &alloc_cpu_sheaf_attr.attr,
>         &alloc_fastpath_attr.attr,
>         &alloc_slowpath_attr.attr,
> +       &free_cpu_sheaf_attr.attr,
>         &free_fastpath_attr.attr,
>         &free_slowpath_attr.attr,
>         &free_frozen_attr.attr,
> @@ -7168,6 +8046,12 @@ static struct attribute *slab_attrs[] = {
>         &cpu_partial_free_attr.attr,
>         &cpu_partial_node_attr.attr,
>         &cpu_partial_drain_attr.attr,
> +       &sheaf_flush_main_attr.attr,
> +       &sheaf_flush_other_attr.attr,
> +       &sheaf_refill_attr.attr,
> +       &sheaf_swap_attr.attr,
> +       &sheaf_alloc_attr.attr,
> +       &sheaf_free_attr.attr,
>  #endif
>  #ifdef CONFIG_FAILSLAB
>         &failslab_attr.attr,
>
> --
> 2.48.1
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 01/10] slab: add opt-in caching layer of percpu sheaves
  2025-02-22 22:46   ` Suren Baghdasaryan
@ 2025-02-22 22:56     ` Suren Baghdasaryan
  2025-03-12 14:57     ` Vlastimil Babka
  1 sibling, 0 replies; 55+ messages in thread
From: Suren Baghdasaryan @ 2025-02-22 22:56 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Sat, Feb 22, 2025 at 2:46 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Fri, Feb 14, 2025 at 8:27 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > Specifying a non-zero value for a new struct kmem_cache_args field
> > sheaf_capacity will setup a caching layer of percpu arrays called
> > sheaves of given capacity for the created cache.
> >
> > Allocations from the cache will allocate via the percpu sheaves (main or
> > spare) as long as they have no NUMA node preference. Frees will also
> > refill one of the sheaves.
> >
> > When both percpu sheaves are found empty during an allocation, an empty
> > sheaf may be replaced with a full one from the per-node barn. If none
> > are available and the allocation is allowed to block, an empty sheaf is
> > refilled from slab(s) by an internal bulk alloc operation. When both
> > percpu sheaves are full during freeing, the barn can replace a full one
> > with an empty one, unless over a full sheaves limit. In that case a
> > sheaf is flushed to slab(s) by an internal bulk free operation. Flushing
> > sheaves and barns is also wired to the existing cpu flushing and cache
> > shrinking operations.
> >
> > The sheaves do not distinguish NUMA locality of the cached objects. If
> > an allocation is requested with kmem_cache_alloc_node() with a specific
> > node (not NUMA_NO_NODE), sheaves are bypassed.
> >
> > The bulk operations exposed to slab users also try to utilize the
> > sheaves as long as the necessary (full or empty) sheaves are available
> > on the cpu or in the barn. Once depleted, they will fallback to bulk
> > alloc/free to slabs directly to avoid double copying.
> >
> > Sysfs stat counters alloc_cpu_sheaf and free_cpu_sheaf count objects
> > allocated or freed using the sheaves. Counters sheaf_refill,
> > sheaf_flush_main and sheaf_flush_other count objects filled or flushed
> > from or to slab pages, and can be used to assess how effective the
> > caching is. The refill and flush operations will also count towards the
> > usual alloc_fastpath/slowpath, free_fastpath/slowpath and other
> > counters.
> >
> > Access to the percpu sheaves is protected by local_lock_irqsave()
> > operations, each per-NUMA-node barn has a spin_lock.
> >
> > A current limitation is that when slub_debug is enabled for a cache with
> > percpu sheaves, the objects in the array are considered as allocated from
> > the slub_debug perspective, and the alloc/free debugging hooks occur
> > when moving the objects between the array and slab pages. This means
> > that e.g. an use-after-free that occurs for an object cached in the
> > array is undetected. Collected alloc/free stacktraces might also be less
> > useful. This limitation could be changed in the future.
> >
> > On the other hand, KASAN, kmemcg and other hooks are executed on actual
> > allocations and frees by kmem_cache users even if those use the array,
> > so their debugging or accounting accuracy should be unaffected.
> >
> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>
> Only one possible issue in __pcs_flush_all_cpu(), all other comments
> are nits and suggestions.
>
> > ---
> >  include/linux/slab.h |  34 ++
> >  mm/slab.h            |   2 +
> >  mm/slab_common.c     |   5 +-
> >  mm/slub.c            | 982 ++++++++++++++++++++++++++++++++++++++++++++++++---
> >  4 files changed, 973 insertions(+), 50 deletions(-)
> >
> > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > index 7686054dd494cc65def7f58748718e03eb78e481..0e1b25228c77140d05b5b4433c9d7923de36ec05 100644
> > --- a/include/linux/slab.h
> > +++ b/include/linux/slab.h
> > @@ -332,6 +332,40 @@ struct kmem_cache_args {
> >          * %NULL means no constructor.
> >          */
> >         void (*ctor)(void *);
> > +       /**
> > +        * @sheaf_capacity: Enable sheaves of given capacity for the cache.
> > +        *
> > +        * With a non-zero value, allocations from the cache go through caching
> > +        * arrays called sheaves. Each cpu has a main sheaf that's always
> > +        * present, and a spare sheaf thay may be not present. When both become
> > +        * empty, there's an attempt to replace an empty sheaf with a full sheaf
> > +        * from the per-node barn.
> > +        *
> > +        * When no full sheaf is available, and gfp flags allow blocking, a
> > +        * sheaf is allocated and filled from slab(s) using bulk allocation.
> > +        * Otherwise the allocation falls back to the normal operation
> > +        * allocating a single object from a slab.
> > +        *
> > +        * Analogically when freeing and both percpu sheaves are full, the barn
> > +        * may replace it with an empty sheaf, unless it's over capacity. In
> > +        * that case a sheaf is bulk freed to slab pages.
> > +        *
> > +        * The sheaves does not distinguish NUMA placement of objects, so
> > +        * allocations via kmem_cache_alloc_node() with a node specified other
> > +        * than NUMA_NO_NODE will bypass them.
> > +        *
> > +        * Bulk allocation and free operations also try to use the cpu sheaves
> > +        * and barn, but fallback to using slab pages directly.
> > +        *
> > +        * Limitations: when slub_debug is enabled for the cache, all relevant
> > +        * actions (i.e. poisoning, obtaining stacktraces) and checks happen
> > +        * when objects move between sheaves and slab pages, which may result in
> > +        * e.g. not detecting a use-after-free while the object is in the array
> > +        * cache, and the stacktraces may be less useful.
>
> I would also love to see a short comparison of sheaves (when objects
> are freed using kfree_rcu()) vs SLAB_TYPESAFE_BY_RCU. I think both
> mechanisms rcu-free objects in bulk but sheaves would not reuse an
> object before RCU grace period is passed. Is that right?
>
> > +        *
> > +        * %0 means no sheaves will be created
> > +        */
> > +       unsigned int sheaf_capacity;
> >  };
> >
> >  struct kmem_cache *__kmem_cache_create_args(const char *name,
> > diff --git a/mm/slab.h b/mm/slab.h
> > index 2f01c7317988ce036f0b22807403226a59f0f708..8daaec53b6ecfc44171191d421adb12e5cba2c58 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -259,6 +259,7 @@ struct kmem_cache {
> >  #ifndef CONFIG_SLUB_TINY
> >         struct kmem_cache_cpu __percpu *cpu_slab;
> >  #endif
> > +       struct slub_percpu_sheaves __percpu *cpu_sheaves;
> >         /* Used for retrieving partial slabs, etc. */
> >         slab_flags_t flags;
> >         unsigned long min_partial;
> > @@ -272,6 +273,7 @@ struct kmem_cache {
> >         /* Number of per cpu partial slabs to keep around */
> >         unsigned int cpu_partial_slabs;
> >  #endif
> > +       unsigned int sheaf_capacity;
> >         struct kmem_cache_order_objects oo;
> >
> >         /* Allocation and freeing of slabs */
> > diff --git a/mm/slab_common.c b/mm/slab_common.c
> > index 46d0a4cd33b5982fd79c307d572f231fdea9514a..ceeefb287899a82f30ad79b403556001c1860311 100644
> > --- a/mm/slab_common.c
> > +++ b/mm/slab_common.c
> > @@ -163,6 +163,9 @@ int slab_unmergeable(struct kmem_cache *s)
> >                 return 1;
> >  #endif
> >
> > +       if (s->cpu_sheaves)
> > +               return 1;
> > +
> >         /*
> >          * We may have set a slab to be unmergeable during bootstrap.
> >          */
> > @@ -328,7 +331,7 @@ struct kmem_cache *__kmem_cache_create_args(const char *name,
> >                     object_size - args->usersize < args->useroffset))
> >                 args->usersize = args->useroffset = 0;
> >
> > -       if (!args->usersize)
> > +       if (!args->usersize && !args->sheaf_capacity)
> >                 s = __kmem_cache_alias(name, object_size, args->align, flags,
> >                                        args->ctor);
> >         if (s)
> > diff --git a/mm/slub.c b/mm/slub.c
> > index e8273f28656936c05d015c53923f8fe69cd161b2..c06734912972b799f537359f7fe6a750918ffe9e 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -346,8 +346,10 @@ static inline void debugfs_slab_add(struct kmem_cache *s) { }
> >  #endif
> >
> >  enum stat_item {
> > +       ALLOC_PCS,              /* Allocation from percpu sheaf */
> >         ALLOC_FASTPATH,         /* Allocation from cpu slab */
> >         ALLOC_SLOWPATH,         /* Allocation by getting a new cpu slab */
> > +       FREE_PCS,               /* Free to percpu sheaf */
> >         FREE_FASTPATH,          /* Free to cpu slab */
> >         FREE_SLOWPATH,          /* Freeing not to cpu slab */
> >         FREE_FROZEN,            /* Freeing to frozen slab */
> > @@ -372,6 +374,12 @@ enum stat_item {
> >         CPU_PARTIAL_FREE,       /* Refill cpu partial on free */
> >         CPU_PARTIAL_NODE,       /* Refill cpu partial from node partial */
> >         CPU_PARTIAL_DRAIN,      /* Drain cpu partial to node partial */
> > +       SHEAF_FLUSH_MAIN,       /* Objects flushed from main percpu sheaf */
> > +       SHEAF_FLUSH_OTHER,      /* Objects flushed from other sheaves */
> > +       SHEAF_REFILL,           /* Objects refilled to a sheaf */
> > +       SHEAF_SWAP,             /* Swapping main and spare sheaf */
> > +       SHEAF_ALLOC,            /* Allocation of an empty sheaf */
> > +       SHEAF_FREE,             /* Freeing of an empty sheaf */
> >         NR_SLUB_STAT_ITEMS
> >  };
> >
> > @@ -418,6 +426,35 @@ void stat_add(const struct kmem_cache *s, enum stat_item si, int v)
> >  #endif
> >  }
> >
> > +#define MAX_FULL_SHEAVES       10
> > +#define MAX_EMPTY_SHEAVES      10
> > +
> > +struct node_barn {
> > +       spinlock_t lock;
> > +       struct list_head sheaves_full;
> > +       struct list_head sheaves_empty;
> > +       unsigned int nr_full;
> > +       unsigned int nr_empty;
> > +};
> > +
> > +struct slab_sheaf {
> > +       union {
> > +               struct rcu_head rcu_head;
> > +               struct list_head barn_list;
> > +       };
> > +       struct kmem_cache *cache;
> > +       unsigned int size;
> > +       void *objects[];
> > +};
> > +
> > +struct slub_percpu_sheaves {
> > +       local_lock_t lock;
> > +       struct slab_sheaf *main; /* never NULL when unlocked */
> > +       struct slab_sheaf *spare; /* empty or full, may be NULL */
> > +       struct slab_sheaf *rcu_free;
>
> Would be nice to have a short comment for rcu_free as well. I could
> guess what main and spare are but for rcu_free had to look further.
>
> > +       struct node_barn *barn;
> > +};
> > +
> >  /*
> >   * The slab lists for all objects.
> >   */
> > @@ -430,6 +467,7 @@ struct kmem_cache_node {
> >         atomic_long_t total_objects;
> >         struct list_head full;
> >  #endif
> > +       struct node_barn *barn;
> >  };
> >
> >  static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
> > @@ -453,12 +491,19 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
> >   */
> >  static nodemask_t slab_nodes;
> >
> > -#ifndef CONFIG_SLUB_TINY
> >  /*
> >   * Workqueue used for flush_cpu_slab().
> >   */
> >  static struct workqueue_struct *flushwq;
> > -#endif
> > +
> > +struct slub_flush_work {
> > +       struct work_struct work;
> > +       struct kmem_cache *s;
> > +       bool skip;
> > +};
> > +
> > +static DEFINE_MUTEX(flush_lock);
> > +static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
> >
> >  /********************************************************************
> >   *                     Core slab cache functions
> > @@ -2410,6 +2455,349 @@ static void *setup_object(struct kmem_cache *s, void *object)
> >         return object;
> >  }
> >
> > +static struct slab_sheaf *alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp)
> > +{
> > +       struct slab_sheaf *sheaf = kzalloc(struct_size(sheaf, objects,
> > +                                       s->sheaf_capacity), gfp);
> > +
> > +       if (unlikely(!sheaf))
> > +               return NULL;
> > +
> > +       sheaf->cache = s;

Forgot to mention. I don't see sheaf->cache being used anywhere here.
I'll assume it's used in later patches...

> > +
> > +       stat(s, SHEAF_ALLOC);
> > +
> > +       return sheaf;
> > +}
> > +
> > +static void free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf)
> > +{
> > +       kfree(sheaf);
> > +
> > +       stat(s, SHEAF_FREE);
> > +}
> > +
> > +static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
> > +                                  size_t size, void **p);
> > +
> > +
> > +static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
> > +                        gfp_t gfp)
> > +{
> > +       int to_fill = s->sheaf_capacity - sheaf->size;
> > +       int filled;
> > +
> > +       if (!to_fill)
> > +               return 0;
> > +
> > +       filled = __kmem_cache_alloc_bulk(s, gfp, to_fill,
> > +                                        &sheaf->objects[sheaf->size]);
> > +
> > +       if (!filled)
> > +               return -ENOMEM;
> > +
> > +       sheaf->size = s->sheaf_capacity;
>
> nit: __kmem_cache_alloc_bulk() either allocates requested number of
> objects or returns 0, so the current code is fine but if at some point
> the implementation changes so that it can return smaller number of
> objects than requested (filled < to_fill) then the above assignment
> will become invalid. I think a safer thing here would be to just:
>
>        sheaf->size += filled;
>
> which also makes logical sense. Alternatively you could add
> VM_BUG_ON(filled != to_fill) but the increment I think would be
> better.
>
> > +
> > +       stat_add(s, SHEAF_REFILL, filled);
> > +
> > +       return 0;
> > +}
> > +
> > +
> > +static struct slab_sheaf *alloc_full_sheaf(struct kmem_cache *s, gfp_t gfp)
> > +{
> > +       struct slab_sheaf *sheaf = alloc_empty_sheaf(s, gfp);
> > +
> > +       if (!sheaf)
> > +               return NULL;
> > +
> > +       if (refill_sheaf(s, sheaf, gfp)) {
> > +               free_empty_sheaf(s, sheaf);
> > +               return NULL;
> > +       }
> > +
> > +       return sheaf;
> > +}
> > +
> > +/*
> > + * Maximum number of objects freed during a single flush of main pcs sheaf.
> > + * Translates directly to an on-stack array size.
> > + */
> > +#define PCS_BATCH_MAX  32U
> > +
> .> +static void __kmem_cache_free_bulk(struct kmem_cache *s, size_t
> size, void **p);
> > +
>
> A comment clarifying why you are freeing in PCS_BATCH_MAX batches here
> would be helpful. My understanding is that you do that to free objects
> outside of the cpu_sheaves->lock, so you isolate a batch, release the
> lock and then free the batch.
>
> > +static void sheaf_flush_main(struct kmem_cache *s)
> > +{
> > +       struct slub_percpu_sheaves *pcs;
> > +       unsigned int batch, remaining;
> > +       void *objects[PCS_BATCH_MAX];
> > +       struct slab_sheaf *sheaf;
> > +       unsigned long flags;
> > +
> > +next_batch:
> > +       local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> > +       pcs = this_cpu_ptr(s->cpu_sheaves);
> > +       sheaf = pcs->main;
> > +
> > +       batch = min(PCS_BATCH_MAX, sheaf->size);
> > +
> > +       sheaf->size -= batch;
> > +       memcpy(objects, sheaf->objects + sheaf->size, batch * sizeof(void *));
> > +
> > +       remaining = sheaf->size;
> > +
> > +       local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> > +
> > +       __kmem_cache_free_bulk(s, batch, &objects[0]);
> > +
> > +       stat_add(s, SHEAF_FLUSH_MAIN, batch);
> > +
> > +       if (remaining)
> > +               goto next_batch;
> > +}
> > +
>
> This function seems to be used against either isolated sheaves or in
> slub_cpu_dead() --> __pcs_flush_all_cpu() path where we hold
> slab_mutex and I think that guarantees that the sheaf is unused. Maybe
> a short comment clarifying this requirement or rename the function to
> reflect that? Something like flush_unused_sheaf()?
>
> > +static void sheaf_flush(struct kmem_cache *s, struct slab_sheaf *sheaf)
> > +{
> > +       if (!sheaf->size)
> > +               return;
> > +
> > +       stat_add(s, SHEAF_FLUSH_OTHER, sheaf->size);
> > +
> > +       __kmem_cache_free_bulk(s, sheaf->size, &sheaf->objects[0]);
> > +
> > +       sheaf->size = 0;
> > +}
> > +
> > +/*
> > + * Caller needs to make sure migration is disabled in order to fully flush
> > + * single cpu's sheaves
> > + *
> > + * flushing operations are rare so let's keep it simple and flush to slabs
> > + * directly, skipping the barn
> > + */
> > +static void pcs_flush_all(struct kmem_cache *s)
> > +{
> > +       struct slub_percpu_sheaves *pcs;
> > +       struct slab_sheaf *spare, *rcu_free;
> > +       unsigned long flags;
> > +
> > +       local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> > +       pcs = this_cpu_ptr(s->cpu_sheaves);
> > +
> > +       spare = pcs->spare;
> > +       pcs->spare = NULL;
> > +
> > +       rcu_free = pcs->rcu_free;
> > +       pcs->rcu_free = NULL;
> > +
> > +       local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> > +
> > +       if (spare) {
> > +               sheaf_flush(s, spare);
> > +               free_empty_sheaf(s, spare);
> > +       }
> > +
> > +       // TODO: handle rcu_free
> > +       BUG_ON(rcu_free);
> > +
> > +       sheaf_flush_main(s);
> > +}
> > +
> > +static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
> > +{
> > +       struct slub_percpu_sheaves *pcs;
> > +
> > +       pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> > +
> > +       if (pcs->spare) {
> > +               sheaf_flush(s, pcs->spare);
> > +               free_empty_sheaf(s, pcs->spare);
> > +               pcs->spare = NULL;
> > +       }
> > +
> > +       // TODO: handle rcu_free
> > +       BUG_ON(pcs->rcu_free);
> > +
> > +       sheaf_flush_main(s);
>
> Hmm. sheaf_flush_main() always flushes for this_cpu only, so IIUC this
> call will not necessarily flush the main sheaf for the cpu passed to
> __pcs_flush_all_cpu().
>
> > +}
> > +
> > +static void pcs_destroy(struct kmem_cache *s)
> > +{
> > +       int cpu;
> > +
> > +       for_each_possible_cpu(cpu) {
> > +               struct slub_percpu_sheaves *pcs;
> > +
> > +               pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> > +
> > +               /* can happen when unwinding failed create */
> > +               if (!pcs->main)
> > +                       continue;
> > +
> > +               WARN_ON(pcs->spare);
> > +               WARN_ON(pcs->rcu_free);
> > +
> > +               if (!WARN_ON(pcs->main->size)) {
> > +                       free_empty_sheaf(s, pcs->main);
> > +                       pcs->main = NULL;
> > +               }
> > +       }
> > +
> > +       free_percpu(s->cpu_sheaves);
> > +       s->cpu_sheaves = NULL;
> > +}
> > +
> > +static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn)
> > +{
> > +       struct slab_sheaf *empty = NULL;
> > +       unsigned long flags;
> > +
> > +       spin_lock_irqsave(&barn->lock, flags);
> > +
> > +       if (barn->nr_empty) {
> > +               empty = list_first_entry(&barn->sheaves_empty,
> > +                                        struct slab_sheaf, barn_list);
> > +               list_del(&empty->barn_list);
> > +               barn->nr_empty--;
> > +       }
> > +
> > +       spin_unlock_irqrestore(&barn->lock, flags);
> > +
> > +       return empty;
> > +}
> > +
> > +static int barn_put_empty_sheaf(struct node_barn *barn,
> > +                               struct slab_sheaf *sheaf, bool ignore_limit)
> > +{
> > +       unsigned long flags;
> > +       int ret = 0;
> > +
> > +       spin_lock_irqsave(&barn->lock, flags);
> > +
> > +       if (!ignore_limit && barn->nr_empty >= MAX_EMPTY_SHEAVES) {
> > +               ret = -E2BIG;
> > +       } else {
> > +               list_add(&sheaf->barn_list, &barn->sheaves_empty);
> > +               barn->nr_empty++;
> > +       }
> > +
> > +       spin_unlock_irqrestore(&barn->lock, flags);
> > +       return ret;
> > +}
> > +
> > +static int barn_put_full_sheaf(struct node_barn *barn, struct slab_sheaf *sheaf,
> > +                              bool ignore_limit)
> > +{
> > +       unsigned long flags;
> > +       int ret = 0;
> > +
> > +       spin_lock_irqsave(&barn->lock, flags);
> > +
> > +       if (!ignore_limit && barn->nr_full >= MAX_FULL_SHEAVES) {
> > +               ret = -E2BIG;
> > +       } else {
> > +               list_add(&sheaf->barn_list, &barn->sheaves_full);
> > +               barn->nr_full++;
> > +       }
> > +
> > +       spin_unlock_irqrestore(&barn->lock, flags);
> > +       return ret;
> > +}
> > +
> > +/*
> > + * If a full sheaf is available, return it and put the supplied empty one to
> > + * barn. We ignore the limit on empty sheaves as the number of sheaves doesn't
> > + * change.
> > + */
> > +static struct slab_sheaf *
> > +barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty)
> > +{
> > +       struct slab_sheaf *full = NULL;
> > +       unsigned long flags;
> > +
> > +       spin_lock_irqsave(&barn->lock, flags);
> > +
> > +       if (barn->nr_full) {
> > +               full = list_first_entry(&barn->sheaves_full, struct slab_sheaf,
> > +                                       barn_list);
> > +               list_del(&full->barn_list);
> > +               list_add(&empty->barn_list, &barn->sheaves_empty);
> > +               barn->nr_full--;
> > +               barn->nr_empty++;
> > +       }
> > +
> > +       spin_unlock_irqrestore(&barn->lock, flags);
> > +
> > +       return full;
> > +}
> > +/*
> > + * If a empty sheaf is available, return it and put the supplied full one to
> > + * barn. But if there are too many full sheaves, reject this with -E2BIG.
> > + */
> > +static struct slab_sheaf *
> > +barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
> > +{
> > +       struct slab_sheaf *empty;
> > +       unsigned long flags;
> > +
> > +       spin_lock_irqsave(&barn->lock, flags);
> > +
> > +       if (barn->nr_full >= MAX_FULL_SHEAVES) {
> > +               empty = ERR_PTR(-E2BIG);
> > +       } else if (!barn->nr_empty) {
> > +               empty = ERR_PTR(-ENOMEM);
> > +       } else {
> > +               empty = list_first_entry(&barn->sheaves_empty, struct slab_sheaf,
> > +                                        barn_list);
> > +               list_del(&empty->barn_list);
> > +               list_add(&full->barn_list, &barn->sheaves_full);
> > +               barn->nr_empty--;
> > +               barn->nr_full++;
> > +       }
> > +
> > +       spin_unlock_irqrestore(&barn->lock, flags);
> > +
> > +       return empty;
> > +}
> > +
> > +static void barn_init(struct node_barn *barn)
> > +{
> > +       spin_lock_init(&barn->lock);
> > +       INIT_LIST_HEAD(&barn->sheaves_full);
> > +       INIT_LIST_HEAD(&barn->sheaves_empty);
> > +       barn->nr_full = 0;
> > +       barn->nr_empty = 0;
> > +}
> > +
> > +static void barn_shrink(struct kmem_cache *s, struct node_barn *barn)
> > +{
> > +       struct list_head empty_list;
> > +       struct list_head full_list;
> > +       struct slab_sheaf *sheaf, *sheaf2;
> > +       unsigned long flags;
> > +
> > +       INIT_LIST_HEAD(&empty_list);
> > +       INIT_LIST_HEAD(&full_list);
> > +
> > +       spin_lock_irqsave(&barn->lock, flags);
> > +
> > +       list_splice_init(&barn->sheaves_full, &full_list);
> > +       barn->nr_full = 0;
> > +       list_splice_init(&barn->sheaves_empty, &empty_list);
> > +       barn->nr_empty = 0;
> > +
> > +       spin_unlock_irqrestore(&barn->lock, flags);
> > +
> > +       list_for_each_entry_safe(sheaf, sheaf2, &full_list, barn_list) {
> > +               sheaf_flush(s, sheaf);
> > +               list_move(&sheaf->barn_list, &empty_list);
> > +       }
> > +
> > +       list_for_each_entry_safe(sheaf, sheaf2, &empty_list, barn_list)
> > +               free_empty_sheaf(s, sheaf);
> > +}
> > +
> >  /*
> >   * Slab allocation and freeing
> >   */
> > @@ -3280,11 +3668,42 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
> >         put_partials_cpu(s, c);
> >  }
> >
> > -struct slub_flush_work {
> > -       struct work_struct work;
> > -       struct kmem_cache *s;
> > -       bool skip;
> > -};
> > +static inline void flush_this_cpu_slab(struct kmem_cache *s)
> > +{
> > +       struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
> > +
> > +       if (c->slab)
> > +               flush_slab(s, c);
> > +
> > +       put_partials(s);
> > +}
> > +
> > +static bool has_cpu_slab(int cpu, struct kmem_cache *s)
> > +{
> > +       struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
> > +
> > +       return c->slab || slub_percpu_partial(c);
> > +}
> > +
> > +#else /* CONFIG_SLUB_TINY */
> > +static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu) { }
> > +static inline bool has_cpu_slab(int cpu, struct kmem_cache *s) { return false; }
> > +static inline void flush_this_cpu_slab(struct kmem_cache *s) { }
> > +#endif /* CONFIG_SLUB_TINY */
> > +
> > +static bool has_pcs_used(int cpu, struct kmem_cache *s)
> > +{
> > +       struct slub_percpu_sheaves *pcs;
> > +
> > +       if (!s->cpu_sheaves)
> > +               return false;
> > +
> > +       pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> > +
> > +       return (pcs->spare || pcs->rcu_free || pcs->main->size);
> > +}
> > +
> > +static void pcs_flush_all(struct kmem_cache *s);
> >
> >  /*
> >   * Flush cpu slab.
> > @@ -3294,30 +3713,18 @@ struct slub_flush_work {
> >  static void flush_cpu_slab(struct work_struct *w)
> >  {
> >         struct kmem_cache *s;
> > -       struct kmem_cache_cpu *c;
> >         struct slub_flush_work *sfw;
> >
> >         sfw = container_of(w, struct slub_flush_work, work);
> >
> >         s = sfw->s;
> > -       c = this_cpu_ptr(s->cpu_slab);
> >
> > -       if (c->slab)
> > -               flush_slab(s, c);
> > -
> > -       put_partials(s);
> > -}
> > -
> > -static bool has_cpu_slab(int cpu, struct kmem_cache *s)
> > -{
> > -       struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
> > +       if (s->cpu_sheaves)
> > +               pcs_flush_all(s);
> >
> > -       return c->slab || slub_percpu_partial(c);
> > +       flush_this_cpu_slab(s);
> >  }
> >
> > -static DEFINE_MUTEX(flush_lock);
> > -static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
> > -
> >  static void flush_all_cpus_locked(struct kmem_cache *s)
> >  {
> >         struct slub_flush_work *sfw;
> > @@ -3328,7 +3735,7 @@ static void flush_all_cpus_locked(struct kmem_cache *s)
> >
> >         for_each_online_cpu(cpu) {
> >                 sfw = &per_cpu(slub_flush, cpu);
> > -               if (!has_cpu_slab(cpu, s)) {
> > +               if (!has_cpu_slab(cpu, s) && !has_pcs_used(cpu, s)) {
> >                         sfw->skip = true;
> >                         continue;
> >                 }
> > @@ -3364,19 +3771,14 @@ static int slub_cpu_dead(unsigned int cpu)
> >         struct kmem_cache *s;
> >
> >         mutex_lock(&slab_mutex);
> > -       list_for_each_entry(s, &slab_caches, list)
> > +       list_for_each_entry(s, &slab_caches, list) {
> >                 __flush_cpu_slab(s, cpu);
> > +               __pcs_flush_all_cpu(s, cpu);
> > +       }
> >         mutex_unlock(&slab_mutex);
> >         return 0;
> >  }
> >
> > -#else /* CONFIG_SLUB_TINY */
> > -static inline void flush_all_cpus_locked(struct kmem_cache *s) { }
> > -static inline void flush_all(struct kmem_cache *s) { }
> > -static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu) { }
> > -static inline int slub_cpu_dead(unsigned int cpu) { return 0; }
> > -#endif /* CONFIG_SLUB_TINY */
> > -
> >  /*
> >   * Check if the objects in a per cpu structure fit numa
> >   * locality expectations.
> > @@ -4126,6 +4528,173 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
> >         return memcg_slab_post_alloc_hook(s, lru, flags, size, p);
> >  }
> >
> > +static __fastpath_inline
> > +void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
> > +{
> > +       struct slub_percpu_sheaves *pcs;
> > +       unsigned long flags;
> > +       void *object;
> > +
> > +       local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> > +       pcs = this_cpu_ptr(s->cpu_sheaves);
> > +
> > +       if (unlikely(pcs->main->size == 0)) {
> > +
> > +               struct slab_sheaf *empty = NULL;
> > +               struct slab_sheaf *full;
> > +               bool can_alloc;
> > +
> > +               if (pcs->spare && pcs->spare->size > 0) {
> > +                       stat(s, SHEAF_SWAP);
> > +                       swap(pcs->main, pcs->spare);
> > +                       goto do_alloc;
> > +               }
> > +
> > +               full = barn_replace_empty_sheaf(pcs->barn, pcs->main);
> > +
> > +               if (full) {
> > +                       pcs->main = full;
> > +                       goto do_alloc;
> > +               }
> > +
> > +               can_alloc = gfpflags_allow_blocking(gfp);
> > +
> > +               if (can_alloc) {
> > +                       if (pcs->spare) {
> > +                               empty = pcs->spare;
> > +                               pcs->spare = NULL;
> > +                       } else {
> > +                               empty = barn_get_empty_sheaf(pcs->barn);
> > +                       }
> > +               }
> > +
> > +               local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> > +
> > +               if (!can_alloc)
> > +                       return NULL;
> > +
> > +               if (empty) {
> > +                       if (!refill_sheaf(s, empty, gfp)) {
> > +                               full = empty;
> > +                       } else {
> > +                               /*
> > +                                * we must be very low on memory so don't bother
> > +                                * with the barn
> > +                                */
> > +                               free_empty_sheaf(s, empty);
> > +                       }
> > +               } else {
> > +                       full = alloc_full_sheaf(s, gfp);
> > +               }
> > +
> > +               if (!full)
> > +                       return NULL;
> > +
> > +               local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> > +               pcs = this_cpu_ptr(s->cpu_sheaves);
> > +
> > +               /*
> > +                * If we are returning empty sheaf, we either got it from the
> > +                * barn or had to allocate one. If we are returning a full
> > +                * sheaf, it's due to racing or being migrated to a different
> > +                * cpu. Breaching the barn's sheaf limits should be thus rare
> > +                * enough so just ignore them to simplify the recovery.
> > +                */
> > +
> > +               if (pcs->main->size == 0) {
> > +                       barn_put_empty_sheaf(pcs->barn, pcs->main, true);
> > +                       pcs->main = full;
> > +                       goto do_alloc;
> > +               }
> > +
> > +               if (!pcs->spare) {
> > +                       pcs->spare = full;
> > +                       goto do_alloc;
> > +               }
> > +
> > +               if (pcs->spare->size == 0) {
> > +                       barn_put_empty_sheaf(pcs->barn, pcs->spare, true);
> > +                       pcs->spare = full;
> > +                       goto do_alloc;
> > +               }
> > +
> > +               barn_put_full_sheaf(pcs->barn, full, true);
> > +       }
> > +
> > +do_alloc:
> > +       object = pcs->main->objects[--pcs->main->size];
> > +
> > +       local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> > +
> > +       stat(s, ALLOC_PCS);
> > +
> > +       return object;
> > +}
> > +
> > +static __fastpath_inline
> > +unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
> > +{
> > +       struct slub_percpu_sheaves *pcs;
> > +       struct slab_sheaf *main;
> > +       unsigned long flags;
> > +       unsigned int allocated = 0;
> > +       unsigned int batch;
> > +
> > +next_batch:
> > +       local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> > +       pcs = this_cpu_ptr(s->cpu_sheaves);
> > +
> > +       if (unlikely(pcs->main->size == 0)) {
> > +
> > +               struct slab_sheaf *full;
> > +
> > +               if (pcs->spare && pcs->spare->size > 0) {
> > +                       stat(s, SHEAF_SWAP);
> > +                       swap(pcs->main, pcs->spare);
> > +                       goto do_alloc;
> > +               }
> > +
> > +               full = barn_replace_empty_sheaf(pcs->barn, pcs->main);
> > +
> > +               if (full) {
> > +                       pcs->main = full;
> > +                       goto do_alloc;
> > +               }
> > +
> > +               local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> > +
> > +               /*
> > +                * Once full sheaves in barn are depleted, let the bulk
> > +                * allocation continue from slab pages, otherwise we would just
> > +                * be copying arrays of pointers twice.
> > +                */
> > +               return allocated;
> > +       }
> > +
> > +do_alloc:
> > +
> > +       main = pcs->main;
> > +       batch = min(size, main->size);
> > +
> > +       main->size -= batch;
> > +       memcpy(p, main->objects + main->size, batch * sizeof(void *));
> > +
> > +       local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> > +
> > +       stat_add(s, ALLOC_PCS, batch);
> > +
> > +       allocated += batch;
> > +
> > +       if (batch < size) {
> > +               p += batch;
> > +               size -= batch;
> > +               goto next_batch;
> > +       }
> > +
> > +       return allocated;
> > +}
> > +
> > +
> >  /*
> >   * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
> >   * have the fastpath folded into their functions. So no function call
> > @@ -4150,7 +4719,11 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
> >         if (unlikely(object))
> >                 goto out;
> >
> > -       object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
> > +       if (s->cpu_sheaves && (node == NUMA_NO_NODE))
> > +               object = alloc_from_pcs(s, gfpflags);
> > +
> > +       if (!object)
> > +               object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
> >
> >         maybe_wipe_obj_freeptr(s, object);
> >         init = slab_want_init_on_alloc(gfpflags, s);
> > @@ -4521,6 +5094,196 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> >         discard_slab(s, slab);
> >  }
> >
> > +/*
> > + * Free an object to the percpu sheaves.
> > + * The object is expected to have passed slab_free_hook() already.
> > + */
> > +static __fastpath_inline
> > +void free_to_pcs(struct kmem_cache *s, void *object)
> > +{
> > +       struct slub_percpu_sheaves *pcs;
> > +       unsigned long flags;
> > +
> > +restart:
> > +       local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> > +       pcs = this_cpu_ptr(s->cpu_sheaves);
> > +
> > +       if (unlikely(pcs->main->size == s->sheaf_capacity)) {
> > +
> > +               struct slab_sheaf *empty;
> > +
> > +               if (!pcs->spare) {
> > +                       empty = barn_get_empty_sheaf(pcs->barn);
> > +                       if (empty) {
> > +                               pcs->spare = pcs->main;
> > +                               pcs->main = empty;
> > +                               goto do_free;
> > +                       }
> > +                       goto alloc_empty;
> > +               }
> > +
> > +               if (pcs->spare->size < s->sheaf_capacity) {
> > +                       stat(s, SHEAF_SWAP);
> > +                       swap(pcs->main, pcs->spare);
> > +                       goto do_free;
> > +               }
> > +
> > +               empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
> > +
> > +               if (!IS_ERR(empty)) {
> > +                       pcs->main = empty;
> > +                       goto do_free;
> > +               }
> > +
> > +               if (PTR_ERR(empty) == -E2BIG) {
> > +                       /* Since we got here, spare exists and is full */
> > +                       struct slab_sheaf *to_flush = pcs->spare;
> > +
> > +                       pcs->spare = NULL;
> > +                       local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> > +
> > +                       sheaf_flush(s, to_flush);
> > +                       empty = to_flush;
> > +                       goto got_empty;
> > +               }
> > +
> > +alloc_empty:
> > +               local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> > +
> > +               empty = alloc_empty_sheaf(s, GFP_NOWAIT);
> > +
> > +               if (!empty) {
> > +                       sheaf_flush_main(s);
> > +                       goto restart;
> > +               }
> > +
> > +got_empty:
> > +               local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> > +               pcs = this_cpu_ptr(s->cpu_sheaves);
> > +
> > +               /*
> > +                * if we put any sheaf to barn here, it's because we raced or
> > +                * have been migrated to a different cpu, which should be rare
> > +                * enough so just ignore the barn's limits to simplify
> > +                */
> > +               if (unlikely(pcs->main->size < s->sheaf_capacity)) {
> > +                       if (!pcs->spare)
> > +                               pcs->spare = empty;
> > +                       else
> > +                               barn_put_empty_sheaf(pcs->barn, empty, true);
> > +                       goto do_free;
> > +               }
> > +
> > +               if (!pcs->spare) {
> > +                       pcs->spare = pcs->main;
> > +                       pcs->main = empty;
> > +                       goto do_free;
> > +               }
> > +
> > +               barn_put_full_sheaf(pcs->barn, pcs->main, true);
> > +               pcs->main = empty;
>
> I find the program flow in this function quite complex and hard to
> follow. I think refactoring the above block starting from "pcs =
> this_cpu_ptr(s->cpu_sheaves)" would somewhat simplify it. That
> eliminates the need for the "got_empty" label and makes the
> locking/unlocking sequence of s->cpu_sheaves->lock a bit more clear.
>
> > +       }
> > +
> > +do_free:
> > +       pcs->main->objects[pcs->main->size++] = object;
> > +
> > +       local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> > +
> > +       stat(s, FREE_PCS);
> > +}
> > +
> > +/*
> > + * Bulk free objects to the percpu sheaves.
> > + * Unlike free_to_pcs() this includes the calls to all necessary hooks
> > + * and the fallback to freeing to slab pages.
> > + */
> > +static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
> > +{
> > +       struct slub_percpu_sheaves *pcs;
> > +       struct slab_sheaf *main;
> > +       unsigned long flags;
> > +       unsigned int batch, i = 0;
> > +       bool init;
> > +
> > +       init = slab_want_init_on_free(s);
> > +
> > +       while (i < size) {
> > +               struct slab *slab = virt_to_slab(p[i]);
> > +
> > +               memcg_slab_free_hook(s, slab, p + i, 1);
> > +               alloc_tagging_slab_free_hook(s, slab, p + i, 1);
> > +
> > +               if (unlikely(!slab_free_hook(s, p[i], init, false))) {
> > +                       p[i] = p[--size];
> > +                       if (!size)
> > +                               return;
> > +                       continue;
> > +               }
> > +
> > +               i++;
> > +       }
> > +
> > +next_batch:
> > +       local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> > +       pcs = this_cpu_ptr(s->cpu_sheaves);
> > +
> > +       if (unlikely(pcs->main->size == s->sheaf_capacity)) {
> > +
> > +               struct slab_sheaf *empty;
> > +
> > +               if (!pcs->spare) {
> > +                       empty = barn_get_empty_sheaf(pcs->barn);
> > +                       if (empty) {
> > +                               pcs->spare = pcs->main;
> > +                               pcs->main = empty;
> > +                               goto do_free;
> > +                       }
> > +                       goto no_empty;
> > +               }
> > +
> > +               if (pcs->spare->size < s->sheaf_capacity) {
> > +                       stat(s, SHEAF_SWAP);
> > +                       swap(pcs->main, pcs->spare);
> > +                       goto do_free;
> > +               }
> > +
> > +               empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
> > +
> > +               if (!IS_ERR(empty)) {
> > +                       pcs->main = empty;
> > +                       goto do_free;
> > +               }
> > +
> > +no_empty:
> > +               local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> > +
> > +               /*
> > +                * if we depleted all empty sheaves in the barn or there are too
> > +                * many full sheaves, free the rest to slab pages
> > +                */
> > +
> > +               __kmem_cache_free_bulk(s, size, p);
> > +               return;
> > +       }
> > +
> > +do_free:
> > +       main = pcs->main;
> > +       batch = min(size, s->sheaf_capacity - main->size);
> > +
> > +       memcpy(main->objects + main->size, p, batch * sizeof(void *));
> > +       main->size += batch;
> > +
> > +       local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> > +
> > +       stat_add(s, FREE_PCS, batch);
> > +
> > +       if (batch < size) {
> > +               p += batch;
> > +               size -= batch;
> > +               goto next_batch;
> > +       }
> > +}
> > +
> >  #ifndef CONFIG_SLUB_TINY
> >  /*
> >   * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
> > @@ -4607,7 +5370,12 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
> >         memcg_slab_free_hook(s, slab, &object, 1);
> >         alloc_tagging_slab_free_hook(s, slab, &object, 1);
> >
> > -       if (likely(slab_free_hook(s, object, slab_want_init_on_free(s), false)))
> > +       if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
> > +               return;
> > +
> > +       if (s->cpu_sheaves)
> > +               free_to_pcs(s, object);
> > +       else
> >                 do_slab_free(s, slab, object, object, 1, addr);
> >  }
> >
> > @@ -5033,6 +5801,15 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
> >         if (!size)
> >                 return;
> >
> > +       /*
> > +        * freeing to sheaves is so incompatible with the detached freelist so
> > +        * once we go that way, we have to do everything differently
> > +        */
> > +       if (s && s->cpu_sheaves) {
> > +               free_to_pcs_bulk(s, size, p);
> > +               return;
> > +       }
> > +
> >         do {
> >                 struct detached_freelist df;
> >
> > @@ -5151,7 +5928,7 @@ static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
> >  int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
> >                                  void **p)
> >  {
> > -       int i;
> > +       unsigned int i = 0;
> >
> >         if (!size)
> >                 return 0;
> > @@ -5160,9 +5937,21 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
> >         if (unlikely(!s))
> >                 return 0;
> >
> > -       i = __kmem_cache_alloc_bulk(s, flags, size, p);
> > -       if (unlikely(i == 0))
> > -               return 0;
> > +       if (s->cpu_sheaves)
> > +               i = alloc_from_pcs_bulk(s, size, p);
> > +
> > +       if (i < size) {
> > +               unsigned int j = __kmem_cache_alloc_bulk(s, flags, size - i, p + i);
> > +               /*
> > +                * If we ran out of memory, don't bother with freeing back to
> > +                * the percpu sheaves, we have bigger problems.
> > +                */
> > +               if (unlikely(j == 0)) {
> > +                       if (i > 0)
> > +                               __kmem_cache_free_bulk(s, i, p);
> > +                       return 0;
> > +               }
> > +       }
> >
> >         /*
> >          * memcg and kmem_cache debug support and memory initialization.
> > @@ -5172,11 +5961,11 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
> >                     slab_want_init_on_alloc(flags, s), s->object_size))) {
> >                 return 0;
> >         }
> > -       return i;
> > +
> > +       return size;
> >  }
> >  EXPORT_SYMBOL(kmem_cache_alloc_bulk_noprof);
> >
> > -
> >  /*
> >   * Object placement in a slab is made very easy because we always start at
> >   * offset 0. If we tune the size of the object to the alignment then we can
> > @@ -5309,8 +6098,8 @@ static inline int calculate_order(unsigned int size)
> >         return -ENOSYS;
> >  }
> >
> > -static void
> > -init_kmem_cache_node(struct kmem_cache_node *n)
> > +static bool
> > +init_kmem_cache_node(struct kmem_cache_node *n, struct node_barn *barn)
> >  {
> >         n->nr_partial = 0;
> >         spin_lock_init(&n->list_lock);
> > @@ -5320,6 +6109,11 @@ init_kmem_cache_node(struct kmem_cache_node *n)
> >         atomic_long_set(&n->total_objects, 0);
> >         INIT_LIST_HEAD(&n->full);
> >  #endif
> > +       n->barn = barn;
> > +       if (barn)
> > +               barn_init(barn);
> > +
> > +       return true;
> >  }
> >
> >  #ifndef CONFIG_SLUB_TINY
> > @@ -5350,6 +6144,30 @@ static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
> >  }
> >  #endif /* CONFIG_SLUB_TINY */
> >
> > +static int init_percpu_sheaves(struct kmem_cache *s)
> > +{
> > +       int cpu;
> > +
> > +       for_each_possible_cpu(cpu) {
> > +               struct slub_percpu_sheaves *pcs;
> > +               int nid;
> > +
> > +               pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> > +
> > +               local_lock_init(&pcs->lock);
> > +
> > +               nid = cpu_to_mem(cpu);
> > +
> > +               pcs->barn = get_node(s, nid)->barn;
> > +               pcs->main = alloc_empty_sheaf(s, GFP_KERNEL);
> > +
> > +               if (!pcs->main)
> > +                       return -ENOMEM;
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> >  static struct kmem_cache *kmem_cache_node;
> >
> >  /*
> > @@ -5385,7 +6203,7 @@ static void early_kmem_cache_node_alloc(int node)
> >         slab->freelist = get_freepointer(kmem_cache_node, n);
> >         slab->inuse = 1;
> >         kmem_cache_node->node[node] = n;
> > -       init_kmem_cache_node(n);
> > +       init_kmem_cache_node(n, NULL);
> >         inc_slabs_node(kmem_cache_node, node, slab->objects);
> >
> >         /*
> > @@ -5401,6 +6219,13 @@ static void free_kmem_cache_nodes(struct kmem_cache *s)
> >         struct kmem_cache_node *n;
> >
> >         for_each_kmem_cache_node(s, node, n) {
> > +               if (n->barn) {
> > +                       WARN_ON(n->barn->nr_full);
> > +                       WARN_ON(n->barn->nr_empty);
> > +                       kfree(n->barn);
> > +                       n->barn = NULL;
> > +               }
> > +
> >                 s->node[node] = NULL;
> >                 kmem_cache_free(kmem_cache_node, n);
> >         }
> > @@ -5409,6 +6234,8 @@ static void free_kmem_cache_nodes(struct kmem_cache *s)
> >  void __kmem_cache_release(struct kmem_cache *s)
> >  {
> >         cache_random_seq_destroy(s);
> > +       if (s->cpu_sheaves)
> > +               pcs_destroy(s);
> >  #ifndef CONFIG_SLUB_TINY
> >         free_percpu(s->cpu_slab);
> >  #endif
> > @@ -5421,20 +6248,27 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
> >
> >         for_each_node_mask(node, slab_nodes) {
> >                 struct kmem_cache_node *n;
> > +               struct node_barn *barn = NULL;
> >
> >                 if (slab_state == DOWN) {
> >                         early_kmem_cache_node_alloc(node);
> >                         continue;
> >                 }
> > +
> > +               if (s->cpu_sheaves) {
> > +                       barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, node);
> > +
> > +                       if (!barn)
> > +                               return 0;
> > +               }
> > +
> >                 n = kmem_cache_alloc_node(kmem_cache_node,
> >                                                 GFP_KERNEL, node);
> > -
> > -               if (!n) {
> > -                       free_kmem_cache_nodes(s);
> > +               if (!n)
> >                         return 0;
> > -               }
> >
> > -               init_kmem_cache_node(n);
> > +               init_kmem_cache_node(n, barn);
> > +
> >                 s->node[node] = n;
> >         }
> >         return 1;
> > @@ -5690,6 +6524,8 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
> >         flush_all_cpus_locked(s);
> >         /* Attempt to free all objects */
> >         for_each_kmem_cache_node(s, node, n) {
> > +               if (n->barn)
> > +                       barn_shrink(s, n->barn);
> >                 free_partial(s, n);
> >                 if (n->nr_partial || node_nr_slabs(n))
> >                         return 1;
> > @@ -5893,6 +6729,9 @@ static int __kmem_cache_do_shrink(struct kmem_cache *s)
> >                 for (i = 0; i < SHRINK_PROMOTE_MAX; i++)
> >                         INIT_LIST_HEAD(promote + i);
> >
> > +               if (n->barn)
> > +                       barn_shrink(s, n->barn);
> > +
> >                 spin_lock_irqsave(&n->list_lock, flags);
> >
> >                 /*
> > @@ -6005,12 +6844,24 @@ static int slab_mem_going_online_callback(void *arg)
> >          */
> >         mutex_lock(&slab_mutex);
> >         list_for_each_entry(s, &slab_caches, list) {
> > +               struct node_barn *barn = NULL;
> > +
> >                 /*
> >                  * The structure may already exist if the node was previously
> >                  * onlined and offlined.
> >                  */
> >                 if (get_node(s, nid))
> >                         continue;
> > +
> > +               if (s->cpu_sheaves) {
> > +                       barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, nid);
> > +
> > +                       if (!barn) {
> > +                               ret = -ENOMEM;
> > +                               goto out;
> > +                       }
> > +               }
> > +
> >                 /*
> >                  * XXX: kmem_cache_alloc_node will fallback to other nodes
> >                  *      since memory is not yet available from the node that
> > @@ -6021,7 +6872,9 @@ static int slab_mem_going_online_callback(void *arg)
> >                         ret = -ENOMEM;
> >                         goto out;
> >                 }
> > -               init_kmem_cache_node(n);
> > +
> > +               init_kmem_cache_node(n, barn);
> > +
> >                 s->node[nid] = n;
> >         }
> >         /*
> > @@ -6240,6 +7093,16 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
> >
> >         set_cpu_partial(s);
> >
> > +       if (args->sheaf_capacity) {
> > +               s->cpu_sheaves = alloc_percpu(struct slub_percpu_sheaves);
> > +               if (!s->cpu_sheaves) {
> > +                       err = -ENOMEM;
> > +                       goto out;
> > +               }
> > +               // TODO: increase capacity to grow slab_sheaf up to next kmalloc size?
> > +               s->sheaf_capacity = args->sheaf_capacity;
> > +       }
> > +
> >  #ifdef CONFIG_NUMA
> >         s->remote_node_defrag_ratio = 1000;
> >  #endif
> > @@ -6256,6 +7119,12 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
> >         if (!alloc_kmem_cache_cpus(s))
> >                 goto out;
> >
> > +       if (s->cpu_sheaves) {
> > +               err = init_percpu_sheaves(s);
> > +               if (err)
> > +                       goto out;
> > +       }
> > +
> >         err = 0;
> >
> >         /* Mutex is not taken during early boot */
> > @@ -6277,7 +7146,6 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
> >                 __kmem_cache_release(s);
> >         return err;
> >  }
> > -
> >  #ifdef SLAB_SUPPORTS_SYSFS
> >  static int count_inuse(struct slab *slab)
> >  {
> > @@ -7055,8 +7923,10 @@ static ssize_t text##_store(struct kmem_cache *s,                \
> >  }                                                              \
> >  SLAB_ATTR(text);                                               \
> >
> > +STAT_ATTR(ALLOC_PCS, alloc_cpu_sheaf);
> >  STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
> >  STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
> > +STAT_ATTR(FREE_PCS, free_cpu_sheaf);
> >  STAT_ATTR(FREE_FASTPATH, free_fastpath);
> >  STAT_ATTR(FREE_SLOWPATH, free_slowpath);
> >  STAT_ATTR(FREE_FROZEN, free_frozen);
> > @@ -7081,6 +7951,12 @@ STAT_ATTR(CPU_PARTIAL_ALLOC, cpu_partial_alloc);
> >  STAT_ATTR(CPU_PARTIAL_FREE, cpu_partial_free);
> >  STAT_ATTR(CPU_PARTIAL_NODE, cpu_partial_node);
> >  STAT_ATTR(CPU_PARTIAL_DRAIN, cpu_partial_drain);
> > +STAT_ATTR(SHEAF_FLUSH_MAIN, sheaf_flush_main);
> > +STAT_ATTR(SHEAF_FLUSH_OTHER, sheaf_flush_other);
> > +STAT_ATTR(SHEAF_REFILL, sheaf_refill);
> > +STAT_ATTR(SHEAF_SWAP, sheaf_swap);
> > +STAT_ATTR(SHEAF_ALLOC, sheaf_alloc);
> > +STAT_ATTR(SHEAF_FREE, sheaf_free);
> >  #endif /* CONFIG_SLUB_STATS */
> >
> >  #ifdef CONFIG_KFENCE
> > @@ -7142,8 +8018,10 @@ static struct attribute *slab_attrs[] = {
> >         &remote_node_defrag_ratio_attr.attr,
> >  #endif
> >  #ifdef CONFIG_SLUB_STATS
> > +       &alloc_cpu_sheaf_attr.attr,
> >         &alloc_fastpath_attr.attr,
> >         &alloc_slowpath_attr.attr,
> > +       &free_cpu_sheaf_attr.attr,
> >         &free_fastpath_attr.attr,
> >         &free_slowpath_attr.attr,
> >         &free_frozen_attr.attr,
> > @@ -7168,6 +8046,12 @@ static struct attribute *slab_attrs[] = {
> >         &cpu_partial_free_attr.attr,
> >         &cpu_partial_node_attr.attr,
> >         &cpu_partial_drain_attr.attr,
> > +       &sheaf_flush_main_attr.attr,
> > +       &sheaf_flush_other_attr.attr,
> > +       &sheaf_refill_attr.attr,
> > +       &sheaf_swap_attr.attr,
> > +       &sheaf_alloc_attr.attr,
> > +       &sheaf_free_attr.attr,
> >  #endif
> >  #ifdef CONFIG_FAILSLAB
> >         &failslab_attr.attr,
> >
> > --
> > 2.48.1
> >


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 02/10] slab: add sheaf support for batching kfree_rcu() operations
  2025-02-14 16:27 ` [PATCH RFC v2 02/10] slab: add sheaf support for batching kfree_rcu() operations Vlastimil Babka
@ 2025-02-22 23:08   ` Suren Baghdasaryan
  2025-03-12 16:19     ` Vlastimil Babka
  2025-02-24  8:40   ` Harry Yoo
  1 sibling, 1 reply; 55+ messages in thread
From: Suren Baghdasaryan @ 2025-02-22 23:08 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Fri, Feb 14, 2025 at 8:27 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
> For caches with sheaves, on each cpu maintain a rcu_free sheaf in
> addition to main and spare sheaves.
>
> kfree_rcu() operations will try to put objects on this sheaf. Once full,
> the sheaf is detached and submitted to call_rcu() with a handler that
> will try to put in in the barn, or flush to slab pages using bulk free,

s/in in/it in

> when the barn is full. Then a new empty sheaf must be obtained to put
> more objects there.
>
> It's possible that no free sheaves are available to use for a new
> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
> kfree_rcu() machinery.
>
> Expected advantages:
> - batching the kfree_rcu() operations, that could eventually replace the
>   existing batching
> - sheaves can be reused for allocations via barn instead of being
>   flushed to slabs, which is more efficient
>   - this includes cases where only some cpus are allowed to process rcu
>     callbacks (Android)
>
> Possible disadvantage:
> - objects might be waiting for more than their grace period (it is
>   determined by the last object freed into the sheaf), increasing memory
>   usage - but the existing batching does that too?
>
> Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
> implementation favors smaller memory footprint over performance.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  mm/slab.h        |   2 +
>  mm/slab_common.c |  21 ++++++++
>  mm/slub.c        | 151 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 170 insertions(+), 4 deletions(-)
>
> diff --git a/mm/slab.h b/mm/slab.h
> index 8daaec53b6ecfc44171191d421adb12e5cba2c58..94e9959e1aefa350d3d74e3f5309fde7a5cf2ec8 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -459,6 +459,8 @@ static inline bool is_kmalloc_normal(struct kmem_cache *s)
>         return !(s->flags & (SLAB_CACHE_DMA|SLAB_ACCOUNT|SLAB_RECLAIM_ACCOUNT));
>  }
>
> +bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj);
> +
>  /* Legal flag mask for kmem_cache_create(), for various configurations */
>  #define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | \
>                          SLAB_CACHE_DMA32 | SLAB_PANIC | \
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index ceeefb287899a82f30ad79b403556001c1860311..c6853450ed74160cfcb497c09f92c1f9f7b12629 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1613,6 +1613,24 @@ static void kfree_rcu_work(struct work_struct *work)
>                 kvfree_rcu_list(head);
>  }
>
> +static bool kfree_rcu_sheaf(void *obj)
> +{
> +       struct kmem_cache *s;
> +       struct folio *folio;
> +       struct slab *slab;
> +
> +       folio = virt_to_folio(obj);
> +       if (unlikely(!folio_test_slab(folio)))
> +               return false;
> +
> +       slab = folio_slab(folio);
> +       s = slab->slab_cache;
> +       if (s->cpu_sheaves)
> +               return __kfree_rcu_sheaf(s, obj);
> +
> +       return false;
> +}
> +
>  static bool
>  need_offload_krc(struct kfree_rcu_cpu *krcp)
>  {
> @@ -1957,6 +1975,9 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
>         if (!head)
>                 might_sleep();
>
> +       if (kfree_rcu_sheaf(ptr))
> +               return;
> +
>         // Queue the object but don't yet schedule the batch.
>         if (debug_rcu_head_queue(ptr)) {
>                 // Probable double kfree_rcu(), just leak.
> diff --git a/mm/slub.c b/mm/slub.c
> index c06734912972b799f537359f7fe6a750918ffe9e..40175747212fefb27137309b27571abe8d0966e2 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -350,6 +350,8 @@ enum stat_item {
>         ALLOC_FASTPATH,         /* Allocation from cpu slab */
>         ALLOC_SLOWPATH,         /* Allocation by getting a new cpu slab */
>         FREE_PCS,               /* Free to percpu sheaf */
> +       FREE_RCU_SHEAF,         /* Free to rcu_free sheaf */
> +       FREE_RCU_SHEAF_FAIL,    /* Failed to free to a rcu_free sheaf */
>         FREE_FASTPATH,          /* Free to cpu slab */
>         FREE_SLOWPATH,          /* Freeing not to cpu slab */
>         FREE_FROZEN,            /* Freeing to frozen slab */
> @@ -2569,6 +2571,24 @@ static void sheaf_flush(struct kmem_cache *s, struct slab_sheaf *sheaf)
>         sheaf->size = 0;
>  }
>
> +static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
> +                                    struct slab_sheaf *sheaf);
> +
> +static void rcu_free_sheaf_nobarn(struct rcu_head *head)
> +{
> +       struct slab_sheaf *sheaf;
> +       struct kmem_cache *s;
> +
> +       sheaf = container_of(head, struct slab_sheaf, rcu_head);
> +       s = sheaf->cache;

Ah, that's where you are using sheaf->cache. Maybe you should
introduce it in this patch?

> +
> +       __rcu_free_sheaf_prepare(s, sheaf);
> +
> +       sheaf_flush(s, sheaf);
> +
> +       free_empty_sheaf(s, sheaf);
> +}
> +
>  /*
>   * Caller needs to make sure migration is disabled in order to fully flush
>   * single cpu's sheaves
> @@ -2598,8 +2618,8 @@ static void pcs_flush_all(struct kmem_cache *s)
>                 free_empty_sheaf(s, spare);
>         }
>
> -       // TODO: handle rcu_free
> -       BUG_ON(rcu_free);
> +       if (rcu_free)
> +               call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
>
>         sheaf_flush_main(s);
>  }
> @@ -2616,8 +2636,10 @@ static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
>                 pcs->spare = NULL;
>         }
>
> -       // TODO: handle rcu_free
> -       BUG_ON(pcs->rcu_free);
> +       if (pcs->rcu_free) {
> +               call_rcu(&pcs->rcu_free->rcu_head, rcu_free_sheaf_nobarn);
> +               pcs->rcu_free = NULL;
> +       }
>
>         sheaf_flush_main(s);
>  }
> @@ -5192,6 +5214,118 @@ void free_to_pcs(struct kmem_cache *s, void *object)
>         stat(s, FREE_PCS);
>  }
>
> +static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
> +                                    struct slab_sheaf *sheaf)
> +{
> +       bool init = slab_want_init_on_free(s);
> +       void **p = &sheaf->objects[0];
> +       unsigned int i = 0;
> +
> +       while (i < sheaf->size) {
> +               struct slab *slab = virt_to_slab(p[i]);
> +
> +               memcg_slab_free_hook(s, slab, p + i, 1);
> +               alloc_tagging_slab_free_hook(s, slab, p + i, 1);
> +
> +               if (unlikely(!slab_free_hook(s, p[i], init, false))) {
> +                       p[i] = p[--sheaf->size];
> +                       continue;
> +               }
> +
> +               i++;
> +       }
> +}
> +
> +static void rcu_free_sheaf(struct rcu_head *head)
> +{
> +       struct slab_sheaf *sheaf;
> +       struct node_barn *barn;
> +       struct kmem_cache *s;
> +
> +       sheaf = container_of(head, struct slab_sheaf, rcu_head);
> +
> +       s = sheaf->cache;
> +
> +       __rcu_free_sheaf_prepare(s, sheaf);
> +
> +       barn = get_node(s, numa_mem_id())->barn;
> +
> +       /* due to slab_free_hook() */
> +       if (unlikely(sheaf->size == 0))
> +               goto empty;
> +
> +       if (!barn_put_full_sheaf(barn, sheaf, false))
> +               return;
> +
> +       sheaf_flush(s, sheaf);
> +
> +empty:
> +       if (!barn_put_empty_sheaf(barn, sheaf, false))
> +               return;
> +
> +       free_empty_sheaf(s, sheaf);
> +}
> +
> +bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +       struct slab_sheaf *rcu_sheaf;
> +       unsigned long flags;
> +
> +       local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> +       pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +       if (unlikely(!pcs->rcu_free)) {
> +
> +               struct slab_sheaf *empty;
> +
> +               empty = barn_get_empty_sheaf(pcs->barn);
> +
> +               if (empty) {
> +                       pcs->rcu_free = empty;
> +                       goto do_free;
> +               }
> +
> +               local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +
> +               empty = alloc_empty_sheaf(s, GFP_NOWAIT);
> +
> +               if (!empty) {
> +                       stat(s, FREE_RCU_SHEAF_FAIL);
> +                       return false;
> +               }
> +
> +               local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> +               pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +               if (unlikely(pcs->rcu_free))
> +                       barn_put_empty_sheaf(pcs->barn, empty, true);
> +               else
> +                       pcs->rcu_free = empty;
> +       }
> +
> +do_free:
> +
> +       rcu_sheaf = pcs->rcu_free;
> +
> +       rcu_sheaf->objects[rcu_sheaf->size++] = obj;
> +
> +       if (likely(rcu_sheaf->size < s->sheaf_capacity)) {
> +               local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +               stat(s, FREE_RCU_SHEAF);
> +               return true;
> +       }
> +
> +       pcs->rcu_free = NULL;
> +       local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +
> +       call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);
> +
> +       stat(s, FREE_RCU_SHEAF);
> +
> +       return true;
> +}
> +
>  /*
>   * Bulk free objects to the percpu sheaves.
>   * Unlike free_to_pcs() this includes the calls to all necessary hooks
> @@ -6522,6 +6656,11 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
>         struct kmem_cache_node *n;
>
>         flush_all_cpus_locked(s);
> +
> +       /* we might have rcu sheaves in flight */
> +       if (s->cpu_sheaves)
> +               rcu_barrier();
> +
>         /* Attempt to free all objects */
>         for_each_kmem_cache_node(s, node, n) {
>                 if (n->barn)
> @@ -7927,6 +8066,8 @@ STAT_ATTR(ALLOC_PCS, alloc_cpu_sheaf);
>  STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
>  STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
>  STAT_ATTR(FREE_PCS, free_cpu_sheaf);
> +STAT_ATTR(FREE_RCU_SHEAF, free_rcu_sheaf);
> +STAT_ATTR(FREE_RCU_SHEAF_FAIL, free_rcu_sheaf_fail);
>  STAT_ATTR(FREE_FASTPATH, free_fastpath);
>  STAT_ATTR(FREE_SLOWPATH, free_slowpath);
>  STAT_ATTR(FREE_FROZEN, free_frozen);
> @@ -8022,6 +8163,8 @@ static struct attribute *slab_attrs[] = {
>         &alloc_fastpath_attr.attr,
>         &alloc_slowpath_attr.attr,
>         &free_cpu_sheaf_attr.attr,
> +       &free_rcu_sheaf_attr.attr,
> +       &free_rcu_sheaf_fail_attr.attr,
>         &free_fastpath_attr.attr,
>         &free_slowpath_attr.attr,
>         &free_frozen_attr.attr,
>
> --
> 2.48.1
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 00/10] SLUB percpu sheaves
  2025-02-14 16:27 [PATCH RFC v2 00/10] SLUB percpu sheaves Vlastimil Babka
                   ` (10 preceding siblings ...)
  2025-02-14 18:28 ` [PATCH RFC v2 00/10] SLUB percpu sheaves Christoph Lameter (Ampere)
@ 2025-02-23  0:19 ` Kent Overstreet
  2025-02-23  4:44   ` Suren Baghdasaryan
  11 siblings, 1 reply; 55+ messages in thread
From: Kent Overstreet @ 2025-02-23  0:19 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki,
	linux-mm, linux-kernel, rcu, maple-tree,
	Sebastian Andrzej Siewior, Alexei Starovoitov

On Fri, Feb 14, 2025 at 05:27:36PM +0100, Vlastimil Babka wrote:
> - Cheaper fast paths. For allocations, instead of local double cmpxchg,
>   after Patch 5 it's preempt_disable() and no atomic operations. Same for
>   freeing, which is normally a local double cmpxchg only for a short
>   term allocations (so the same slab is still active on the same cpu when
>   freeing the object) and a more costly locked double cmpxchg otherwise.
>   The downside is the lack of NUMA locality guarantees for the allocated
>   objects.

Is that really cheaper than a local non locked double cmpxchg?

Especially if you now have to use pushf/popf...

> - kfree_rcu() batching and recycling. kfree_rcu() will put objects to a
>   separate percpu sheaf and only submit the whole sheaf to call_rcu()
>   when full. After the grace period, the sheaf can be used for
>   allocations, which is more efficient than freeing and reallocating
>   individual slab objects (even with the batching done by kfree_rcu()
>   implementation itself). In case only some cpus are allowed to handle rcu
>   callbacks, the sheaf can still be made available to other cpus on the
>   same node via the shared barn. The maple_node cache uses kfree_rcu() and
>   thus can benefit from this.

Have you looked at fs/bcachefs/rcu_pending.c?


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 05/10] slab: switch percpu sheaves locking to localtry_lock
  2025-02-14 16:27 ` [PATCH RFC v2 05/10] slab: switch percpu sheaves locking to localtry_lock Vlastimil Babka
@ 2025-02-23  2:33   ` Suren Baghdasaryan
  2025-02-24 13:08   ` Harry Yoo
  1 sibling, 0 replies; 55+ messages in thread
From: Suren Baghdasaryan @ 2025-02-23  2:33 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Fri, Feb 14, 2025 at 8:27 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> Instead of local_lock_irqsave(), use localtry_trylock() when potential
> callers include irq context, and localtry_lock() otherwise (such as when
> we already know the gfp flags allow blocking).
>
> This should reduce the locking (due to irq disabling/enabling) overhead.
> Failing to use percpu sheaves in an irq due to preempting an already
> locked user of sheaves should be rare so it's a favorable tradeoff.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  mm/slub.c | 122 ++++++++++++++++++++++++++++++++++++++------------------------
>  1 file changed, 76 insertions(+), 46 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 40175747212fefb27137309b27571abe8d0966e2..3d7345e7e938d53950ed0d6abe8eb0e93cf8f5b1 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -450,7 +450,7 @@ struct slab_sheaf {
>  };
>
>  struct slub_percpu_sheaves {
> -       local_lock_t lock;
> +       localtry_lock_t lock;
>         struct slab_sheaf *main; /* never NULL when unlocked */
>         struct slab_sheaf *spare; /* empty or full, may be NULL */
>         struct slab_sheaf *rcu_free;
> @@ -2529,16 +2529,19 @@ static struct slab_sheaf *alloc_full_sheaf(struct kmem_cache *s, gfp_t gfp)
>
>  static void __kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);
>
> -static void sheaf_flush_main(struct kmem_cache *s)
> +/* returns true if at least partially flushed */
> +static bool sheaf_flush_main(struct kmem_cache *s)
>  {
>         struct slub_percpu_sheaves *pcs;
>         unsigned int batch, remaining;
>         void *objects[PCS_BATCH_MAX];
>         struct slab_sheaf *sheaf;
> -       unsigned long flags;
> +       bool ret = false;
>
>  next_batch:
> -       local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> +       if (!localtry_trylock(&s->cpu_sheaves->lock))
> +               return ret;
> +
>         pcs = this_cpu_ptr(s->cpu_sheaves);
>         sheaf = pcs->main;
>
> @@ -2549,14 +2552,18 @@ static void sheaf_flush_main(struct kmem_cache *s)
>
>         remaining = sheaf->size;
>
> -       local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +       localtry_unlock(&s->cpu_sheaves->lock);
>
>         __kmem_cache_free_bulk(s, batch, &objects[0]);
>
>         stat_add(s, SHEAF_FLUSH_MAIN, batch);
>
> +       ret = true;
> +
>         if (remaining)
>                 goto next_batch;
> +
> +       return ret;
>  }
>
>  static void sheaf_flush(struct kmem_cache *s, struct slab_sheaf *sheaf)
> @@ -2593,6 +2600,8 @@ static void rcu_free_sheaf_nobarn(struct rcu_head *head)
>   * Caller needs to make sure migration is disabled in order to fully flush
>   * single cpu's sheaves
>   *
> + * must not be called from an irq
> + *
>   * flushing operations are rare so let's keep it simple and flush to slabs
>   * directly, skipping the barn
>   */
> @@ -2600,9 +2609,8 @@ static void pcs_flush_all(struct kmem_cache *s)
>  {
>         struct slub_percpu_sheaves *pcs;
>         struct slab_sheaf *spare, *rcu_free;
> -       unsigned long flags;
>
> -       local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> +       localtry_lock(&s->cpu_sheaves->lock);
>         pcs = this_cpu_ptr(s->cpu_sheaves);
>
>         spare = pcs->spare;
> @@ -2611,7 +2619,7 @@ static void pcs_flush_all(struct kmem_cache *s)
>         rcu_free = pcs->rcu_free;
>         pcs->rcu_free = NULL;
>
> -       local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +       localtry_unlock(&s->cpu_sheaves->lock);
>
>         if (spare) {
>                 sheaf_flush(s, spare);
> @@ -4554,10 +4562,11 @@ static __fastpath_inline
>  void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
>  {
>         struct slub_percpu_sheaves *pcs;
> -       unsigned long flags;
>         void *object;
>
> -       local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> +       if (!localtry_trylock(&s->cpu_sheaves->lock))
> +               return NULL;
> +
>         pcs = this_cpu_ptr(s->cpu_sheaves);
>
>         if (unlikely(pcs->main->size == 0)) {
> @@ -4590,7 +4599,7 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
>                         }
>                 }
>
> -               local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +               localtry_unlock(&s->cpu_sheaves->lock);
>
>                 if (!can_alloc)
>                         return NULL;
> @@ -4612,7 +4621,11 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
>                 if (!full)
>                         return NULL;
>
> -               local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> +               /*
> +                * we can reach here only when gfpflags_allow_blocking
> +                * so this must not be an irq
> +                */
> +               localtry_lock(&s->cpu_sheaves->lock);
>                 pcs = this_cpu_ptr(s->cpu_sheaves);
>
>                 /*
> @@ -4646,7 +4659,7 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
>  do_alloc:
>         object = pcs->main->objects[--pcs->main->size];
>
> -       local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +       localtry_unlock(&s->cpu_sheaves->lock);
>
>         stat(s, ALLOC_PCS);
>
> @@ -4658,12 +4671,13 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
>  {
>         struct slub_percpu_sheaves *pcs;
>         struct slab_sheaf *main;
> -       unsigned long flags;
>         unsigned int allocated = 0;
>         unsigned int batch;
>
>  next_batch:
> -       local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> +       if (!localtry_trylock(&s->cpu_sheaves->lock))
> +               return allocated;
> +
>         pcs = this_cpu_ptr(s->cpu_sheaves);
>
>         if (unlikely(pcs->main->size == 0)) {
> @@ -4683,7 +4697,7 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
>                         goto do_alloc;
>                 }
>
> -               local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +               localtry_unlock(&s->cpu_sheaves->lock);
>
>                 /*
>                  * Once full sheaves in barn are depleted, let the bulk
> @@ -4701,7 +4715,7 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
>         main->size -= batch;
>         memcpy(p, main->objects + main->size, batch * sizeof(void *));
>
> -       local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +       localtry_unlock(&s->cpu_sheaves->lock);
>
>         stat_add(s, ALLOC_PCS, batch);
>
> @@ -5121,13 +5135,14 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>   * The object is expected to have passed slab_free_hook() already.
>   */
>  static __fastpath_inline
> -void free_to_pcs(struct kmem_cache *s, void *object)
> +bool free_to_pcs(struct kmem_cache *s, void *object)
>  {
>         struct slub_percpu_sheaves *pcs;
> -       unsigned long flags;
>
>  restart:
> -       local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> +       if (!localtry_trylock(&s->cpu_sheaves->lock))
> +               return false;
> +
>         pcs = this_cpu_ptr(s->cpu_sheaves);
>
>         if (unlikely(pcs->main->size == s->sheaf_capacity)) {
> @@ -5162,7 +5177,7 @@ void free_to_pcs(struct kmem_cache *s, void *object)
>                         struct slab_sheaf *to_flush = pcs->spare;
>
>                         pcs->spare = NULL;
> -                       local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +                       localtry_unlock(&s->cpu_sheaves->lock);
>
>                         sheaf_flush(s, to_flush);
>                         empty = to_flush;
> @@ -5170,17 +5185,27 @@ void free_to_pcs(struct kmem_cache *s, void *object)
>                 }
>
>  alloc_empty:
> -               local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +               localtry_unlock(&s->cpu_sheaves->lock);
>
>                 empty = alloc_empty_sheaf(s, GFP_NOWAIT);
>
>                 if (!empty) {
> -                       sheaf_flush_main(s);
> -                       goto restart;
> +                       if (sheaf_flush_main(s))
> +                               goto restart;
> +                       else
> +                               return false;
>                 }
>
>  got_empty:
> -               local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> +               if (!localtry_trylock(&s->cpu_sheaves->lock)) {
> +                       struct node_barn *barn;
> +
> +                       barn = get_node(s, numa_mem_id())->barn;
> +
> +                       barn_put_empty_sheaf(barn, empty, true);
> +                       return false;
> +               }
> +
>                 pcs = this_cpu_ptr(s->cpu_sheaves);
>
>                 /*
> @@ -5209,9 +5234,11 @@ void free_to_pcs(struct kmem_cache *s, void *object)
>  do_free:
>         pcs->main->objects[pcs->main->size++] = object;
>
> -       local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +       localtry_unlock(&s->cpu_sheaves->lock);
>
>         stat(s, FREE_PCS);
> +
> +       return true;
>  }
>
>  static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
> @@ -5270,9 +5297,10 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
>  {
>         struct slub_percpu_sheaves *pcs;
>         struct slab_sheaf *rcu_sheaf;
> -       unsigned long flags;
>
> -       local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> +       if (!localtry_trylock(&s->cpu_sheaves->lock))
> +               goto fail;
> +
>         pcs = this_cpu_ptr(s->cpu_sheaves);
>
>         if (unlikely(!pcs->rcu_free)) {
> @@ -5286,16 +5314,16 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
>                         goto do_free;
>                 }
>
> -               local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +               localtry_unlock(&s->cpu_sheaves->lock);
>
>                 empty = alloc_empty_sheaf(s, GFP_NOWAIT);
>
> -               if (!empty) {
> -                       stat(s, FREE_RCU_SHEAF_FAIL);
> -                       return false;
> -               }
> +               if (!empty)
> +                       goto fail;
> +
> +               if (!localtry_trylock(&s->cpu_sheaves->lock))
> +                       goto fail;
>
> -               local_lock_irqsave(&s->cpu_sheaves->lock, flags);
>                 pcs = this_cpu_ptr(s->cpu_sheaves);
>
>                 if (unlikely(pcs->rcu_free))
> @@ -5311,19 +5339,22 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
>         rcu_sheaf->objects[rcu_sheaf->size++] = obj;
>
>         if (likely(rcu_sheaf->size < s->sheaf_capacity)) {
> -               local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +               localtry_unlock(&s->cpu_sheaves->lock);
>                 stat(s, FREE_RCU_SHEAF);
>                 return true;
>         }
>
>         pcs->rcu_free = NULL;
> -       local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +       localtry_unlock(&s->cpu_sheaves->lock);
>
>         call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);
>
>         stat(s, FREE_RCU_SHEAF);
> -
>         return true;
> +
> +fail:
> +       stat(s, FREE_RCU_SHEAF_FAIL);
> +       return false;
>  }
>
>  /*
> @@ -5335,7 +5366,6 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
>  {
>         struct slub_percpu_sheaves *pcs;
>         struct slab_sheaf *main;
> -       unsigned long flags;
>         unsigned int batch, i = 0;
>         bool init;
>
> @@ -5358,7 +5388,9 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
>         }
>
>  next_batch:
> -       local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> +       if (!localtry_trylock(&s->cpu_sheaves->lock))
> +               goto fallback;
> +
>         pcs = this_cpu_ptr(s->cpu_sheaves);
>
>         if (unlikely(pcs->main->size == s->sheaf_capacity)) {
> @@ -5389,13 +5421,13 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
>                 }
>
>  no_empty:
> -               local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +               localtry_unlock(&s->cpu_sheaves->lock);
>
>                 /*
>                  * if we depleted all empty sheaves in the barn or there are too
>                  * many full sheaves, free the rest to slab pages
>                  */
> -
> +fallback:
>                 __kmem_cache_free_bulk(s, size, p);
>                 return;
>         }
> @@ -5407,7 +5439,7 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
>         memcpy(main->objects + main->size, p, batch * sizeof(void *));
>         main->size += batch;
>
> -       local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +       localtry_unlock(&s->cpu_sheaves->lock);
>
>         stat_add(s, FREE_PCS, batch);
>
> @@ -5507,9 +5539,7 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
>         if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
>                 return;
>
> -       if (s->cpu_sheaves)
> -               free_to_pcs(s, object);
> -       else
> +       if (!s->cpu_sheaves || !free_to_pcs(s, object))
>                 do_slab_free(s, slab, object, object, 1, addr);
>  }
>
> @@ -6288,7 +6318,7 @@ static int init_percpu_sheaves(struct kmem_cache *s)
>
>                 pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
>
> -               local_lock_init(&pcs->lock);
> +               localtry_lock_init(&pcs->lock);
>
>                 nid = cpu_to_mem(cpu);
>
>
> --
> 2.48.1
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 06/10] slab: sheaf prefilling for guaranteed allocations
  2025-02-14 16:27 ` [PATCH RFC v2 06/10] slab: sheaf prefilling for guaranteed allocations Vlastimil Babka
@ 2025-02-23  3:54   ` Suren Baghdasaryan
  2025-02-25  7:30     ` Harry Yoo
  2025-03-12 17:09     ` Vlastimil Babka
  2025-02-25  8:00   ` Harry Yoo
  1 sibling, 2 replies; 55+ messages in thread
From: Suren Baghdasaryan @ 2025-02-23  3:54 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Fri, Feb 14, 2025 at 8:27 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> Add functions for efficient guaranteed allocations e.g. in a critical
> section that cannot sleep, when the exact number of allocations is not
> known beforehand, but an upper limit can be calculated.
>
> kmem_cache_prefill_sheaf() returns a sheaf containing at least given
> number of objects.
>
> kmem_cache_alloc_from_sheaf() will allocate an object from the sheaf
> and is guaranteed not to fail until depleted.
>
> kmem_cache_return_sheaf() is for giving the sheaf back to the slab
> allocator after the critical section. This will also attempt to refill
> it to cache's sheaf capacity for better efficiency of sheaves handling,
> but it's not stricly necessary to succeed.
>
> kmem_cache_refill_sheaf() can be used to refill a previously obtained
> sheaf to requested size. If the current size is sufficient, it does
> nothing. If the requested size exceeds cache's sheaf_capacity and the
> sheaf's current capacity, the sheaf will be replaced with a new one,
> hence the indirect pointer parameter.
>
> kmem_cache_sheaf_size() can be used to query the current size.
>
> The implementation supports requesting sizes that exceed cache's
> sheaf_capacity, but it is not efficient - such sheaves are allocated
> fresh in kmem_cache_prefill_sheaf() and flushed and freed immediately by
> kmem_cache_return_sheaf(). kmem_cache_refill_sheaf() might be expecially

s/expecially/especially

> ineffective when replacing a sheaf with a new one of a larger capacity.
> It is therefore better to size cache's sheaf_capacity accordingly.

If support for sizes exceeding sheaf_capacity adds much complexity
with no performance benefits, I think it would be ok not to support
them at all. Users know the capacity of a particular kmem_cache, so
they can use this API only when their needs are within sheaf_capacity,
otherwise either size the sheaf appropriately or use slab bulk
allocation.

>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  include/linux/slab.h |  16 ++++
>  mm/slub.c            | 227 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 243 insertions(+)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 0e1b25228c77140d05b5b4433c9d7923de36ec05..dd01b67982e856b1b02f4f0e6fc557726e7f02a8 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -829,6 +829,22 @@ void *kmem_cache_alloc_node_noprof(struct kmem_cache *s, gfp_t flags,
>                                    int node) __assume_slab_alignment __malloc;
>  #define kmem_cache_alloc_node(...)     alloc_hooks(kmem_cache_alloc_node_noprof(__VA_ARGS__))
>
> +struct slab_sheaf *
> +kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size);
> +
> +int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
> +               struct slab_sheaf **sheafp, unsigned int size);
> +
> +void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
> +                                      struct slab_sheaf *sheaf);
> +
> +void *kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *cachep, gfp_t gfp,
> +                       struct slab_sheaf *sheaf) __assume_slab_alignment __malloc;
> +#define kmem_cache_alloc_from_sheaf(...)       \
> +                       alloc_hooks(kmem_cache_alloc_from_sheaf_noprof(__VA_ARGS__))
> +
> +unsigned int kmem_cache_sheaf_size(struct slab_sheaf *sheaf);
> +
>  /*
>   * These macros allow declaring a kmem_buckets * parameter alongside size, which
>   * can be compiled out with CONFIG_SLAB_BUCKETS=n so that a large number of call
> diff --git a/mm/slub.c b/mm/slub.c
> index 3d7345e7e938d53950ed0d6abe8eb0e93cf8f5b1..c1df7cf22267f28f743404531bef921e25fac086 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -443,6 +443,8 @@ struct slab_sheaf {
>         union {
>                 struct rcu_head rcu_head;
>                 struct list_head barn_list;
> +               /* only used for prefilled sheafs */
> +               unsigned int capacity;
>         };
>         struct kmem_cache *cache;
>         unsigned int size;
> @@ -2735,6 +2737,30 @@ static int barn_put_full_sheaf(struct node_barn *barn, struct slab_sheaf *sheaf,
>         return ret;
>  }
>
> +static struct slab_sheaf *barn_get_full_or_empty_sheaf(struct node_barn *barn)
> +{
> +       struct slab_sheaf *sheaf = NULL;
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&barn->lock, flags);
> +
> +       if (barn->nr_full) {
> +               sheaf = list_first_entry(&barn->sheaves_full, struct slab_sheaf,
> +                                       barn_list);
> +               list_del(&sheaf->barn_list);
> +               barn->nr_full--;
> +       } else if (barn->nr_empty) {
> +               sheaf = list_first_entry(&barn->sheaves_empty,
> +                                        struct slab_sheaf, barn_list);
> +               list_del(&sheaf->barn_list);
> +               barn->nr_empty--;
> +       }
> +
> +       spin_unlock_irqrestore(&barn->lock, flags);
> +
> +       return sheaf;
> +}
> +
>  /*
>   * If a full sheaf is available, return it and put the supplied empty one to
>   * barn. We ignore the limit on empty sheaves as the number of sheaves doesn't
> @@ -4831,6 +4857,207 @@ void *kmem_cache_alloc_node_noprof(struct kmem_cache *s, gfp_t gfpflags, int nod
>  }
>  EXPORT_SYMBOL(kmem_cache_alloc_node_noprof);
>
> +
> +/*
> + * returns a sheaf that has least the requested size
> + * when prefilling is needed, do so with given gfp flags
> + *
> + * return NULL if sheaf allocation or prefilling failed
> + */
> +struct slab_sheaf *
> +kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +       struct slab_sheaf *sheaf = NULL;
> +
> +       if (unlikely(size > s->sheaf_capacity)) {
> +               sheaf = kzalloc(struct_size(sheaf, objects, size), gfp);
> +               if (!sheaf)
> +                       return NULL;
> +
> +               sheaf->cache = s;
> +               sheaf->capacity = size;

After reviewing the code I would advocate that we support only shaves
of s->sheaf_capacity, unless we have a real usecase requiring
sheaf->capacity != s->sheaf_capacity.

> +
> +               if (!__kmem_cache_alloc_bulk(s, gfp, size,
> +                                            &sheaf->objects[0])) {
> +                       kfree(sheaf);
> +                       return NULL;
> +               }
> +
> +               sheaf->size = size;
> +
> +               return sheaf;
> +       }
> +
> +       localtry_lock(&s->cpu_sheaves->lock);
> +       pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +       if (pcs->spare) {
> +               sheaf = pcs->spare;
> +               pcs->spare = NULL;
> +       }
> +
> +       if (!sheaf)
> +               sheaf = barn_get_full_or_empty_sheaf(pcs->barn);
> +
> +       localtry_unlock(&s->cpu_sheaves->lock);
> +
> +       if (!sheaf) {
> +               sheaf = alloc_empty_sheaf(s, gfp);
> +       }
> +
> +       if (sheaf && sheaf->size < size) {
> +               if (refill_sheaf(s, sheaf, gfp)) {
> +                       sheaf_flush(s, sheaf);
> +                       free_empty_sheaf(s, sheaf);
> +                       sheaf = NULL;
> +               }
> +       }
> +
> +       if (sheaf)
> +               sheaf->capacity = s->sheaf_capacity;
> +
> +       return sheaf;
> +}
> +
> +/*
> + * Use this to return a sheaf obtained by kmem_cache_prefill_sheaf()
> + * It tries to refill the sheaf back to the cache's sheaf_capacity
> + * to avoid handling partially full sheaves.
> + *
> + * If the refill fails because gfp is e.g. GFP_NOWAIT, the sheaf is
> + * instead dissolved

Refilling the sheaf here assumes that in the future we are more likely
to allocate than to free objects or shrink the slab. If the reverse is
true then it would make sense to flush the sheaf and add it as an
empty one into the barn. The fact that flushing can't fail would be
another advantage... We don't know the future but should we be
predicting a more costly case?

> + */
> +void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
> +                            struct slab_sheaf *sheaf)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +       bool refill = false;
> +       struct node_barn *barn;
> +
> +       if (unlikely(sheaf->capacity != s->sheaf_capacity)) {
> +               sheaf_flush(s, sheaf);
> +               kfree(sheaf);
> +               return;
> +       }
> +
> +       localtry_lock(&s->cpu_sheaves->lock);
> +       pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +       if (!pcs->spare) {
> +               pcs->spare = sheaf;
> +               sheaf = NULL;
> +       } else if (pcs->barn->nr_full >= MAX_FULL_SHEAVES) {
> +               /* racy check */
> +               barn = pcs->barn;
> +               refill = true;
> +       }
> +
> +       localtry_unlock(&s->cpu_sheaves->lock);
> +
> +       if (!sheaf)
> +               return;
> +
> +       /*
> +        * if the barn is full of full sheaves or we fail to refill the sheaf,
> +        * simply flush and free it
> +        */
> +       if (!refill || refill_sheaf(s, sheaf, gfp)) {
> +               sheaf_flush(s, sheaf);
> +               free_empty_sheaf(s, sheaf);
> +               return;
> +       }
> +
> +       /* we racily determined the sheaf would fit, so now force it */
> +       barn_put_full_sheaf(barn, sheaf, true);
> +}
> +
> +/*
> + * refill a sheaf previously returned by kmem_cache_prefill_sheaf to at least
> + * the given size
> + *
> + * the sheaf might be replaced by a new one when requesting more than
> + * s->sheaf_capacity objects if such replacement is necessary, but the refill
> + * fails (with -ENOMEM), the existing sheaf is left intact
> + */
> +int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
> +                           struct slab_sheaf **sheafp, unsigned int size)
> +{
> +       struct slab_sheaf *sheaf;
> +
> +       /*
> +        * TODO: do we want to support *sheaf == NULL to be equivalent of
> +        * kmem_cache_prefill_sheaf() ?
> +        */
> +       if (!sheafp || !(*sheafp))
> +               return -EINVAL;
> +
> +       sheaf = *sheafp;
> +       if (sheaf->size >= size)
> +               return 0;
> +
> +       if (likely(sheaf->capacity >= size)) {
> +               if (likely(sheaf->capacity == s->sheaf_capacity))
> +                       return refill_sheaf(s, sheaf, gfp);
> +
> +               if (!__kmem_cache_alloc_bulk(s, gfp, sheaf->capacity - sheaf->size,
> +                                            &sheaf->objects[sheaf->size])) {
> +                       return -ENOMEM;
> +               }
> +               sheaf->size = sheaf->capacity;
> +
> +               return 0;
> +       }
> +
> +       /*
> +        * We had a regular sized sheaf and need an oversize one, or we had an
> +        * oversize one already but need a larger one now.
> +        * This should be a very rare path so let's not complicate it.
> +        */
> +       sheaf = kmem_cache_prefill_sheaf(s, gfp, size);

WIth all the above I think you always end up refilling up to
sheaf->capacity. Not sure if we should mention that in the comment for
this function because your statement about refilling to at least the
given size is still correct.

> +       if (!sheaf)
> +               return -ENOMEM;
> +
> +       kmem_cache_return_sheaf(s, gfp, *sheafp);
> +       *sheafp = sheaf;
> +       return 0;
> +}
> +
> +/*
> + * Allocate from a sheaf obtained by kmem_cache_prefill_sheaf()
> + *
> + * Guaranteed not to fail as many allocations as was the requested size.
> + * After the sheaf is emptied, it fails - no fallback to the slab cache itself.
> + *
> + * The gfp parameter is meant only to specify __GFP_ZERO or __GFP_ACCOUNT
> + * memcg charging is forced over limit if necessary, to avoid failure.
> + */
> +void *
> +kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *s, gfp_t gfp,
> +                                  struct slab_sheaf *sheaf)
> +{
> +       void *ret = NULL;
> +       bool init;
> +
> +       if (sheaf->size == 0)
> +               goto out;
> +
> +       ret = sheaf->objects[--sheaf->size];
> +
> +       init = slab_want_init_on_alloc(gfp, s);
> +
> +       /* add __GFP_NOFAIL to force successful memcg charging */
> +       slab_post_alloc_hook(s, NULL, gfp | __GFP_NOFAIL, 1, &ret, init, s->object_size);
> +out:
> +       trace_kmem_cache_alloc(_RET_IP_, ret, s, gfp, NUMA_NO_NODE);
> +
> +       return ret;
> +}
> +
> +unsigned int kmem_cache_sheaf_size(struct slab_sheaf *sheaf)
> +{
> +       return sheaf->size;
> +}
>  /*
>   * To avoid unnecessary overhead, we pass through large allocation requests
>   * directly to the page allocator. We use __GFP_COMP, because we will need to
>
> --
> 2.48.1
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 07/10] slab: determine barn status racily outside of lock
  2025-02-14 16:27 ` [PATCH RFC v2 07/10] slab: determine barn status racily outside of lock Vlastimil Babka
@ 2025-02-23  4:00   ` Suren Baghdasaryan
  2025-02-25  8:54   ` Harry Yoo
  1 sibling, 0 replies; 55+ messages in thread
From: Suren Baghdasaryan @ 2025-02-23  4:00 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Fri, Feb 14, 2025 at 8:27 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> The possibility of many barn operations is determined by the current
> number of full or empty sheaves. Taking the barn->lock just to find out
> that e.g. there are no empty sheaves results in unnecessary overhead and
> lock contention. Thus perform these checks outside of the lock with a
> data_race() annotated variable read and fail quickly without taking the
> lock.
>
> Checks for sheaf availability that racily succeed have to be obviously
> repeated under the lock for correctness, but we can skip repeating
> checks if there are too many sheaves on the given list as the limits
> don't need to be strict.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  mm/slub.c | 57 ++++++++++++++++++++++++++++++++++-----------------------
>  1 file changed, 34 insertions(+), 23 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index c1df7cf22267f28f743404531bef921e25fac086..72e6437f1d74bfacbb1cd7642af42929c48cc66a 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2685,9 +2685,12 @@ static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn)
>         struct slab_sheaf *empty = NULL;
>         unsigned long flags;
>
> +       if (!data_race(barn->nr_empty))
> +               return NULL;
> +
>         spin_lock_irqsave(&barn->lock, flags);
>
> -       if (barn->nr_empty) {
> +       if (likely(barn->nr_empty)) {
>                 empty = list_first_entry(&barn->sheaves_empty,
>                                          struct slab_sheaf, barn_list);
>                 list_del(&empty->barn_list);
> @@ -2703,38 +2706,36 @@ static int barn_put_empty_sheaf(struct node_barn *barn,
>                                 struct slab_sheaf *sheaf, bool ignore_limit)
>  {
>         unsigned long flags;
> -       int ret = 0;
> +
> +       /* we don't repeat the check under barn->lock as it's not critical */
> +       if (!ignore_limit && data_race(barn->nr_empty) >= MAX_EMPTY_SHEAVES)
> +               return -E2BIG;
>
>         spin_lock_irqsave(&barn->lock, flags);
>
> -       if (!ignore_limit && barn->nr_empty >= MAX_EMPTY_SHEAVES) {
> -               ret = -E2BIG;
> -       } else {
> -               list_add(&sheaf->barn_list, &barn->sheaves_empty);
> -               barn->nr_empty++;
> -       }
> +       list_add(&sheaf->barn_list, &barn->sheaves_empty);
> +       barn->nr_empty++;
>
>         spin_unlock_irqrestore(&barn->lock, flags);
> -       return ret;
> +       return 0;
>  }
>
>  static int barn_put_full_sheaf(struct node_barn *barn, struct slab_sheaf *sheaf,
>                                bool ignore_limit)
>  {
>         unsigned long flags;
> -       int ret = 0;
> +
> +       /* we don't repeat the check under barn->lock as it's not critical */
> +       if (!ignore_limit && data_race(barn->nr_full) >= MAX_FULL_SHEAVES)
> +               return -E2BIG;
>
>         spin_lock_irqsave(&barn->lock, flags);
>
> -       if (!ignore_limit && barn->nr_full >= MAX_FULL_SHEAVES) {
> -               ret = -E2BIG;
> -       } else {
> -               list_add(&sheaf->barn_list, &barn->sheaves_full);
> -               barn->nr_full++;
> -       }
> +       list_add(&sheaf->barn_list, &barn->sheaves_full);
> +       barn->nr_full++;
>
>         spin_unlock_irqrestore(&barn->lock, flags);
> -       return ret;
> +       return 0;
>  }
>
>  static struct slab_sheaf *barn_get_full_or_empty_sheaf(struct node_barn *barn)
> @@ -2742,6 +2743,9 @@ static struct slab_sheaf *barn_get_full_or_empty_sheaf(struct node_barn *barn)
>         struct slab_sheaf *sheaf = NULL;
>         unsigned long flags;
>
> +       if (!data_race(barn->nr_full) && !data_race(barn->nr_empty))
> +               return NULL;
> +
>         spin_lock_irqsave(&barn->lock, flags);
>
>         if (barn->nr_full) {
> @@ -2772,9 +2776,12 @@ barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty)
>         struct slab_sheaf *full = NULL;
>         unsigned long flags;
>
> +       if (!data_race(barn->nr_full))
> +               return NULL;
> +
>         spin_lock_irqsave(&barn->lock, flags);
>
> -       if (barn->nr_full) {
> +       if (likely(barn->nr_full)) {
>                 full = list_first_entry(&barn->sheaves_full, struct slab_sheaf,
>                                         barn_list);
>                 list_del(&full->barn_list);
> @@ -2797,19 +2804,23 @@ barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
>         struct slab_sheaf *empty;
>         unsigned long flags;
>
> +       /* we don't repeat this check under barn->lock as it's not critical */
> +       if (data_race(barn->nr_full) >= MAX_FULL_SHEAVES)
> +               return ERR_PTR(-E2BIG);
> +       if (!data_race(barn->nr_empty))
> +               return ERR_PTR(-ENOMEM);
> +
>         spin_lock_irqsave(&barn->lock, flags);
>
> -       if (barn->nr_full >= MAX_FULL_SHEAVES) {
> -               empty = ERR_PTR(-E2BIG);
> -       } else if (!barn->nr_empty) {
> -               empty = ERR_PTR(-ENOMEM);
> -       } else {
> +       if (likely(barn->nr_empty)) {
>                 empty = list_first_entry(&barn->sheaves_empty, struct slab_sheaf,
>                                          barn_list);
>                 list_del(&empty->barn_list);
>                 list_add(&full->barn_list, &barn->sheaves_full);
>                 barn->nr_empty--;
>                 barn->nr_full++;
> +       } else {
> +               empty = ERR_PTR(-ENOMEM);
>         }
>
>         spin_unlock_irqrestore(&barn->lock, flags);
>
> --
> 2.48.1
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 08/10] tools: Add testing support for changes to rcu and slab for sheaves
  2025-02-14 16:27 ` [PATCH RFC v2 08/10] tools: Add testing support for changes to rcu and slab for sheaves Vlastimil Babka
@ 2025-02-23  4:24   ` Suren Baghdasaryan
  0 siblings, 0 replies; 55+ messages in thread
From: Suren Baghdasaryan @ 2025-02-23  4:24 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Fri, Feb 14, 2025 at 8:27 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> Make testing work for the slab and rcu changes that have come in with
> the sheaves work.
>
> This only works with one kmem_cache, and only the first one used.
> Subsequent setting of keme_cache will not update the active kmem_cache

s/keme_cache/kmem_cache


> and will be silently dropped because there are other tests which happen
> after the kmem_cache of interest is set.
>
> The saved active kmem_cache is used in the rcu callback, which passes
> the object to be freed.
>
> The rcu call takes the rcu_head, which is passed in as the field in the
> struct (in this case rcu in the maple tree node), which is calculated by
> pointer math.  The offset of which is saved (in a global variable) for
> restoring the node pointer on the callback after the rcu grace period
> expires.
>
> Don't use any of this outside of testing, please.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> ---
>  tools/include/linux/slab.h            | 41 ++++++++++++++++++++++++++++++++---
>  tools/testing/shared/linux.c          | 24 ++++++++++++++++----
>  tools/testing/shared/linux/rcupdate.h | 22 +++++++++++++++++++
>  3 files changed, 80 insertions(+), 7 deletions(-)
>
> diff --git a/tools/include/linux/slab.h b/tools/include/linux/slab.h
> index 51b25e9c4ec7b66bdf4c68cc1353c6faf1ca7bb8..a475364cfd9fcdb10db252aab18ea3a620326b6b 100644
> --- a/tools/include/linux/slab.h
> +++ b/tools/include/linux/slab.h
> @@ -22,6 +22,12 @@ enum slab_state {
>         FULL
>  };
>
> +struct kmem_cache_args {
> +       unsigned int align;
> +       unsigned int sheaf_capacity;
> +       void (*ctor)(void *);
> +};
> +
>  static inline void *kzalloc(size_t size, gfp_t gfp)
>  {
>         return kmalloc(size, gfp | __GFP_ZERO);
> @@ -36,9 +42,38 @@ static inline void *kmem_cache_alloc(struct kmem_cache *cachep, int flags)
>  }
>  void kmem_cache_free(struct kmem_cache *cachep, void *objp);
>
> -struct kmem_cache *kmem_cache_create(const char *name, unsigned int size,
> -                       unsigned int align, unsigned int flags,
> -                       void (*ctor)(void *));
> +
> +struct kmem_cache *
> +__kmem_cache_create_args(const char *name, unsigned int size,
> +               struct kmem_cache_args *args, unsigned int flags);
> +
> +/* If NULL is passed for @args, use this variant with default arguments. */
> +static inline struct kmem_cache *
> +__kmem_cache_default_args(const char *name, unsigned int size,
> +               struct kmem_cache_args *args, unsigned int flags)
> +{
> +       struct kmem_cache_args kmem_default_args = {};
> +
> +       return __kmem_cache_create_args(name, size, &kmem_default_args, flags);
> +}
> +
> +static inline struct kmem_cache *
> +__kmem_cache_create(const char *name, unsigned int size, unsigned int align,
> +               unsigned int flags, void (*ctor)(void *))
> +{
> +       struct kmem_cache_args kmem_args = {
> +               .align  = align,
> +               .ctor   = ctor,
> +       };
> +
> +       return __kmem_cache_create_args(name, size, &kmem_args, flags);
> +}
> +
> +#define kmem_cache_create(__name, __object_size, __args, ...)           \
> +       _Generic((__args),                                              \
> +               struct kmem_cache_args *: __kmem_cache_create_args,     \
> +               void *: __kmem_cache_default_args,                      \
> +               default: __kmem_cache_create)(__name, __object_size, __args, __VA_ARGS__)
>
>  void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t size, void **list);
>  int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
> diff --git a/tools/testing/shared/linux.c b/tools/testing/shared/linux.c
> index 66dbb362385f3c3d923233448cc591adfe6dc9e7..9f5fd722f27f1d3877be8927be30409cd74ab3c3 100644
> --- a/tools/testing/shared/linux.c
> +++ b/tools/testing/shared/linux.c
> @@ -20,6 +20,7 @@ struct kmem_cache {
>         pthread_mutex_t lock;
>         unsigned int size;
>         unsigned int align;
> +       unsigned int sheaf_capacity;
>         int nr_objs;
>         void *objs;
>         void (*ctor)(void *);
> @@ -31,6 +32,8 @@ struct kmem_cache {
>         void *private;
>  };
>
> +static struct kmem_cache *kmem_active = NULL;
> +
>  void kmem_cache_set_callback(struct kmem_cache *cachep, void (*callback)(void *))
>  {
>         cachep->callback = callback;
> @@ -147,6 +150,14 @@ void kmem_cache_free(struct kmem_cache *cachep, void *objp)
>         pthread_mutex_unlock(&cachep->lock);
>  }
>
> +void kmem_cache_free_active(void *objp)
> +{
> +       if (!kmem_active)
> +               printf("WARNING: No active kmem_cache\n");
> +
> +       kmem_cache_free(kmem_active, objp);
> +}
> +
>  void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t size, void **list)
>  {
>         if (kmalloc_verbose)
> @@ -234,23 +245,28 @@ int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
>  }
>
>  struct kmem_cache *
> -kmem_cache_create(const char *name, unsigned int size, unsigned int align,
> -               unsigned int flags, void (*ctor)(void *))
> +__kmem_cache_create_args(const char *name, unsigned int size,
> +                         struct kmem_cache_args *args,
> +                         unsigned int flags)
>  {
>         struct kmem_cache *ret = malloc(sizeof(*ret));
>
>         pthread_mutex_init(&ret->lock, NULL);
>         ret->size = size;
> -       ret->align = align;
> +       ret->align = args->align;
> +       ret->sheaf_capacity = args->sheaf_capacity;
>         ret->nr_objs = 0;
>         ret->nr_allocated = 0;
>         ret->nr_tallocated = 0;
>         ret->objs = NULL;
> -       ret->ctor = ctor;
> +       ret->ctor = args->ctor;
>         ret->non_kernel = 0;
>         ret->exec_callback = false;
>         ret->callback = NULL;
>         ret->private = NULL;
> +       if (!kmem_active)
> +               kmem_active = ret;

This kmem_active and kfree_cb_offset look like bad hacks... Could we
maybe modify kmem_cache_alloc() to allocate a small metadata at the
beginning to store a pointer to kmem_cache and kfree_cb_offset value?

> +
>         return ret;
>  }
>
> diff --git a/tools/testing/shared/linux/rcupdate.h b/tools/testing/shared/linux/rcupdate.h
> index fed468fb0c78db6f33fb1900c7110ab5f3c19c65..c95e2f0bbd93798e544d7d34e0823ed68414f924 100644
> --- a/tools/testing/shared/linux/rcupdate.h
> +++ b/tools/testing/shared/linux/rcupdate.h
> @@ -9,4 +9,26 @@
>  #define rcu_dereference_check(p, cond) rcu_dereference(p)
>  #define RCU_INIT_POINTER(p, v) do { (p) = (v); } while (0)
>
> +void kmem_cache_free_active(void *objp);
> +static unsigned long kfree_cb_offset = 0;
> +
> +static inline void kfree_rcu_cb(struct rcu_head *head)
> +{
> +       void *objp = (void *) ((unsigned long)head - kfree_cb_offset);
> +
> +       kmem_cache_free_active(objp);
> +}
> +
> +#ifndef offsetof
> +#define offsetof(TYPE, MEMBER) __builtin_offsetof(TYPE, MEMBER)
> +#endif
> +
> +#define kfree_rcu(ptr, rhv)                                            \
> +do {                                                                   \
> +       if (!kfree_cb_offset)                                           \
> +               kfree_cb_offset = offsetof(typeof(*(ptr)), rhv);        \
> +                                                                       \
> +       call_rcu(&ptr->rhv, kfree_rcu_cb);                              \
> +} while (0)
> +
>  #endif
>
> --
> 2.48.1
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 10/10] maple_tree: use percpu sheaves for maple_node_cache
  2025-02-14 16:27 ` [PATCH RFC v2 10/10] maple_tree: use percpu sheaves for maple_node_cache Vlastimil Babka
@ 2025-02-23  4:27   ` Suren Baghdasaryan
  0 siblings, 0 replies; 55+ messages in thread
From: Suren Baghdasaryan @ 2025-02-23  4:27 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Fri, Feb 14, 2025 at 8:28 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> Setup the maple_node_cache with percpu sheaves of size 32 to hopefully
> improve its performance.

I guess 32 might change in the future based on further testing?

> Change the single node rcu freeing in
> ma_free_rcu() to use kfree_rcu() instead of the custom callback, which
> allows the rcu_free sheaf batching to be used. Note there are other
> users of mt_free_rcu() where larger parts of maple tree are submitted to
> call_rcu() as a whole, and that cannot use the rcu_free sheaf, but it's
> still possible for maple nodes freed this way to be reused via the barn,
> even if only some cpus are allowed to process rcu callbacks.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  lib/maple_tree.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index f7153ade1be5f16423f0ca073846a7f3dfa60523..56e7a00f6f0941bff163091c999a873e4273f071 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -208,7 +208,7 @@ static void mt_free_rcu(struct rcu_head *head)
>  static void ma_free_rcu(struct maple_node *node)
>  {
>         WARN_ON(node->parent != ma_parent_ptr(node));
> -       call_rcu(&node->rcu, mt_free_rcu);
> +       kfree_rcu(node, rcu);
>  }
>
>  static void mas_set_height(struct ma_state *mas)
> @@ -6258,9 +6258,14 @@ bool mas_nomem(struct ma_state *mas, gfp_t gfp)
>
>  void __init maple_tree_init(void)
>  {
> +       struct kmem_cache_args args = {
> +               .align  = sizeof(struct maple_node),
> +               .sheaf_capacity = 32,
> +       };
> +
>         maple_node_cache = kmem_cache_create("maple_node",
> -                       sizeof(struct maple_node), sizeof(struct maple_node),
> -                       SLAB_PANIC, NULL);
> +                       sizeof(struct maple_node), &args,
> +                       SLAB_PANIC);
>  }
>
>  /**
>
> --
> 2.48.1
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 00/10] SLUB percpu sheaves
  2025-02-23  0:19 ` Kent Overstreet
@ 2025-02-23  4:44   ` Suren Baghdasaryan
  2025-02-24  1:36     ` Suren Baghdasaryan
  0 siblings, 1 reply; 55+ messages in thread
From: Suren Baghdasaryan @ 2025-02-23  4:44 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Vlastimil Babka, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki,
	linux-mm, linux-kernel, rcu, maple-tree,
	Sebastian Andrzej Siewior, Alexei Starovoitov

On Sat, Feb 22, 2025 at 4:19 PM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Fri, Feb 14, 2025 at 05:27:36PM +0100, Vlastimil Babka wrote:
> > - Cheaper fast paths. For allocations, instead of local double cmpxchg,
> >   after Patch 5 it's preempt_disable() and no atomic operations. Same for
> >   freeing, which is normally a local double cmpxchg only for a short
> >   term allocations (so the same slab is still active on the same cpu when
> >   freeing the object) and a more costly locked double cmpxchg otherwise.
> >   The downside is the lack of NUMA locality guarantees for the allocated
> >   objects.
>
> Is that really cheaper than a local non locked double cmpxchg?

Don't know about this particular part but testing sheaves with maple
node cache and stress testing mmap/munmap syscalls shows performance
benefits as long as there is some delay to let kfree_rcu() do its job.
I'm still gathering results and will most likely post them tomorrow.

>
> Especially if you now have to use pushf/popf...
>
> > - kfree_rcu() batching and recycling. kfree_rcu() will put objects to a
> >   separate percpu sheaf and only submit the whole sheaf to call_rcu()
> >   when full. After the grace period, the sheaf can be used for
> >   allocations, which is more efficient than freeing and reallocating
> >   individual slab objects (even with the batching done by kfree_rcu()
> >   implementation itself). In case only some cpus are allowed to handle rcu
> >   callbacks, the sheaf can still be made available to other cpus on the
> >   same node via the shared barn. The maple_node cache uses kfree_rcu() and
> >   thus can benefit from this.
>
> Have you looked at fs/bcachefs/rcu_pending.c?


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 00/10] SLUB percpu sheaves
  2025-02-23  4:44   ` Suren Baghdasaryan
@ 2025-02-24  1:36     ` Suren Baghdasaryan
  2025-02-24  1:43       ` Suren Baghdasaryan
  2025-02-24 20:53       ` Vlastimil Babka
  0 siblings, 2 replies; 55+ messages in thread
From: Suren Baghdasaryan @ 2025-02-24  1:36 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Vlastimil Babka, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki,
	linux-mm, linux-kernel, rcu, maple-tree,
	Sebastian Andrzej Siewior, Alexei Starovoitov

On Sat, Feb 22, 2025 at 8:44 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Sat, Feb 22, 2025 at 4:19 PM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > On Fri, Feb 14, 2025 at 05:27:36PM +0100, Vlastimil Babka wrote:
> > > - Cheaper fast paths. For allocations, instead of local double cmpxchg,
> > >   after Patch 5 it's preempt_disable() and no atomic operations. Same for
> > >   freeing, which is normally a local double cmpxchg only for a short
> > >   term allocations (so the same slab is still active on the same cpu when
> > >   freeing the object) and a more costly locked double cmpxchg otherwise.
> > >   The downside is the lack of NUMA locality guarantees for the allocated
> > >   objects.
> >
> > Is that really cheaper than a local non locked double cmpxchg?
>
> Don't know about this particular part but testing sheaves with maple
> node cache and stress testing mmap/munmap syscalls shows performance
> benefits as long as there is some delay to let kfree_rcu() do its job.
> I'm still gathering results and will most likely post them tomorrow.

Here are the promised test results:

First I ran an Android app cycle test comparing the baseline against sheaves
used for maple tree nodes (as this patchset implements). I registered about
3% improvement in app launch times, indicating improvement in mmap syscall
performance.
Next I ran an mmap stress test which maps 5 1-page readable file-backed
areas, faults them in and finally unmaps them, timing mmap syscalls.
Repeats that 200000 cycles and reports the total time. Average of 10 such
runs is used as the final result.
3 configurations were tested:

1. Sheaves used for maple tree nodes only (this patchset).

2. Sheaves used for maple tree nodes with vm_lock to vm_refcnt conversion [1].
This patchset avoids allocating additional vm_lock structure on each mmap
syscall and uses TYPESAFE_BY_RCU for vm_area_struct cache.

3. Sheaves used for maple tree nodes and for vm_area_struct cache with vm_lock
to vm_refcnt conversion [1]. For the vm_area_struct cache I had to replace
TYPESAFE_BY_RCU with sheaves, as we can't use both for the same cache.

The values represent the total time it took to perform mmap syscalls, less is
better.

(1)                  baseline       control
Little core       7.58327       6.614939 (-12.77%)
Medium core  2.125315     1.428702 (-32.78%)
Big core          0.514673     0.422948 (-17.82%)

(2)                  baseline      control
Little core       7.58327       5.141478 (-32.20%)
Medium core  2.125315     0.427692 (-79.88%)
Big core          0.514673    0.046642 (-90.94%)

(3)                   baseline      control
Little core        7.58327      4.779624 (-36.97%)
Medium core   2.125315    0.450368 (-78.81%)
Big core           0.514673    0.037776 (-92.66%)

Results in (3) vs (2) indicate that using sheaves for vm_area_struct
yields slightly better averages and I noticed that this was mostly due
to sheaves results missing occasional spikes that worsened
TYPESAFE_BY_RCU averages (the results seemed more stable with
sheaves).

[1] https://lore.kernel.org/all/20250213224655.1680278-1-surenb@google.com/

>
> >
> > Especially if you now have to use pushf/popf...
> >
> > > - kfree_rcu() batching and recycling. kfree_rcu() will put objects to a
> > >   separate percpu sheaf and only submit the whole sheaf to call_rcu()
> > >   when full. After the grace period, the sheaf can be used for
> > >   allocations, which is more efficient than freeing and reallocating
> > >   individual slab objects (even with the batching done by kfree_rcu()
> > >   implementation itself). In case only some cpus are allowed to handle rcu
> > >   callbacks, the sheaf can still be made available to other cpus on the
> > >   same node via the shared barn. The maple_node cache uses kfree_rcu() and
> > >   thus can benefit from this.
> >
> > Have you looked at fs/bcachefs/rcu_pending.c?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 00/10] SLUB percpu sheaves
  2025-02-24  1:36     ` Suren Baghdasaryan
@ 2025-02-24  1:43       ` Suren Baghdasaryan
  2025-02-24 20:53       ` Vlastimil Babka
  1 sibling, 0 replies; 55+ messages in thread
From: Suren Baghdasaryan @ 2025-02-24  1:43 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Vlastimil Babka, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki,
	linux-mm, linux-kernel, rcu, maple-tree,
	Sebastian Andrzej Siewior, Alexei Starovoitov

On Sun, Feb 23, 2025 at 5:36 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Sat, Feb 22, 2025 at 8:44 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Sat, Feb 22, 2025 at 4:19 PM Kent Overstreet
> > <kent.overstreet@linux.dev> wrote:
> > >
> > > On Fri, Feb 14, 2025 at 05:27:36PM +0100, Vlastimil Babka wrote:
> > > > - Cheaper fast paths. For allocations, instead of local double cmpxchg,
> > > >   after Patch 5 it's preempt_disable() and no atomic operations. Same for
> > > >   freeing, which is normally a local double cmpxchg only for a short
> > > >   term allocations (so the same slab is still active on the same cpu when
> > > >   freeing the object) and a more costly locked double cmpxchg otherwise.
> > > >   The downside is the lack of NUMA locality guarantees for the allocated
> > > >   objects.
> > >
> > > Is that really cheaper than a local non locked double cmpxchg?
> >
> > Don't know about this particular part but testing sheaves with maple
> > node cache and stress testing mmap/munmap syscalls shows performance
> > benefits as long as there is some delay to let kfree_rcu() do its job.
> > I'm still gathering results and will most likely post them tomorrow.
>
> Here are the promised test results:
>
> First I ran an Android app cycle test comparing the baseline against sheaves
> used for maple tree nodes (as this patchset implements). I registered about
> 3% improvement in app launch times, indicating improvement in mmap syscall
> performance.
> Next I ran an mmap stress test which maps 5 1-page readable file-backed
> areas, faults them in and finally unmaps them, timing mmap syscalls.

I forgot to mention that I also added a 500us delay after each cycle
described above to give kfree_rcu() a chance to run.

> Repeats that 200000 cycles and reports the total time. Average of 10 such
> runs is used as the final result.
> 3 configurations were tested:
>
> 1. Sheaves used for maple tree nodes only (this patchset).
>
> 2. Sheaves used for maple tree nodes with vm_lock to vm_refcnt conversion [1].
> This patchset avoids allocating additional vm_lock structure on each mmap
> syscall and uses TYPESAFE_BY_RCU for vm_area_struct cache.
>
> 3. Sheaves used for maple tree nodes and for vm_area_struct cache with vm_lock
> to vm_refcnt conversion [1]. For the vm_area_struct cache I had to replace
> TYPESAFE_BY_RCU with sheaves, as we can't use both for the same cache.
>
> The values represent the total time it took to perform mmap syscalls, less is
> better.
>
> (1)                  baseline       control
> Little core       7.58327       6.614939 (-12.77%)
> Medium core  2.125315     1.428702 (-32.78%)
> Big core          0.514673     0.422948 (-17.82%)
>
> (2)                  baseline      control
> Little core       7.58327       5.141478 (-32.20%)
> Medium core  2.125315     0.427692 (-79.88%)
> Big core          0.514673    0.046642 (-90.94%)
>
> (3)                   baseline      control
> Little core        7.58327      4.779624 (-36.97%)
> Medium core   2.125315    0.450368 (-78.81%)
> Big core           0.514673    0.037776 (-92.66%)
>
> Results in (3) vs (2) indicate that using sheaves for vm_area_struct
> yields slightly better averages and I noticed that this was mostly due
> to sheaves results missing occasional spikes that worsened
> TYPESAFE_BY_RCU averages (the results seemed more stable with
> sheaves).
>
> [1] https://lore.kernel.org/all/20250213224655.1680278-1-surenb@google.com/
>
> >
> > >
> > > Especially if you now have to use pushf/popf...
> > >
> > > > - kfree_rcu() batching and recycling. kfree_rcu() will put objects to a
> > > >   separate percpu sheaf and only submit the whole sheaf to call_rcu()
> > > >   when full. After the grace period, the sheaf can be used for
> > > >   allocations, which is more efficient than freeing and reallocating
> > > >   individual slab objects (even with the batching done by kfree_rcu()
> > > >   implementation itself). In case only some cpus are allowed to handle rcu
> > > >   callbacks, the sheaf can still be made available to other cpus on the
> > > >   same node via the shared barn. The maple_node cache uses kfree_rcu() and
> > > >   thus can benefit from this.
> > >
> > > Have you looked at fs/bcachefs/rcu_pending.c?


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 01/10] slab: add opt-in caching layer of percpu sheaves
  2025-02-14 16:27 ` [PATCH RFC v2 01/10] slab: add opt-in caching layer of " Vlastimil Babka
  2025-02-22 22:46   ` Suren Baghdasaryan
@ 2025-02-24  8:04   ` Harry Yoo
  2025-03-12 14:59     ` Vlastimil Babka
  1 sibling, 1 reply; 55+ messages in thread
From: Harry Yoo @ 2025-02-24  8:04 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki,
	linux-mm, linux-kernel, rcu, maple-tree

On Fri, Feb 14, 2025 at 05:27:37PM +0100, Vlastimil Babka wrote:
> Specifying a non-zero value for a new struct kmem_cache_args field
> sheaf_capacity will setup a caching layer of percpu arrays called
> sheaves of given capacity for the created cache.
> 
> Allocations from the cache will allocate via the percpu sheaves (main or
> spare) as long as they have no NUMA node preference. Frees will also
> refill one of the sheaves.
> 
> When both percpu sheaves are found empty during an allocation, an empty
> sheaf may be replaced with a full one from the per-node barn. If none
> are available and the allocation is allowed to block, an empty sheaf is
> refilled from slab(s) by an internal bulk alloc operation. When both
> percpu sheaves are full during freeing, the barn can replace a full one
> with an empty one, unless over a full sheaves limit. In that case a
> sheaf is flushed to slab(s) by an internal bulk free operation. Flushing
> sheaves and barns is also wired to the existing cpu flushing and cache
> shrinking operations.
> 
> The sheaves do not distinguish NUMA locality of the cached objects. If
> an allocation is requested with kmem_cache_alloc_node() with a specific
> node (not NUMA_NO_NODE), sheaves are bypassed.
> 
> The bulk operations exposed to slab users also try to utilize the
> sheaves as long as the necessary (full or empty) sheaves are available
> on the cpu or in the barn. Once depleted, they will fallback to bulk
> alloc/free to slabs directly to avoid double copying.
> 
> Sysfs stat counters alloc_cpu_sheaf and free_cpu_sheaf count objects
> allocated or freed using the sheaves. Counters sheaf_refill,
> sheaf_flush_main and sheaf_flush_other count objects filled or flushed
> from or to slab pages, and can be used to assess how effective the
> caching is. The refill and flush operations will also count towards the
> usual alloc_fastpath/slowpath, free_fastpath/slowpath and other
> counters.
> 
> Access to the percpu sheaves is protected by local_lock_irqsave()
> operations, each per-NUMA-node barn has a spin_lock.
> 
> A current limitation is that when slub_debug is enabled for a cache with
> percpu sheaves, the objects in the array are considered as allocated from
> the slub_debug perspective, and the alloc/free debugging hooks occur
> when moving the objects between the array and slab pages. This means
> that e.g. an use-after-free that occurs for an object cached in the
> array is undetected. Collected alloc/free stacktraces might also be less
> useful. This limitation could be changed in the future.
> 
> On the other hand, KASAN, kmemcg and other hooks are executed on actual
> allocations and frees by kmem_cache users even if those use the array,
> so their debugging or accounting accuracy should be unaffected.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  include/linux/slab.h |  34 ++
>  mm/slab.h            |   2 +
>  mm/slab_common.c     |   5 +-
>  mm/slub.c            | 982 ++++++++++++++++++++++++++++++++++++++++++++++++---
>  4 files changed, 973 insertions(+), 50 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index e8273f28656936c05d015c53923f8fe69cd161b2..c06734912972b799f537359f7fe6a750918ffe9e 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
>  
>  /********************************************************************
>   * 			Core slab cache functions
> +static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
> +{
> +	struct slub_percpu_sheaves *pcs;
> +
> +	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> +
> +	if (pcs->spare) {
> +		sheaf_flush(s, pcs->spare);
> +		free_empty_sheaf(s, pcs->spare);
> +		pcs->spare = NULL;
> +	}
> +
> +	// TODO: handle rcu_free
> +	BUG_ON(pcs->rcu_free);
> +
> +	sheaf_flush_main(s);
> +}

+1 on what Suren mentioned.

> +static void barn_shrink(struct kmem_cache *s, struct node_barn *barn)
> +{
> +	struct list_head empty_list;
> +	struct list_head full_list;
> +	struct slab_sheaf *sheaf, *sheaf2;
> +	unsigned long flags;
> +
> +	INIT_LIST_HEAD(&empty_list);
> +	INIT_LIST_HEAD(&full_list);
> +
> +	spin_lock_irqsave(&barn->lock, flags);
> +
> +	list_splice_init(&barn->sheaves_full, &full_list);
> +	barn->nr_full = 0;
> +	list_splice_init(&barn->sheaves_empty, &empty_list);
> +	barn->nr_empty = 0;
> +
> +	spin_unlock_irqrestore(&barn->lock, flags);
> +
> +	list_for_each_entry_safe(sheaf, sheaf2, &full_list, barn_list) {
> +		sheaf_flush(s, sheaf);
> +		list_move(&sheaf->barn_list, &empty_list);
> +	}

nit: is this list_move() necessary?

> +
> +	list_for_each_entry_safe(sheaf, sheaf2, &empty_list, barn_list)
> +		free_empty_sheaf(s, sheaf);
> +}

Otherwise looks good to me.

-- 
Cheers,
Harry


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 02/10] slab: add sheaf support for batching kfree_rcu() operations
  2025-02-14 16:27 ` [PATCH RFC v2 02/10] slab: add sheaf support for batching kfree_rcu() operations Vlastimil Babka
  2025-02-22 23:08   ` Suren Baghdasaryan
@ 2025-02-24  8:40   ` Harry Yoo
  2025-03-12 16:16     ` Vlastimil Babka
  1 sibling, 1 reply; 55+ messages in thread
From: Harry Yoo @ 2025-02-24  8:40 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki,
	linux-mm, linux-kernel, rcu, maple-tree

On Fri, Feb 14, 2025 at 05:27:38PM +0100, Vlastimil Babka wrote:
> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
> For caches with sheaves, on each cpu maintain a rcu_free sheaf in
> addition to main and spare sheaves.
> 
> kfree_rcu() operations will try to put objects on this sheaf. Once full,
> the sheaf is detached and submitted to call_rcu() with a handler that
> will try to put in in the barn, or flush to slab pages using bulk free,
> when the barn is full. Then a new empty sheaf must be obtained to put
> more objects there.
> 
> It's possible that no free sheaves are available to use for a new
> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
> kfree_rcu() machinery.
> 
> Expected advantages:
> - batching the kfree_rcu() operations, that could eventually replace the
>   existing batching
> - sheaves can be reused for allocations via barn instead of being
>   flushed to slabs, which is more efficient
>   - this includes cases where only some cpus are allowed to process rcu
>     callbacks (Android)
> 
> Possible disadvantage:
> - objects might be waiting for more than their grace period (it is
>   determined by the last object freed into the sheaf), increasing memory
>   usage - but the existing batching does that too?
> 
> Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
> implementation favors smaller memory footprint over performance.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/slab.h        |   2 +
>  mm/slab_common.c |  21 ++++++++
>  mm/slub.c        | 151 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 170 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/slab.h b/mm/slab.h
> index 8daaec53b6ecfc44171191d421adb12e5cba2c58..94e9959e1aefa350d3d74e3f5309fde7a5cf2ec8 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -459,6 +459,8 @@ static inline bool is_kmalloc_normal(struct kmem_cache *s)
>  	return !(s->flags & (SLAB_CACHE_DMA|SLAB_ACCOUNT|SLAB_RECLAIM_ACCOUNT));
>  }
>  
> +bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj);
> +
>  /* Legal flag mask for kmem_cache_create(), for various configurations */
>  #define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | \
>  			 SLAB_CACHE_DMA32 | SLAB_PANIC | \
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index ceeefb287899a82f30ad79b403556001c1860311..c6853450ed74160cfcb497c09f92c1f9f7b12629 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1613,6 +1613,24 @@ static void kfree_rcu_work(struct work_struct *work)
>  		kvfree_rcu_list(head);
>  }
>  
> +static bool kfree_rcu_sheaf(void *obj)
> +{
> +	struct kmem_cache *s;
> +	struct folio *folio;
> +	struct slab *slab;
> +
> +	folio = virt_to_folio(obj);
> +	if (unlikely(!folio_test_slab(folio)))
> +		return false;

Does virt_to_folio() work for vmalloc addresses?
Probably it should check is_vmalloc_addr() first?

Otherwise look good to me.

> +
> +	slab = folio_slab(folio);
> +	s = slab->slab_cache;
> +	if (s->cpu_sheaves)
> +		return __kfree_rcu_sheaf(s, obj);
> +
> +	return false;
> +}
> +
>  static bool
>  need_offload_krc(struct kfree_rcu_cpu *krcp)
>  {
> @@ -1957,6 +1975,9 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
>  	if (!head)
>  		might_sleep();
>  
> +	if (kfree_rcu_sheaf(ptr))
> +		return;
> +
>  	// Queue the object but don't yet schedule the batch.
>  	if (debug_rcu_head_queue(ptr)) {
>  		// Probable double kfree_rcu(), just leak.

-- 
Cheers,
Harry


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 05/10] slab: switch percpu sheaves locking to localtry_lock
  2025-02-14 16:27 ` [PATCH RFC v2 05/10] slab: switch percpu sheaves locking to localtry_lock Vlastimil Babka
  2025-02-23  2:33   ` Suren Baghdasaryan
@ 2025-02-24 13:08   ` Harry Yoo
  1 sibling, 0 replies; 55+ messages in thread
From: Harry Yoo @ 2025-02-24 13:08 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki,
	linux-mm, linux-kernel, rcu, maple-tree

On Fri, Feb 14, 2025 at 05:27:41PM +0100, Vlastimil Babka wrote:
> Instead of local_lock_irqsave(), use localtry_trylock() when potential
> callers include irq context, and localtry_lock() otherwise (such as when
> we already know the gfp flags allow blocking).
> 
> This should reduce the locking (due to irq disabling/enabling) overhead.
> Failing to use percpu sheaves in an irq due to preempting an already
> locked user of sheaves should be rare so it's a favorable tradeoff.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Looks good to me,
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

-- 
Cheers,
Harry

> ---
>  mm/slub.c | 122 ++++++++++++++++++++++++++++++++++++++------------------------
>  1 file changed, 76 insertions(+), 46 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 40175747212fefb27137309b27571abe8d0966e2..3d7345e7e938d53950ed0d6abe8eb0e93cf8f5b1 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -450,7 +450,7 @@ struct slab_sheaf {
>  };
>  
>  struct slub_percpu_sheaves {
> -	local_lock_t lock;
> +	localtry_lock_t lock;
>  	struct slab_sheaf *main; /* never NULL when unlocked */
>  	struct slab_sheaf *spare; /* empty or full, may be NULL */
>  	struct slab_sheaf *rcu_free;
> @@ -2529,16 +2529,19 @@ static struct slab_sheaf *alloc_full_sheaf(struct kmem_cache *s, gfp_t gfp)
>  
>  static void __kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);
>  
> -static void sheaf_flush_main(struct kmem_cache *s)
> +/* returns true if at least partially flushed */
> +static bool sheaf_flush_main(struct kmem_cache *s)
>  {
>  	struct slub_percpu_sheaves *pcs;
>  	unsigned int batch, remaining;
>  	void *objects[PCS_BATCH_MAX];
>  	struct slab_sheaf *sheaf;
> -	unsigned long flags;
> +	bool ret = false;
>  
>  next_batch:
> -	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> +	if (!localtry_trylock(&s->cpu_sheaves->lock))
> +		return ret;
> +
>  	pcs = this_cpu_ptr(s->cpu_sheaves);
>  	sheaf = pcs->main;
>  
> @@ -2549,14 +2552,18 @@ static void sheaf_flush_main(struct kmem_cache *s)
>  
>  	remaining = sheaf->size;
>  
> -	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +	localtry_unlock(&s->cpu_sheaves->lock);
>  
>  	__kmem_cache_free_bulk(s, batch, &objects[0]);
>  
>  	stat_add(s, SHEAF_FLUSH_MAIN, batch);
>  
> +	ret = true;
> +
>  	if (remaining)
>  		goto next_batch;
> +
> +	return ret;
>  }
>  
>  static void sheaf_flush(struct kmem_cache *s, struct slab_sheaf *sheaf)
> @@ -2593,6 +2600,8 @@ static void rcu_free_sheaf_nobarn(struct rcu_head *head)
>   * Caller needs to make sure migration is disabled in order to fully flush
>   * single cpu's sheaves
>   *
> + * must not be called from an irq
> + *
>   * flushing operations are rare so let's keep it simple and flush to slabs
>   * directly, skipping the barn
>   */
> @@ -2600,9 +2609,8 @@ static void pcs_flush_all(struct kmem_cache *s)
>  {
>  	struct slub_percpu_sheaves *pcs;
>  	struct slab_sheaf *spare, *rcu_free;
> -	unsigned long flags;
>  
> -	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> +	localtry_lock(&s->cpu_sheaves->lock);
>  	pcs = this_cpu_ptr(s->cpu_sheaves);
>  
>  	spare = pcs->spare;
> @@ -2611,7 +2619,7 @@ static void pcs_flush_all(struct kmem_cache *s)
>  	rcu_free = pcs->rcu_free;
>  	pcs->rcu_free = NULL;
>  
> -	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +	localtry_unlock(&s->cpu_sheaves->lock);
>  
>  	if (spare) {
>  		sheaf_flush(s, spare);
> @@ -4554,10 +4562,11 @@ static __fastpath_inline
>  void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
>  {
>  	struct slub_percpu_sheaves *pcs;
> -	unsigned long flags;
>  	void *object;
>  
> -	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> +	if (!localtry_trylock(&s->cpu_sheaves->lock))
> +		return NULL;
> +
>  	pcs = this_cpu_ptr(s->cpu_sheaves);
>  
>  	if (unlikely(pcs->main->size == 0)) {
> @@ -4590,7 +4599,7 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
>  			}
>  		}
>  
> -		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +		localtry_unlock(&s->cpu_sheaves->lock);
>  
>  		if (!can_alloc)
>  			return NULL;
> @@ -4612,7 +4621,11 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
>  		if (!full)
>  			return NULL;
>  
> -		local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> +		/*
> +		 * we can reach here only when gfpflags_allow_blocking
> +		 * so this must not be an irq
> +		 */
> +		localtry_lock(&s->cpu_sheaves->lock);
>  		pcs = this_cpu_ptr(s->cpu_sheaves);
>  
>  		/*
> @@ -4646,7 +4659,7 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
>  do_alloc:
>  	object = pcs->main->objects[--pcs->main->size];
>  
> -	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +	localtry_unlock(&s->cpu_sheaves->lock);
>  
>  	stat(s, ALLOC_PCS);
>  
> @@ -4658,12 +4671,13 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
>  {
>  	struct slub_percpu_sheaves *pcs;
>  	struct slab_sheaf *main;
> -	unsigned long flags;
>  	unsigned int allocated = 0;
>  	unsigned int batch;
>  
>  next_batch:
> -	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> +	if (!localtry_trylock(&s->cpu_sheaves->lock))
> +		return allocated;
> +
>  	pcs = this_cpu_ptr(s->cpu_sheaves);
>  
>  	if (unlikely(pcs->main->size == 0)) {
> @@ -4683,7 +4697,7 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
>  			goto do_alloc;
>  		}
>  
> -		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +		localtry_unlock(&s->cpu_sheaves->lock);
>  
>  		/*
>  		 * Once full sheaves in barn are depleted, let the bulk
> @@ -4701,7 +4715,7 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
>  	main->size -= batch;
>  	memcpy(p, main->objects + main->size, batch * sizeof(void *));
>  
> -	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +	localtry_unlock(&s->cpu_sheaves->lock);
>  
>  	stat_add(s, ALLOC_PCS, batch);
>  
> @@ -5121,13 +5135,14 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>   * The object is expected to have passed slab_free_hook() already.
>   */
>  static __fastpath_inline
> -void free_to_pcs(struct kmem_cache *s, void *object)
> +bool free_to_pcs(struct kmem_cache *s, void *object)
>  {
>  	struct slub_percpu_sheaves *pcs;
> -	unsigned long flags;
>  
>  restart:
> -	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> +	if (!localtry_trylock(&s->cpu_sheaves->lock))
> +		return false;
> +
>  	pcs = this_cpu_ptr(s->cpu_sheaves);
>  
>  	if (unlikely(pcs->main->size == s->sheaf_capacity)) {
> @@ -5162,7 +5177,7 @@ void free_to_pcs(struct kmem_cache *s, void *object)
>  			struct slab_sheaf *to_flush = pcs->spare;
>  
>  			pcs->spare = NULL;
> -			local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +			localtry_unlock(&s->cpu_sheaves->lock);
>  
>  			sheaf_flush(s, to_flush);
>  			empty = to_flush;
> @@ -5170,17 +5185,27 @@ void free_to_pcs(struct kmem_cache *s, void *object)
>  		}
>  
>  alloc_empty:
> -		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +		localtry_unlock(&s->cpu_sheaves->lock);
>  
>  		empty = alloc_empty_sheaf(s, GFP_NOWAIT);
>  
>  		if (!empty) {
> -			sheaf_flush_main(s);
> -			goto restart;
> +			if (sheaf_flush_main(s))
> +				goto restart;
> +			else
> +				return false;
>  		}
>  
>  got_empty:
> -		local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> +		if (!localtry_trylock(&s->cpu_sheaves->lock)) {
> +			struct node_barn *barn;
> +
> +			barn = get_node(s, numa_mem_id())->barn;
> +
> +			barn_put_empty_sheaf(barn, empty, true);
> +			return false;
> +		}
> +
>  		pcs = this_cpu_ptr(s->cpu_sheaves);
>  
>  		/*
> @@ -5209,9 +5234,11 @@ void free_to_pcs(struct kmem_cache *s, void *object)
>  do_free:
>  	pcs->main->objects[pcs->main->size++] = object;
>  
> -	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +	localtry_unlock(&s->cpu_sheaves->lock);
>  
>  	stat(s, FREE_PCS);
> +
> +	return true;
>  }
>  
>  static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
> @@ -5270,9 +5297,10 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
>  {
>  	struct slub_percpu_sheaves *pcs;
>  	struct slab_sheaf *rcu_sheaf;
> -	unsigned long flags;
>  
> -	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> +	if (!localtry_trylock(&s->cpu_sheaves->lock))
> +		goto fail;
> +
>  	pcs = this_cpu_ptr(s->cpu_sheaves);
>  
>  	if (unlikely(!pcs->rcu_free)) {
> @@ -5286,16 +5314,16 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
>  			goto do_free;
>  		}
>  
> -		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +		localtry_unlock(&s->cpu_sheaves->lock);
>  
>  		empty = alloc_empty_sheaf(s, GFP_NOWAIT);
>  
> -		if (!empty) {
> -			stat(s, FREE_RCU_SHEAF_FAIL);
> -			return false;
> -		}
> +		if (!empty)
> +			goto fail;
> +
> +		if (!localtry_trylock(&s->cpu_sheaves->lock))
> +			goto fail;
>  
> -		local_lock_irqsave(&s->cpu_sheaves->lock, flags);
>  		pcs = this_cpu_ptr(s->cpu_sheaves);
>  
>  		if (unlikely(pcs->rcu_free))
> @@ -5311,19 +5339,22 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
>  	rcu_sheaf->objects[rcu_sheaf->size++] = obj;
>  
>  	if (likely(rcu_sheaf->size < s->sheaf_capacity)) {
> -		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +		localtry_unlock(&s->cpu_sheaves->lock);
>  		stat(s, FREE_RCU_SHEAF);
>  		return true;
>  	}
>  
>  	pcs->rcu_free = NULL;
> -	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +	localtry_unlock(&s->cpu_sheaves->lock);
>  
>  	call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);
>  
>  	stat(s, FREE_RCU_SHEAF);
> -
>  	return true;
> +
> +fail:
> +	stat(s, FREE_RCU_SHEAF_FAIL);
> +	return false;
>  }
>  
>  /*
> @@ -5335,7 +5366,6 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
>  {
>  	struct slub_percpu_sheaves *pcs;
>  	struct slab_sheaf *main;
> -	unsigned long flags;
>  	unsigned int batch, i = 0;
>  	bool init;
>  
> @@ -5358,7 +5388,9 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
>  	}
>  
>  next_batch:
> -	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> +	if (!localtry_trylock(&s->cpu_sheaves->lock))
> +		goto fallback;
> +
>  	pcs = this_cpu_ptr(s->cpu_sheaves);
>  
>  	if (unlikely(pcs->main->size == s->sheaf_capacity)) {
> @@ -5389,13 +5421,13 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
>  		}
>  
>  no_empty:
> -		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +		localtry_unlock(&s->cpu_sheaves->lock);
>  
>  		/*
>  		 * if we depleted all empty sheaves in the barn or there are too
>  		 * many full sheaves, free the rest to slab pages
>  		 */
> -
> +fallback:
>  		__kmem_cache_free_bulk(s, size, p);
>  		return;
>  	}
> @@ -5407,7 +5439,7 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
>  	memcpy(main->objects + main->size, p, batch * sizeof(void *));
>  	main->size += batch;
>  
> -	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> +	localtry_unlock(&s->cpu_sheaves->lock);
>  
>  	stat_add(s, FREE_PCS, batch);
>  
> @@ -5507,9 +5539,7 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
>  	if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
>  		return;
>  
> -	if (s->cpu_sheaves)
> -		free_to_pcs(s, object);
> -	else
> +	if (!s->cpu_sheaves || !free_to_pcs(s, object))
>  		do_slab_free(s, slab, object, object, 1, addr);
>  }
>  
> @@ -6288,7 +6318,7 @@ static int init_percpu_sheaves(struct kmem_cache *s)
>  
>  		pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
>  
> -		local_lock_init(&pcs->lock);
> +		localtry_lock_init(&pcs->lock);
>  
>  		nid = cpu_to_mem(cpu);
>  
> 
> -- 
> 2.48.1
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 00/10] SLUB percpu sheaves
  2025-02-24  1:36     ` Suren Baghdasaryan
  2025-02-24  1:43       ` Suren Baghdasaryan
@ 2025-02-24 20:53       ` Vlastimil Babka
  2025-02-24 21:12         ` Suren Baghdasaryan
  1 sibling, 1 reply; 55+ messages in thread
From: Vlastimil Babka @ 2025-02-24 20:53 UTC (permalink / raw)
  To: Suren Baghdasaryan, Kent Overstreet
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, Sebastian Andrzej Siewior,
	Alexei Starovoitov

On 2/24/25 02:36, Suren Baghdasaryan wrote:
> On Sat, Feb 22, 2025 at 8:44 PM Suren Baghdasaryan <surenb@google.com> wrote:
>>
>> Don't know about this particular part but testing sheaves with maple
>> node cache and stress testing mmap/munmap syscalls shows performance
>> benefits as long as there is some delay to let kfree_rcu() do its job.
>> I'm still gathering results and will most likely post them tomorrow.

Without such delay, the perf is same or worse?

> Here are the promised test results:
> 
> First I ran an Android app cycle test comparing the baseline against sheaves
> used for maple tree nodes (as this patchset implements). I registered about
> 3% improvement in app launch times, indicating improvement in mmap syscall
> performance.

There was no artificial 500us delay added for this test, right?

> Next I ran an mmap stress test which maps 5 1-page readable file-backed
> areas, faults them in and finally unmaps them, timing mmap syscalls.
> Repeats that 200000 cycles and reports the total time. Average of 10 such
> runs is used as the final result.
> 3 configurations were tested:
> 
> 1. Sheaves used for maple tree nodes only (this patchset).
> 
> 2. Sheaves used for maple tree nodes with vm_lock to vm_refcnt conversion [1].
> This patchset avoids allocating additional vm_lock structure on each mmap
> syscall and uses TYPESAFE_BY_RCU for vm_area_struct cache.
> 
> 3. Sheaves used for maple tree nodes and for vm_area_struct cache with vm_lock
> to vm_refcnt conversion [1]. For the vm_area_struct cache I had to replace
> TYPESAFE_BY_RCU with sheaves, as we can't use both for the same cache.

Hm why we can't use both? I don't think any kmem_cache_create check makes
them exclusive? TYPESAFE_BY_RCU only affects how slab pages are freed, it
doesn't e.g. delay reuse of individual objects, and caching in a sheaf
doesn't write to the object. Am I missing something?

> The values represent the total time it took to perform mmap syscalls, less is
> better.
> 
> (1)                  baseline       control
> Little core       7.58327       6.614939 (-12.77%)
> Medium core  2.125315     1.428702 (-32.78%)
> Big core          0.514673     0.422948 (-17.82%)
> 
> (2)                  baseline      control
> Little core       7.58327       5.141478 (-32.20%)
> Medium core  2.125315     0.427692 (-79.88%)
> Big core          0.514673    0.046642 (-90.94%)
> 
> (3)                   baseline      control
> Little core        7.58327      4.779624 (-36.97%)
> Medium core   2.125315    0.450368 (-78.81%)
> Big core           0.514673    0.037776 (-92.66%)
> 
> Results in (3) vs (2) indicate that using sheaves for vm_area_struct
> yields slightly better averages and I noticed that this was mostly due
> to sheaves results missing occasional spikes that worsened
> TYPESAFE_BY_RCU averages (the results seemed more stable with
> sheaves).

Thanks a lot, that looks promising!

> [1] https://lore.kernel.org/all/20250213224655.1680278-1-surenb@google.com/
> 



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 00/10] SLUB percpu sheaves
  2025-02-24 20:53       ` Vlastimil Babka
@ 2025-02-24 21:12         ` Suren Baghdasaryan
  2025-02-25 20:26           ` Suren Baghdasaryan
  0 siblings, 1 reply; 55+ messages in thread
From: Suren Baghdasaryan @ 2025-02-24 21:12 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kent Overstreet, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki,
	linux-mm, linux-kernel, rcu, maple-tree,
	Sebastian Andrzej Siewior, Alexei Starovoitov

On Mon, Feb 24, 2025 at 12:53 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 2/24/25 02:36, Suren Baghdasaryan wrote:
> > On Sat, Feb 22, 2025 at 8:44 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >>
> >> Don't know about this particular part but testing sheaves with maple
> >> node cache and stress testing mmap/munmap syscalls shows performance
> >> benefits as long as there is some delay to let kfree_rcu() do its job.
> >> I'm still gathering results and will most likely post them tomorrow.
>
> Without such delay, the perf is same or worse?

The perf is about the same if there is no delay.

>
> > Here are the promised test results:
> >
> > First I ran an Android app cycle test comparing the baseline against sheaves
> > used for maple tree nodes (as this patchset implements). I registered about
> > 3% improvement in app launch times, indicating improvement in mmap syscall
> > performance.
>
> There was no artificial 500us delay added for this test, right?

Correct. No artificial changes in this test.

>
> > Next I ran an mmap stress test which maps 5 1-page readable file-backed
> > areas, faults them in and finally unmaps them, timing mmap syscalls.
> > Repeats that 200000 cycles and reports the total time. Average of 10 such
> > runs is used as the final result.
> > 3 configurations were tested:
> >
> > 1. Sheaves used for maple tree nodes only (this patchset).
> >
> > 2. Sheaves used for maple tree nodes with vm_lock to vm_refcnt conversion [1].
> > This patchset avoids allocating additional vm_lock structure on each mmap
> > syscall and uses TYPESAFE_BY_RCU for vm_area_struct cache.
> >
> > 3. Sheaves used for maple tree nodes and for vm_area_struct cache with vm_lock
> > to vm_refcnt conversion [1]. For the vm_area_struct cache I had to replace
> > TYPESAFE_BY_RCU with sheaves, as we can't use both for the same cache.
>
> Hm why we can't use both? I don't think any kmem_cache_create check makes
> them exclusive? TYPESAFE_BY_RCU only affects how slab pages are freed, it
> doesn't e.g. delay reuse of individual objects, and caching in a sheaf
> doesn't write to the object. Am I missing something?

Ah, I was under impression that to use sheaves I would have to ensure
the freeing happens via kfree_rcu()->kfree_rcu_sheaf() path but now
that you mentioned that, I guess I could keep using kmem_cache_free()
and that would use free_to_pcs() internally... When time comes to free
the page, TYPESAFE_BY_RCU will free it after the grace period.
I can try that combination as well and see if anything breaks.

>
> > The values represent the total time it took to perform mmap syscalls, less is
> > better.
> >
> > (1)                  baseline       control
> > Little core       7.58327       6.614939 (-12.77%)
> > Medium core  2.125315     1.428702 (-32.78%)
> > Big core          0.514673     0.422948 (-17.82%)
> >
> > (2)                  baseline      control
> > Little core       7.58327       5.141478 (-32.20%)
> > Medium core  2.125315     0.427692 (-79.88%)
> > Big core          0.514673    0.046642 (-90.94%)
> >
> > (3)                   baseline      control
> > Little core        7.58327      4.779624 (-36.97%)
> > Medium core   2.125315    0.450368 (-78.81%)
> > Big core           0.514673    0.037776 (-92.66%)
> >
> > Results in (3) vs (2) indicate that using sheaves for vm_area_struct
> > yields slightly better averages and I noticed that this was mostly due
> > to sheaves results missing occasional spikes that worsened
> > TYPESAFE_BY_RCU averages (the results seemed more stable with
> > sheaves).
>
> Thanks a lot, that looks promising!

Indeed, that looks better than I expected :)
Cheers!

>
> > [1] https://lore.kernel.org/all/20250213224655.1680278-1-surenb@google.com/
> >
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 06/10] slab: sheaf prefilling for guaranteed allocations
  2025-02-23  3:54   ` Suren Baghdasaryan
@ 2025-02-25  7:30     ` Harry Yoo
  2025-03-12 17:09     ` Vlastimil Babka
  1 sibling, 0 replies; 55+ messages in thread
From: Harry Yoo @ 2025-02-25  7:30 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Vlastimil Babka, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki,
	linux-mm, linux-kernel, rcu, maple-tree

On Sat, Feb 22, 2025 at 07:54:16PM -0800, Suren Baghdasaryan wrote:
> On Fri, Feb 14, 2025 at 8:27 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > Add functions for efficient guaranteed allocations e.g. in a critical
> > section that cannot sleep, when the exact number of allocations is not
> > known beforehand, but an upper limit can be calculated.
> >
> > kmem_cache_prefill_sheaf() returns a sheaf containing at least given
> > number of objects.
> >
> > kmem_cache_alloc_from_sheaf() will allocate an object from the sheaf
> > and is guaranteed not to fail until depleted.
> >
> > kmem_cache_return_sheaf() is for giving the sheaf back to the slab
> > allocator after the critical section. This will also attempt to refill
> > it to cache's sheaf capacity for better efficiency of sheaves handling,
> > but it's not stricly necessary to succeed.
> >
> > kmem_cache_refill_sheaf() can be used to refill a previously obtained
> > sheaf to requested size. If the current size is sufficient, it does
> > nothing. If the requested size exceeds cache's sheaf_capacity and the
> > sheaf's current capacity, the sheaf will be replaced with a new one,
> > hence the indirect pointer parameter.
> >
> > kmem_cache_sheaf_size() can be used to query the current size.
> >
> > The implementation supports requesting sizes that exceed cache's
> > sheaf_capacity, but it is not efficient - such sheaves are allocated
> > fresh in kmem_cache_prefill_sheaf() and flushed and freed immediately by
> > kmem_cache_return_sheaf(). kmem_cache_refill_sheaf() might be expecially
> 
> s/expecially/especially
> 
> > ineffective when replacing a sheaf with a new one of a larger capacity.
> > It is therefore better to size cache's sheaf_capacity accordingly.
> 
> If support for sizes exceeding sheaf_capacity adds much complexity
> with no performance benefits, I think it would be ok not to support
> them at all. Users know the capacity of a particular kmem_cache, so
> they can use this API only when their needs are within sheaf_capacity,
> otherwise either size the sheaf appropriately or use slab bulk
> allocation.

At least for maple tree, I think the reason why it support varying size
(that may exceed sheaf_capacity) of sheaves is because the upper limit depends
on the store operation maple tree is going to perform, and height of a maple
tree?

Or can we set a single maximum sheaf capacity that works for any
store operation and any height of maple trees?

Liam may have an opinion on it...

> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>

-- 
Cheers,
Harry


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 06/10] slab: sheaf prefilling for guaranteed allocations
  2025-02-14 16:27 ` [PATCH RFC v2 06/10] slab: sheaf prefilling for guaranteed allocations Vlastimil Babka
  2025-02-23  3:54   ` Suren Baghdasaryan
@ 2025-02-25  8:00   ` Harry Yoo
  2025-03-12 18:16     ` Vlastimil Babka
  1 sibling, 1 reply; 55+ messages in thread
From: Harry Yoo @ 2025-02-25  8:00 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki,
	linux-mm, linux-kernel, rcu, maple-tree

On Fri, Feb 14, 2025 at 05:27:42PM +0100, Vlastimil Babka wrote:
> Add functions for efficient guaranteed allocations e.g. in a critical
> section that cannot sleep, when the exact number of allocations is not
> known beforehand, but an upper limit can be calculated.
> 
> kmem_cache_prefill_sheaf() returns a sheaf containing at least given
> number of objects.
> 
> kmem_cache_alloc_from_sheaf() will allocate an object from the sheaf
> and is guaranteed not to fail until depleted.
> 
> kmem_cache_return_sheaf() is for giving the sheaf back to the slab
> allocator after the critical section. This will also attempt to refill
> it to cache's sheaf capacity for better efficiency of sheaves handling,
> but it's not stricly necessary to succeed.
> 
> kmem_cache_refill_sheaf() can be used to refill a previously obtained
> sheaf to requested size. If the current size is sufficient, it does
> nothing. If the requested size exceeds cache's sheaf_capacity and the
> sheaf's current capacity, the sheaf will be replaced with a new one,
> hence the indirect pointer parameter.
> 
> kmem_cache_sheaf_size() can be used to query the current size.
> 
> The implementation supports requesting sizes that exceed cache's
> sheaf_capacity, but it is not efficient - such sheaves are allocated
> fresh in kmem_cache_prefill_sheaf() and flushed and freed immediately by
> kmem_cache_return_sheaf(). kmem_cache_refill_sheaf() might be expecially
> ineffective when replacing a sheaf with a new one of a larger capacity.
> It is therefore better to size cache's sheaf_capacity accordingly.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  include/linux/slab.h |  16 ++++
>  mm/slub.c            | 227 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 243 insertions(+)

[... snip ... ]

> @@ -4831,6 +4857,207 @@ void *kmem_cache_alloc_node_noprof(struct kmem_cache *s, gfp_t gfpflags, int nod
>  }
>  EXPORT_SYMBOL(kmem_cache_alloc_node_noprof);
>  
> +
> +/*
> + * returns a sheaf that has least the requested size
> + * when prefilling is needed, do so with given gfp flags
> + *
> + * return NULL if sheaf allocation or prefilling failed
> + */
> +struct slab_sheaf *
> +kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
> +{
> +	struct slub_percpu_sheaves *pcs;
> +	struct slab_sheaf *sheaf = NULL;
> +
> +	if (unlikely(size > s->sheaf_capacity)) {
> +		sheaf = kzalloc(struct_size(sheaf, objects, size), gfp);
> +		if (!sheaf)
> +			return NULL;
> +
> +		sheaf->cache = s;
> +		sheaf->capacity = size;
> +
> +		if (!__kmem_cache_alloc_bulk(s, gfp, size,
> +					     &sheaf->objects[0])) {
> +			kfree(sheaf);
> +			return NULL;
> +		}
> +
> +		sheaf->size = size;
> +
> +		return sheaf;
> +	}
> +
> +	localtry_lock(&s->cpu_sheaves->lock);
> +	pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +	if (pcs->spare) {
> +		sheaf = pcs->spare;
> +		pcs->spare = NULL;
> +	}
> +
> +	if (!sheaf)
> +		sheaf = barn_get_full_or_empty_sheaf(pcs->barn);

Can this be outside localtry lock?

> +
> +	localtry_unlock(&s->cpu_sheaves->lock);
> +
> +	if (!sheaf) {
> +		sheaf = alloc_empty_sheaf(s, gfp);
> +	}
> +
> +	if (sheaf && sheaf->size < size) {
> +		if (refill_sheaf(s, sheaf, gfp)) {
> +			sheaf_flush(s, sheaf);
> +			free_empty_sheaf(s, sheaf);
> +			sheaf = NULL;
> +		}
> +	}
> +
> +	if (sheaf)
> +		sheaf->capacity = s->sheaf_capacity;
> +
> +	return sheaf;
> +}
> +
> +/*
> + * Use this to return a sheaf obtained by kmem_cache_prefill_sheaf()
> + * It tries to refill the sheaf back to the cache's sheaf_capacity
> + * to avoid handling partially full sheaves.
> + *
> + * If the refill fails because gfp is e.g. GFP_NOWAIT, the sheaf is
> + * instead dissolved
> + */
> +void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
> +			     struct slab_sheaf *sheaf)
> +{
> +	struct slub_percpu_sheaves *pcs;
> +	bool refill = false;
> +	struct node_barn *barn;
> +
> +	if (unlikely(sheaf->capacity != s->sheaf_capacity)) {
> +		sheaf_flush(s, sheaf);
> +		kfree(sheaf);
> +		return;
> +	}
> +
> +	localtry_lock(&s->cpu_sheaves->lock);
> +	pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +	if (!pcs->spare) {
> +		pcs->spare = sheaf;
> +		sheaf = NULL;
> +	} else if (pcs->barn->nr_full >= MAX_FULL_SHEAVES) {

Did you mean (pcs->barn->nr_full < MAX_FULL_SHEAVES)?

Otherwise looks good to me.

-- 
Cheers,
Harry

> +		/* racy check */
> +		barn = pcs->barn;
> +		refill = true;
> +	}
> +
> +	localtry_unlock(&s->cpu_sheaves->lock);
> +
> +	if (!sheaf)
> +		return;
> +
> +	/*
> +	 * if the barn is full of full sheaves or we fail to refill the sheaf,
> +	 * simply flush and free it
> +	 */
> +	if (!refill || refill_sheaf(s, sheaf, gfp)) {
> +		sheaf_flush(s, sheaf);
> +		free_empty_sheaf(s, sheaf);
> +		return;
> +	}
> +
> +	/* we racily determined the sheaf would fit, so now force it */
> +	barn_put_full_sheaf(barn, sheaf, true);
> +}


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 07/10] slab: determine barn status racily outside of lock
  2025-02-14 16:27 ` [PATCH RFC v2 07/10] slab: determine barn status racily outside of lock Vlastimil Babka
  2025-02-23  4:00   ` Suren Baghdasaryan
@ 2025-02-25  8:54   ` Harry Yoo
  2025-03-12 18:23     ` Vlastimil Babka
  1 sibling, 1 reply; 55+ messages in thread
From: Harry Yoo @ 2025-02-25  8:54 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki,
	linux-mm, linux-kernel, rcu, maple-tree

On Fri, Feb 14, 2025 at 05:27:43PM +0100, Vlastimil Babka wrote:
> The possibility of many barn operations is determined by the current
> number of full or empty sheaves. Taking the barn->lock just to find out
> that e.g. there are no empty sheaves results in unnecessary overhead and
> lock contention. Thus perform these checks outside of the lock with a
> data_race() annotated variable read and fail quickly without taking the
> lock.
> 
> Checks for sheaf availability that racily succeed have to be obviously
> repeated under the lock for correctness, but we can skip repeating
> checks if there are too many sheaves on the given list as the limits
> don't need to be strict.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Looks good to me,
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

in kmem_cache_return_sheaf:
> if (!pcs->spare) {                                                      
> 	pcs->spare = sheaf;                                             
>	sheaf = NULL;                                                   
> } else if (pcs->barn->nr_full >= MAX_FULL_SHEAVES) {                    
>	/* racy check */                                                
>	barn = pcs->barn;                                               
>	keep = true;                                                    
> }  

By the way this code also needs data_race()?

-- 
Cheers,
Harry

> ---
>  mm/slub.c | 57 ++++++++++++++++++++++++++++++++++-----------------------
>  1 file changed, 34 insertions(+), 23 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index c1df7cf22267f28f743404531bef921e25fac086..72e6437f1d74bfacbb1cd7642af42929c48cc66a 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2685,9 +2685,12 @@ static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn)
>  	struct slab_sheaf *empty = NULL;
>  	unsigned long flags;
>  
> +	if (!data_race(barn->nr_empty))
> +		return NULL;
> +
>  	spin_lock_irqsave(&barn->lock, flags);
>  
> -	if (barn->nr_empty) {
> +	if (likely(barn->nr_empty)) {
>  		empty = list_first_entry(&barn->sheaves_empty,
>  					 struct slab_sheaf, barn_list);
>  		list_del(&empty->barn_list);
> @@ -2703,38 +2706,36 @@ static int barn_put_empty_sheaf(struct node_barn *barn,
>  				struct slab_sheaf *sheaf, bool ignore_limit)
>  {
>  	unsigned long flags;
> -	int ret = 0;
> +
> +	/* we don't repeat the check under barn->lock as it's not critical */
> +	if (!ignore_limit && data_race(barn->nr_empty) >= MAX_EMPTY_SHEAVES)
> +		return -E2BIG;
>  
>  	spin_lock_irqsave(&barn->lock, flags);
>  
> -	if (!ignore_limit && barn->nr_empty >= MAX_EMPTY_SHEAVES) {
> -		ret = -E2BIG;
> -	} else {
> -		list_add(&sheaf->barn_list, &barn->sheaves_empty);
> -		barn->nr_empty++;
> -	}
> +	list_add(&sheaf->barn_list, &barn->sheaves_empty);
> +	barn->nr_empty++;
>  
>  	spin_unlock_irqrestore(&barn->lock, flags);
> -	return ret;
> +	return 0;
>  }
>  
>  static int barn_put_full_sheaf(struct node_barn *barn, struct slab_sheaf *sheaf,
>  			       bool ignore_limit)
>  {
>  	unsigned long flags;
> -	int ret = 0;
> +
> +	/* we don't repeat the check under barn->lock as it's not critical */
> +	if (!ignore_limit && data_race(barn->nr_full) >= MAX_FULL_SHEAVES)
> +		return -E2BIG;
>  
>  	spin_lock_irqsave(&barn->lock, flags);
>  
> -	if (!ignore_limit && barn->nr_full >= MAX_FULL_SHEAVES) {
> -		ret = -E2BIG;
> -	} else {
> -		list_add(&sheaf->barn_list, &barn->sheaves_full);
> -		barn->nr_full++;
> -	}
> +	list_add(&sheaf->barn_list, &barn->sheaves_full);
> +	barn->nr_full++;
>  
>  	spin_unlock_irqrestore(&barn->lock, flags);
> -	return ret;
> +	return 0;
>  }
>  
>  static struct slab_sheaf *barn_get_full_or_empty_sheaf(struct node_barn *barn)
> @@ -2742,6 +2743,9 @@ static struct slab_sheaf *barn_get_full_or_empty_sheaf(struct node_barn *barn)
>  	struct slab_sheaf *sheaf = NULL;
>  	unsigned long flags;
>  
> +	if (!data_race(barn->nr_full) && !data_race(barn->nr_empty))
> +		return NULL;
> +
>  	spin_lock_irqsave(&barn->lock, flags);
>  
>  	if (barn->nr_full) {
> @@ -2772,9 +2776,12 @@ barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty)
>  	struct slab_sheaf *full = NULL;
>  	unsigned long flags;
>  
> +	if (!data_race(barn->nr_full))
> +		return NULL;
> +
>  	spin_lock_irqsave(&barn->lock, flags);
>  
> -	if (barn->nr_full) {
> +	if (likely(barn->nr_full)) {
>  		full = list_first_entry(&barn->sheaves_full, struct slab_sheaf,
>  					barn_list);
>  		list_del(&full->barn_list);
> @@ -2797,19 +2804,23 @@ barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
>  	struct slab_sheaf *empty;
>  	unsigned long flags;
>  
> +	/* we don't repeat this check under barn->lock as it's not critical */
> +	if (data_race(barn->nr_full) >= MAX_FULL_SHEAVES)
> +		return ERR_PTR(-E2BIG);
> +	if (!data_race(barn->nr_empty))
> +		return ERR_PTR(-ENOMEM);
> +
>  	spin_lock_irqsave(&barn->lock, flags);
>  
> -	if (barn->nr_full >= MAX_FULL_SHEAVES) {
> -		empty = ERR_PTR(-E2BIG);
> -	} else if (!barn->nr_empty) {
> -		empty = ERR_PTR(-ENOMEM);
> -	} else {
> +	if (likely(barn->nr_empty)) {
>  		empty = list_first_entry(&barn->sheaves_empty, struct slab_sheaf,
>  					 barn_list);
>  		list_del(&empty->barn_list);
>  		list_add(&full->barn_list, &barn->sheaves_full);
>  		barn->nr_empty--;
>  		barn->nr_full++;
> +	} else {
> +		empty = ERR_PTR(-ENOMEM);
>  	}
>  
>  	spin_unlock_irqrestore(&barn->lock, flags);
> 
> -- 
> 2.48.1
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 00/10] SLUB percpu sheaves
  2025-02-24 21:12         ` Suren Baghdasaryan
@ 2025-02-25 20:26           ` Suren Baghdasaryan
  2025-03-04 10:54             ` Vlastimil Babka
  0 siblings, 1 reply; 55+ messages in thread
From: Suren Baghdasaryan @ 2025-02-25 20:26 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kent Overstreet, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki,
	linux-mm, linux-kernel, rcu, maple-tree,
	Sebastian Andrzej Siewior, Alexei Starovoitov

On Mon, Feb 24, 2025 at 1:12 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, Feb 24, 2025 at 12:53 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 2/24/25 02:36, Suren Baghdasaryan wrote:
> > > On Sat, Feb 22, 2025 at 8:44 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >>
> > >> Don't know about this particular part but testing sheaves with maple
> > >> node cache and stress testing mmap/munmap syscalls shows performance
> > >> benefits as long as there is some delay to let kfree_rcu() do its job.
> > >> I'm still gathering results and will most likely post them tomorrow.
> >
> > Without such delay, the perf is same or worse?
>
> The perf is about the same if there is no delay.
>
> >
> > > Here are the promised test results:
> > >
> > > First I ran an Android app cycle test comparing the baseline against sheaves
> > > used for maple tree nodes (as this patchset implements). I registered about
> > > 3% improvement in app launch times, indicating improvement in mmap syscall
> > > performance.
> >
> > There was no artificial 500us delay added for this test, right?
>
> Correct. No artificial changes in this test.
>
> >
> > > Next I ran an mmap stress test which maps 5 1-page readable file-backed
> > > areas, faults them in and finally unmaps them, timing mmap syscalls.
> > > Repeats that 200000 cycles and reports the total time. Average of 10 such
> > > runs is used as the final result.
> > > 3 configurations were tested:
> > >
> > > 1. Sheaves used for maple tree nodes only (this patchset).
> > >
> > > 2. Sheaves used for maple tree nodes with vm_lock to vm_refcnt conversion [1].
> > > This patchset avoids allocating additional vm_lock structure on each mmap
> > > syscall and uses TYPESAFE_BY_RCU for vm_area_struct cache.
> > >
> > > 3. Sheaves used for maple tree nodes and for vm_area_struct cache with vm_lock
> > > to vm_refcnt conversion [1]. For the vm_area_struct cache I had to replace
> > > TYPESAFE_BY_RCU with sheaves, as we can't use both for the same cache.
> >
> > Hm why we can't use both? I don't think any kmem_cache_create check makes
> > them exclusive? TYPESAFE_BY_RCU only affects how slab pages are freed, it
> > doesn't e.g. delay reuse of individual objects, and caching in a sheaf
> > doesn't write to the object. Am I missing something?
>
> Ah, I was under impression that to use sheaves I would have to ensure
> the freeing happens via kfree_rcu()->kfree_rcu_sheaf() path but now
> that you mentioned that, I guess I could keep using kmem_cache_free()
> and that would use free_to_pcs() internally... When time comes to free
> the page, TYPESAFE_BY_RCU will free it after the grace period.
> I can try that combination as well and see if anything breaks.

This seems to be working fine. The new configuration is:

4. Sheaves used for maple tree nodes and for vm_area_struct cache with
vm_lock to vm_refcnt conversion [1]. vm_area_struct cache uses both
TYPESAFE_BY_RCU and sheaves (but obviously not kfree_rcu_sheaf()).

>
> >
> > > The values represent the total time it took to perform mmap syscalls, less is
> > > better.
> > >
> > > (1)                  baseline       control
> > > Little core       7.58327       6.614939 (-12.77%)
> > > Medium core  2.125315     1.428702 (-32.78%)
> > > Big core          0.514673     0.422948 (-17.82%)
> > >
> > > (2)                  baseline      control
> > > Little core       7.58327       5.141478 (-32.20%)
> > > Medium core  2.125315     0.427692 (-79.88%)
> > > Big core          0.514673    0.046642 (-90.94%)
> > >
> > > (3)                   baseline      control
> > > Little core        7.58327      4.779624 (-36.97%)
> > > Medium core   2.125315    0.450368 (-78.81%)
> > > Big core           0.514673    0.037776 (-92.66%)

(4)                   baseline      control
Little core        7.58327      4.642977 (-38.77%)
Medium core   2.125315    0.373692 (-82.42%)
Big core           0.514673    0.043613 (-91.53%)

I think the difference between (3) and (4) is noise.
Thanks,
Suren.

> > >
> > > Results in (3) vs (2) indicate that using sheaves for vm_area_struct
> > > yields slightly better averages and I noticed that this was mostly due
> > > to sheaves results missing occasional spikes that worsened
> > > TYPESAFE_BY_RCU averages (the results seemed more stable with
> > > sheaves).
> >
> > Thanks a lot, that looks promising!
>
> Indeed, that looks better than I expected :)
> Cheers!
>
> >
> > > [1] https://lore.kernel.org/all/20250213224655.1680278-1-surenb@google.com/
> > >
> >


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 03/10] locking/local_lock: Introduce localtry_lock_t
  2025-02-14 16:27 ` [PATCH RFC v2 03/10] locking/local_lock: Introduce localtry_lock_t Vlastimil Babka
  2025-02-17 14:19   ` Sebastian Andrzej Siewior
@ 2025-02-26 17:00   ` Davidlohr Bueso
  2025-02-26 17:15     ` Alexei Starovoitov
  1 sibling, 1 reply; 55+ messages in thread
From: Davidlohr Bueso @ 2025-02-26 17:00 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki,
	linux-mm, linux-kernel, rcu, maple-tree,
	Sebastian Andrzej Siewior, Alexei Starovoitov

On Fri, 14 Feb 2025, Vlastimil Babka wrote:

>From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>
>In !PREEMPT_RT local_lock_irqsave() disables interrupts to protect
>critical section, but it doesn't prevent NMI, so the fully reentrant
>code cannot use local_lock_irqsave() for exclusive access.
>
>Introduce localtry_lock_t and localtry_lock_irqsave() that
>disables interrupts and sets acquired=1, so localtry_lock_irqsave()
>from NMI attempting to acquire the same lock will return false.
>
>In PREEMPT_RT local_lock_irqsave() maps to preemptible spin_lock().
>Map localtry_lock_irqsave() to preemptible spin_trylock().
>When in hard IRQ or NMI return false right away, since
>spin_trylock() is not safe due to PI issues.
>
>Note there is no need to use local_inc for acquired variable,
>since it's a percpu variable with strict nesting scopes.
>

LGTM.

Acked-by: Davidlohr Bueso <dave@stgolabs.net>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 03/10] locking/local_lock: Introduce localtry_lock_t
  2025-02-26 17:00   ` Davidlohr Bueso
@ 2025-02-26 17:15     ` Alexei Starovoitov
  2025-02-26 19:28       ` Davidlohr Bueso
  0 siblings, 1 reply; 55+ messages in thread
From: Alexei Starovoitov @ 2025-02-26 17:15 UTC (permalink / raw)
  To: Vlastimil Babka, Suren Baghdasaryan, Liam R. Howlett,
	Christoph Lameter, David Rientjes, Roman Gushchin, Hyeonggon Yoo,
	Uladzislau Rezki, linux-mm, LKML, rcu, maple-tree,
	Sebastian Andrzej Siewior, Alexei Starovoitov

On Wed, Feb 26, 2025 at 9:01 AM Davidlohr Bueso <dave@stgolabs.net> wrote:
>
> On Fri, 14 Feb 2025, Vlastimil Babka wrote:
>
> >From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> >
> >In !PREEMPT_RT local_lock_irqsave() disables interrupts to protect
> >critical section, but it doesn't prevent NMI, so the fully reentrant
> >code cannot use local_lock_irqsave() for exclusive access.
> >
> >Introduce localtry_lock_t and localtry_lock_irqsave() that
> >disables interrupts and sets acquired=1, so localtry_lock_irqsave()
> >from NMI attempting to acquire the same lock will return false.
> >
> >In PREEMPT_RT local_lock_irqsave() maps to preemptible spin_lock().
> >Map localtry_lock_irqsave() to preemptible spin_trylock().
> >When in hard IRQ or NMI return false right away, since
> >spin_trylock() is not safe due to PI issues.
> >
> >Note there is no need to use local_inc for acquired variable,
> >since it's a percpu variable with strict nesting scopes.
> >
>
> LGTM.
>
> Acked-by: Davidlohr Bueso <dave@stgolabs.net>

Thanks for the review.
Do you mind if I apply your ack to the latest version of this patch?
https://lore.kernel.org/bpf/20250222024427.30294-2-alexei.starovoitov@gmail.com/


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 03/10] locking/local_lock: Introduce localtry_lock_t
  2025-02-26 17:15     ` Alexei Starovoitov
@ 2025-02-26 19:28       ` Davidlohr Bueso
  0 siblings, 0 replies; 55+ messages in thread
From: Davidlohr Bueso @ 2025-02-26 19:28 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Vlastimil Babka, Suren Baghdasaryan, Liam R. Howlett,
	Christoph Lameter, David Rientjes, Roman Gushchin, Hyeonggon Yoo,
	Uladzislau Rezki, linux-mm, LKML, rcu, maple-tree,
	Sebastian Andrzej Siewior, Alexei Starovoitov

On Wed, 26 Feb 2025, Alexei Starovoitov wrote:

>On Wed, Feb 26, 2025 at 9:01???AM Davidlohr Bueso <dave@stgolabs.net> wrote:
>>
>> On Fri, 14 Feb 2025, Vlastimil Babka wrote:
>>
>> >From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>> >
>> >In !PREEMPT_RT local_lock_irqsave() disables interrupts to protect
>> >critical section, but it doesn't prevent NMI, so the fully reentrant
>> >code cannot use local_lock_irqsave() for exclusive access.
>> >
>> >Introduce localtry_lock_t and localtry_lock_irqsave() that
>> >disables interrupts and sets acquired=1, so localtry_lock_irqsave()
>> >from NMI attempting to acquire the same lock will return false.
>> >
>> >In PREEMPT_RT local_lock_irqsave() maps to preemptible spin_lock().
>> >Map localtry_lock_irqsave() to preemptible spin_trylock().
>> >When in hard IRQ or NMI return false right away, since
>> >spin_trylock() is not safe due to PI issues.
>> >
>> >Note there is no need to use local_inc for acquired variable,
>> >since it's a percpu variable with strict nesting scopes.
>> >
>>
>> LGTM.
>>
>> Acked-by: Davidlohr Bueso <dave@stgolabs.net>
>
>Thanks for the review.
>Do you mind if I apply your ack to the latest version of this patch?
>https://lore.kernel.org/bpf/20250222024427.30294-2-alexei.starovoitov@gmail.com/

Yes, that is fine.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 00/10] SLUB percpu sheaves
  2025-02-25 20:26           ` Suren Baghdasaryan
@ 2025-03-04 10:54             ` Vlastimil Babka
  2025-03-04 18:35               ` Suren Baghdasaryan
  2025-03-04 19:08               ` Liam R. Howlett
  0 siblings, 2 replies; 55+ messages in thread
From: Vlastimil Babka @ 2025-03-04 10:54 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Kent Overstreet, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki,
	linux-mm, linux-kernel, rcu, maple-tree,
	Sebastian Andrzej Siewior, Alexei Starovoitov

On 2/25/25 21:26, Suren Baghdasaryan wrote:
> On Mon, Feb 24, 2025 at 1:12 PM Suren Baghdasaryan <surenb@google.com> wrote:
>>
>> >
>> > > The values represent the total time it took to perform mmap syscalls, less is
>> > > better.
>> > >
>> > > (1)                  baseline       control
>> > > Little core       7.58327       6.614939 (-12.77%)
>> > > Medium core  2.125315     1.428702 (-32.78%)
>> > > Big core          0.514673     0.422948 (-17.82%)
>> > >
>> > > (2)                  baseline      control
>> > > Little core       7.58327       5.141478 (-32.20%)
>> > > Medium core  2.125315     0.427692 (-79.88%)
>> > > Big core          0.514673    0.046642 (-90.94%)
>> > >
>> > > (3)                   baseline      control
>> > > Little core        7.58327      4.779624 (-36.97%)
>> > > Medium core   2.125315    0.450368 (-78.81%)
>> > > Big core           0.514673    0.037776 (-92.66%)
> 
> (4)                   baseline      control
> Little core        7.58327      4.642977 (-38.77%)
> Medium core   2.125315    0.373692 (-82.42%)
> Big core           0.514673    0.043613 (-91.53%)
> 
> I think the difference between (3) and (4) is noise.
> Thanks,
> Suren.

Hi, as we discussed yesterday, it would be useful to set the baseline to
include everything before sheaves as that's already on the way to 6.15, so
we can see more clearly what sheaves do relative to that. So at this point
it's the vma lock conversion including TYPESAFE_BY_RCU (that's not undone,
thus like in scenario (4)), and benchmark the following:

- baseline - vma locking conversion with TYPESAFE_BY_RCU
- baseline+maple tree node reduction from mm-unstable (Liam might point out
which patches?)
- the above + this series + sheaves enabled for vm_area_struct cache
- the above + full maple node sheaves conversion [1]
- the above + the top-most patches from [1] that are optimizations with a
tradeoff (not clear win-win) so it would be good to know if they are useful

[1] currently the 4 commits here:
https://web.git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-percpu-sheaves-v2-maple
from "maple_tree: Sheaf conversion" to "maple_tree: Clean up sheaf"
but as Liam noted, they won't cherry pick without conflict once maple tree
node reduction is backported, but he's working on a rebase

Thanks in advance!

>> > >
>> > > Results in (3) vs (2) indicate that using sheaves for vm_area_struct
>> > > yields slightly better averages and I noticed that this was mostly due
>> > > to sheaves results missing occasional spikes that worsened
>> > > TYPESAFE_BY_RCU averages (the results seemed more stable with
>> > > sheaves).
>> >
>> > Thanks a lot, that looks promising!
>>
>> Indeed, that looks better than I expected :)
>> Cheers!
>>
>> >
>> > > [1] https://lore.kernel.org/all/20250213224655.1680278-1-surenb@google.com/
>> > >
>> >



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 00/10] SLUB percpu sheaves
  2025-03-04 10:54             ` Vlastimil Babka
@ 2025-03-04 18:35               ` Suren Baghdasaryan
  2025-03-04 19:08               ` Liam R. Howlett
  1 sibling, 0 replies; 55+ messages in thread
From: Suren Baghdasaryan @ 2025-03-04 18:35 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kent Overstreet, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki,
	linux-mm, linux-kernel, rcu, maple-tree,
	Sebastian Andrzej Siewior, Alexei Starovoitov

On Tue, Mar 4, 2025 at 2:55 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 2/25/25 21:26, Suren Baghdasaryan wrote:
> > On Mon, Feb 24, 2025 at 1:12 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >>
> >> >
> >> > > The values represent the total time it took to perform mmap syscalls, less is
> >> > > better.
> >> > >
> >> > > (1)                  baseline       control
> >> > > Little core       7.58327       6.614939 (-12.77%)
> >> > > Medium core  2.125315     1.428702 (-32.78%)
> >> > > Big core          0.514673     0.422948 (-17.82%)
> >> > >
> >> > > (2)                  baseline      control
> >> > > Little core       7.58327       5.141478 (-32.20%)
> >> > > Medium core  2.125315     0.427692 (-79.88%)
> >> > > Big core          0.514673    0.046642 (-90.94%)
> >> > >
> >> > > (3)                   baseline      control
> >> > > Little core        7.58327      4.779624 (-36.97%)
> >> > > Medium core   2.125315    0.450368 (-78.81%)
> >> > > Big core           0.514673    0.037776 (-92.66%)
> >
> > (4)                   baseline      control
> > Little core        7.58327      4.642977 (-38.77%)
> > Medium core   2.125315    0.373692 (-82.42%)
> > Big core           0.514673    0.043613 (-91.53%)
> >
> > I think the difference between (3) and (4) is noise.
> > Thanks,
> > Suren.
>
> Hi, as we discussed yesterday, it would be useful to set the baseline to
> include everything before sheaves as that's already on the way to 6.15, so
> we can see more clearly what sheaves do relative to that. So at this point
> it's the vma lock conversion including TYPESAFE_BY_RCU (that's not undone,
> thus like in scenario (4)), and benchmark the following:
>
> - baseline - vma locking conversion with TYPESAFE_BY_RCU
> - baseline+maple tree node reduction from mm-unstable (Liam might point out
> which patches?)
> - the above + this series + sheaves enabled for vm_area_struct cache
> - the above + full maple node sheaves conversion [1]
> - the above + the top-most patches from [1] that are optimizations with a
> tradeoff (not clear win-win) so it would be good to know if they are useful
>
> [1] currently the 4 commits here:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-percpu-sheaves-v2-maple
> from "maple_tree: Sheaf conversion" to "maple_tree: Clean up sheaf"
> but as Liam noted, they won't cherry pick without conflict once maple tree
> node reduction is backported, but he's working on a rebase
>
> Thanks in advance!

Sure, I'll run the tests and post results sometime later this week.
Thanks!

>
> >> > >
> >> > > Results in (3) vs (2) indicate that using sheaves for vm_area_struct
> >> > > yields slightly better averages and I noticed that this was mostly due
> >> > > to sheaves results missing occasional spikes that worsened
> >> > > TYPESAFE_BY_RCU averages (the results seemed more stable with
> >> > > sheaves).
> >> >
> >> > Thanks a lot, that looks promising!
> >>
> >> Indeed, that looks better than I expected :)
> >> Cheers!
> >>
> >> >
> >> > > [1] https://lore.kernel.org/all/20250213224655.1680278-1-surenb@google.com/
> >> > >
> >> >
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 00/10] SLUB percpu sheaves
  2025-03-04 10:54             ` Vlastimil Babka
  2025-03-04 18:35               ` Suren Baghdasaryan
@ 2025-03-04 19:08               ` Liam R. Howlett
  2025-03-14 17:10                 ` Suren Baghdasaryan
  1 sibling, 1 reply; 55+ messages in thread
From: Liam R. Howlett @ 2025-03-04 19:08 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Kent Overstreet, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki,
	linux-mm, linux-kernel, rcu, maple-tree,
	Sebastian Andrzej Siewior, Alexei Starovoitov

* Vlastimil Babka <vbabka@suse.cz> [250304 05:55]:
> On 2/25/25 21:26, Suren Baghdasaryan wrote:
> > On Mon, Feb 24, 2025 at 1:12 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >>
> >> >
> >> > > The values represent the total time it took to perform mmap syscalls, less is
> >> > > better.
> >> > >
> >> > > (1)                  baseline       control
> >> > > Little core       7.58327       6.614939 (-12.77%)
> >> > > Medium core  2.125315     1.428702 (-32.78%)
> >> > > Big core          0.514673     0.422948 (-17.82%)
> >> > >
> >> > > (2)                  baseline      control
> >> > > Little core       7.58327       5.141478 (-32.20%)
> >> > > Medium core  2.125315     0.427692 (-79.88%)
> >> > > Big core          0.514673    0.046642 (-90.94%)
> >> > >
> >> > > (3)                   baseline      control
> >> > > Little core        7.58327      4.779624 (-36.97%)
> >> > > Medium core   2.125315    0.450368 (-78.81%)
> >> > > Big core           0.514673    0.037776 (-92.66%)
> > 
> > (4)                   baseline      control
> > Little core        7.58327      4.642977 (-38.77%)
> > Medium core   2.125315    0.373692 (-82.42%)
> > Big core           0.514673    0.043613 (-91.53%)
> > 
> > I think the difference between (3) and (4) is noise.
> > Thanks,
> > Suren.
> 
> Hi, as we discussed yesterday, it would be useful to set the baseline to
> include everything before sheaves as that's already on the way to 6.15, so
> we can see more clearly what sheaves do relative to that. So at this point
> it's the vma lock conversion including TYPESAFE_BY_RCU (that's not undone,
> thus like in scenario (4)), and benchmark the following:
> 
> - baseline - vma locking conversion with TYPESAFE_BY_RCU
> - baseline+maple tree node reduction from mm-unstable (Liam might point out
> which patches?)

Sid's patches [1] are already in mm-unstable.


> - the above + this series + sheaves enabled for vm_area_struct cache
> - the above + full maple node sheaves conversion [1]
> - the above + the top-most patches from [1] that are optimizations with a
> tradeoff (not clear win-win) so it would be good to know if they are useful
> 
> [1] currently the 4 commits here:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-percpu-sheaves-v2-maple
> from "maple_tree: Sheaf conversion" to "maple_tree: Clean up sheaf"
> but as Liam noted, they won't cherry pick without conflict once maple tree
> node reduction is backported, but he's working on a rebase

Rebased maple tree sheaves, patches are here [2].

> 
> 
...

Thanks,
Liam

[1]. https://lore.kernel.org/linux-mm/20250227204823.758784-1-sidhartha.kumar@oracle.com/
[2]. https://www.infradead.org/git/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/sheaves_rebase_20250304


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 01/10] slab: add opt-in caching layer of percpu sheaves
  2025-02-22 22:46   ` Suren Baghdasaryan
  2025-02-22 22:56     ` Suren Baghdasaryan
@ 2025-03-12 14:57     ` Vlastimil Babka
  2025-03-12 15:14       ` Suren Baghdasaryan
  1 sibling, 1 reply; 55+ messages in thread
From: Vlastimil Babka @ 2025-03-12 14:57 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On 2/22/25 23:46, Suren Baghdasaryan wrote:
> On Fri, Feb 14, 2025 at 8:27 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> Specifying a non-zero value for a new struct kmem_cache_args field
>> sheaf_capacity will setup a caching layer of percpu arrays called
>> sheaves of given capacity for the created cache.
>>
>> Allocations from the cache will allocate via the percpu sheaves (main or
>> spare) as long as they have no NUMA node preference. Frees will also
>> refill one of the sheaves.
>>
>> When both percpu sheaves are found empty during an allocation, an empty
>> sheaf may be replaced with a full one from the per-node barn. If none
>> are available and the allocation is allowed to block, an empty sheaf is
>> refilled from slab(s) by an internal bulk alloc operation. When both
>> percpu sheaves are full during freeing, the barn can replace a full one
>> with an empty one, unless over a full sheaves limit. In that case a
>> sheaf is flushed to slab(s) by an internal bulk free operation. Flushing
>> sheaves and barns is also wired to the existing cpu flushing and cache
>> shrinking operations.
>>
>> The sheaves do not distinguish NUMA locality of the cached objects. If
>> an allocation is requested with kmem_cache_alloc_node() with a specific
>> node (not NUMA_NO_NODE), sheaves are bypassed.
>>
>> The bulk operations exposed to slab users also try to utilize the
>> sheaves as long as the necessary (full or empty) sheaves are available
>> on the cpu or in the barn. Once depleted, they will fallback to bulk
>> alloc/free to slabs directly to avoid double copying.
>>
>> Sysfs stat counters alloc_cpu_sheaf and free_cpu_sheaf count objects
>> allocated or freed using the sheaves. Counters sheaf_refill,
>> sheaf_flush_main and sheaf_flush_other count objects filled or flushed
>> from or to slab pages, and can be used to assess how effective the
>> caching is. The refill and flush operations will also count towards the
>> usual alloc_fastpath/slowpath, free_fastpath/slowpath and other
>> counters.
>>
>> Access to the percpu sheaves is protected by local_lock_irqsave()
>> operations, each per-NUMA-node barn has a spin_lock.
>>
>> A current limitation is that when slub_debug is enabled for a cache with
>> percpu sheaves, the objects in the array are considered as allocated from
>> the slub_debug perspective, and the alloc/free debugging hooks occur
>> when moving the objects between the array and slab pages. This means
>> that e.g. an use-after-free that occurs for an object cached in the
>> array is undetected. Collected alloc/free stacktraces might also be less
>> useful. This limitation could be changed in the future.
>>
>> On the other hand, KASAN, kmemcg and other hooks are executed on actual
>> allocations and frees by kmem_cache users even if those use the array,
>> so their debugging or accounting accuracy should be unaffected.
>>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Only one possible issue in __pcs_flush_all_cpu(), all other comments
> are nits and suggestions.

Thanks.

>> +        * Limitations: when slub_debug is enabled for the cache, all relevant
>> +        * actions (i.e. poisoning, obtaining stacktraces) and checks happen
>> +        * when objects move between sheaves and slab pages, which may result in
>> +        * e.g. not detecting a use-after-free while the object is in the array
>> +        * cache, and the stacktraces may be less useful.
> 
> I would also love to see a short comparison of sheaves (when objects
> are freed using kfree_rcu()) vs SLAB_TYPESAFE_BY_RCU. I think both
> mechanisms rcu-free objects in bulk but sheaves would not reuse an
> object before RCU grace period is passed. Is that right?

I don't think that's right. SLAB_TYPESAFE_BY_RCU doesn't rcu-free objects in
bulk, the objects are freed immediately. It only rcu-delays freeing the slab
folio once all objects are freed.

>> +struct slub_percpu_sheaves {
>> +       local_lock_t lock;
>> +       struct slab_sheaf *main; /* never NULL when unlocked */
>> +       struct slab_sheaf *spare; /* empty or full, may be NULL */
>> +       struct slab_sheaf *rcu_free;
> 
> Would be nice to have a short comment for rcu_free as well. I could
> guess what main and spare are but for rcu_free had to look further.

Added.

>> +static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
>> +                                  size_t size, void **p);
>> +
>> +
>> +static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
>> +                        gfp_t gfp)
>> +{
>> +       int to_fill = s->sheaf_capacity - sheaf->size;
>> +       int filled;
>> +
>> +       if (!to_fill)
>> +               return 0;
>> +
>> +       filled = __kmem_cache_alloc_bulk(s, gfp, to_fill,
>> +                                        &sheaf->objects[sheaf->size]);
>> +
>> +       if (!filled)
>> +               return -ENOMEM;
>> +
>> +       sheaf->size = s->sheaf_capacity;
> 
> nit: __kmem_cache_alloc_bulk() either allocates requested number of
> objects or returns 0, so the current code is fine but if at some point
> the implementation changes so that it can return smaller number of
> objects than requested (filled < to_fill) then the above assignment
> will become invalid. I think a safer thing here would be to just:
> 
>        sheaf->size += filled;
> 
> which also makes logical sense. Alternatively you could add
> VM_BUG_ON(filled != to_fill) but the increment I think would be
> better.

It's useful to indicate the refill was not successful, for patch 6. So I'm
changing this to:

        sheaf->size += filled;

        stat_add(s, SHEAF_REFILL, filled);

        if (filled < to_fill)
                return -ENOMEM;

        return 0;

>> +
>> +       stat_add(s, SHEAF_REFILL, filled);
>> +
>> +       return 0;
>> +}
>> +
>> +
>> +static struct slab_sheaf *alloc_full_sheaf(struct kmem_cache *s, gfp_t gfp)
>> +{
>> +       struct slab_sheaf *sheaf = alloc_empty_sheaf(s, gfp);
>> +
>> +       if (!sheaf)
>> +               return NULL;
>> +
>> +       if (refill_sheaf(s, sheaf, gfp)) {
>> +               free_empty_sheaf(s, sheaf);
>> +               return NULL;
>> +       }
>> +
>> +       return sheaf;
>> +}
>> +
>> +/*
>> + * Maximum number of objects freed during a single flush of main pcs sheaf.
>> + * Translates directly to an on-stack array size.
>> + */
>> +#define PCS_BATCH_MAX  32U
>> +
> .> +static void __kmem_cache_free_bulk(struct kmem_cache *s, size_t
> size, void **p);
>> +
> 
> A comment clarifying why you are freeing in PCS_BATCH_MAX batches here
> would be helpful. My understanding is that you do that to free objects
> outside of the cpu_sheaves->lock, so you isolate a batch, release the
> lock and then free the batch.

OK.

>> +static void sheaf_flush_main(struct kmem_cache *s)
>> +{
>> +       struct slub_percpu_sheaves *pcs;
>> +       unsigned int batch, remaining;
>> +       void *objects[PCS_BATCH_MAX];
>> +       struct slab_sheaf *sheaf;
>> +       unsigned long flags;
>> +
>> +next_batch:
>> +       local_lock_irqsave(&s->cpu_sheaves->lock, flags);
>> +       pcs = this_cpu_ptr(s->cpu_sheaves);
>> +       sheaf = pcs->main;
>> +
>> +       batch = min(PCS_BATCH_MAX, sheaf->size);
>> +
>> +       sheaf->size -= batch;
>> +       memcpy(objects, sheaf->objects + sheaf->size, batch * sizeof(void *));
>> +
>> +       remaining = sheaf->size;
>> +
>> +       local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
>> +
>> +       __kmem_cache_free_bulk(s, batch, &objects[0]);
>> +
>> +       stat_add(s, SHEAF_FLUSH_MAIN, batch);
>> +
>> +       if (remaining)
>> +               goto next_batch;
>> +}
>> +
> 
> This function seems to be used against either isolated sheaves or in
> slub_cpu_dead() --> __pcs_flush_all_cpu() path where we hold
> slab_mutex and I think that guarantees that the sheaf is unused. Maybe
> a short comment clarifying this requirement or rename the function to
> reflect that? Something like flush_unused_sheaf()?

It's not slab_mutex, but the fact slub_cpu_dead() is executed in a hotplug
phase when the given cpu is already not executing anymore and thus cannot be
manipulating its percpu sheaves, so we are the only one that does.
So I will clarify and rename to sheaf_flush_unused().

>> +
>> +static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
>> +{
>> +       struct slub_percpu_sheaves *pcs;
>> +
>> +       pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
>> +
>> +       if (pcs->spare) {
>> +               sheaf_flush(s, pcs->spare);
>> +               free_empty_sheaf(s, pcs->spare);
>> +               pcs->spare = NULL;
>> +       }
>> +
>> +       // TODO: handle rcu_free
>> +       BUG_ON(pcs->rcu_free);
>> +
>> +       sheaf_flush_main(s);
> 
> Hmm. sheaf_flush_main() always flushes for this_cpu only, so IIUC this
> call will not necessarily flush the main sheaf for the cpu passed to
> __pcs_flush_all_cpu().

Thanks, yes I need to call sheaf_flush_unused(pcs->main). It's ok to do
given my reply above.

>> +/*
>> + * Free an object to the percpu sheaves.
>> + * The object is expected to have passed slab_free_hook() already.
>> + */
>> +static __fastpath_inline
>> +void free_to_pcs(struct kmem_cache *s, void *object)
>> +{
>> +       struct slub_percpu_sheaves *pcs;
>> +       unsigned long flags;
>> +
>> +restart:
>> +       local_lock_irqsave(&s->cpu_sheaves->lock, flags);
>> +       pcs = this_cpu_ptr(s->cpu_sheaves);
>> +
>> +       if (unlikely(pcs->main->size == s->sheaf_capacity)) {
>> +
>> +               struct slab_sheaf *empty;
>> +
>> +               if (!pcs->spare) {
>> +                       empty = barn_get_empty_sheaf(pcs->barn);
>> +                       if (empty) {
>> +                               pcs->spare = pcs->main;
>> +                               pcs->main = empty;
>> +                               goto do_free;
>> +                       }
>> +                       goto alloc_empty;
>> +               }
>> +
>> +               if (pcs->spare->size < s->sheaf_capacity) {
>> +                       stat(s, SHEAF_SWAP);
>> +                       swap(pcs->main, pcs->spare);
>> +                       goto do_free;
>> +               }
>> +
>> +               empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
>> +
>> +               if (!IS_ERR(empty)) {
>> +                       pcs->main = empty;
>> +                       goto do_free;
>> +               }
>> +
>> +               if (PTR_ERR(empty) == -E2BIG) {
>> +                       /* Since we got here, spare exists and is full */
>> +                       struct slab_sheaf *to_flush = pcs->spare;
>> +
>> +                       pcs->spare = NULL;
>> +                       local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
>> +
>> +                       sheaf_flush(s, to_flush);
>> +                       empty = to_flush;
>> +                       goto got_empty;
>> +               }
>> +
>> +alloc_empty:
>> +               local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
>> +
>> +               empty = alloc_empty_sheaf(s, GFP_NOWAIT);
>> +
>> +               if (!empty) {
>> +                       sheaf_flush_main(s);
>> +                       goto restart;
>> +               }
>> +
>> +got_empty:
>> +               local_lock_irqsave(&s->cpu_sheaves->lock, flags);
>> +               pcs = this_cpu_ptr(s->cpu_sheaves);
>> +
>> +               /*
>> +                * if we put any sheaf to barn here, it's because we raced or
>> +                * have been migrated to a different cpu, which should be rare
>> +                * enough so just ignore the barn's limits to simplify
>> +                */
>> +               if (unlikely(pcs->main->size < s->sheaf_capacity)) {
>> +                       if (!pcs->spare)
>> +                               pcs->spare = empty;
>> +                       else
>> +                               barn_put_empty_sheaf(pcs->barn, empty, true);
>> +                       goto do_free;
>> +               }
>> +
>> +               if (!pcs->spare) {
>> +                       pcs->spare = pcs->main;
>> +                       pcs->main = empty;
>> +                       goto do_free;
>> +               }
>> +
>> +               barn_put_full_sheaf(pcs->barn, pcs->main, true);
>> +               pcs->main = empty;
> 
> I find the program flow in this function quite complex and hard to
> follow. I think refactoring the above block starting from "pcs =
> this_cpu_ptr(s->cpu_sheaves)" would somewhat simplify it. That
> eliminates the need for the "got_empty" label and makes the
> locking/unlocking sequence of s->cpu_sheaves->lock a bit more clear.

I'm a bit lost, refactoring how exactly?

>> +       }
>> +
>> +do_free:
>> +       pcs->main->objects[pcs->main->size++] = object;
>> +
>> +       local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
>> +
>> +       stat(s, FREE_PCS);
>> +}


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 01/10] slab: add opt-in caching layer of percpu sheaves
  2025-02-24  8:04   ` Harry Yoo
@ 2025-03-12 14:59     ` Vlastimil Babka
  0 siblings, 0 replies; 55+ messages in thread
From: Vlastimil Babka @ 2025-03-12 14:59 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki,
	linux-mm, linux-kernel, rcu, maple-tree

On 2/24/25 09:04, Harry Yoo wrote:
>> +static void barn_shrink(struct kmem_cache *s, struct node_barn *barn)
>> +{
>> +	struct list_head empty_list;
>> +	struct list_head full_list;
>> +	struct slab_sheaf *sheaf, *sheaf2;
>> +	unsigned long flags;
>> +
>> +	INIT_LIST_HEAD(&empty_list);
>> +	INIT_LIST_HEAD(&full_list);
>> +
>> +	spin_lock_irqsave(&barn->lock, flags);
>> +
>> +	list_splice_init(&barn->sheaves_full, &full_list);
>> +	barn->nr_full = 0;
>> +	list_splice_init(&barn->sheaves_empty, &empty_list);
>> +	barn->nr_empty = 0;
>> +
>> +	spin_unlock_irqrestore(&barn->lock, flags);
>> +
>> +	list_for_each_entry_safe(sheaf, sheaf2, &full_list, barn_list) {
>> +		sheaf_flush(s, sheaf);
>> +		list_move(&sheaf->barn_list, &empty_list);
>> +	}
> 
> nit: is this list_move() necessary?

You mean I can just do free_empty_sheaf(s, sheaf); ? Yeah why not.

>> +
>> +	list_for_each_entry_safe(sheaf, sheaf2, &empty_list, barn_list)
>> +		free_empty_sheaf(s, sheaf);
>> +}
> 
> Otherwise looks good to me.

Thanks.



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 01/10] slab: add opt-in caching layer of percpu sheaves
  2025-03-12 14:57     ` Vlastimil Babka
@ 2025-03-12 15:14       ` Suren Baghdasaryan
  2025-03-17 10:09         ` Vlastimil Babka
  0 siblings, 1 reply; 55+ messages in thread
From: Suren Baghdasaryan @ 2025-03-12 15:14 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Wed, Mar 12, 2025 at 7:58 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 2/22/25 23:46, Suren Baghdasaryan wrote:
> > On Fri, Feb 14, 2025 at 8:27 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> Specifying a non-zero value for a new struct kmem_cache_args field
> >> sheaf_capacity will setup a caching layer of percpu arrays called
> >> sheaves of given capacity for the created cache.
> >>
> >> Allocations from the cache will allocate via the percpu sheaves (main or
> >> spare) as long as they have no NUMA node preference. Frees will also
> >> refill one of the sheaves.
> >>
> >> When both percpu sheaves are found empty during an allocation, an empty
> >> sheaf may be replaced with a full one from the per-node barn. If none
> >> are available and the allocation is allowed to block, an empty sheaf is
> >> refilled from slab(s) by an internal bulk alloc operation. When both
> >> percpu sheaves are full during freeing, the barn can replace a full one
> >> with an empty one, unless over a full sheaves limit. In that case a
> >> sheaf is flushed to slab(s) by an internal bulk free operation. Flushing
> >> sheaves and barns is also wired to the existing cpu flushing and cache
> >> shrinking operations.
> >>
> >> The sheaves do not distinguish NUMA locality of the cached objects. If
> >> an allocation is requested with kmem_cache_alloc_node() with a specific
> >> node (not NUMA_NO_NODE), sheaves are bypassed.
> >>
> >> The bulk operations exposed to slab users also try to utilize the
> >> sheaves as long as the necessary (full or empty) sheaves are available
> >> on the cpu or in the barn. Once depleted, they will fallback to bulk
> >> alloc/free to slabs directly to avoid double copying.
> >>
> >> Sysfs stat counters alloc_cpu_sheaf and free_cpu_sheaf count objects
> >> allocated or freed using the sheaves. Counters sheaf_refill,
> >> sheaf_flush_main and sheaf_flush_other count objects filled or flushed
> >> from or to slab pages, and can be used to assess how effective the
> >> caching is. The refill and flush operations will also count towards the
> >> usual alloc_fastpath/slowpath, free_fastpath/slowpath and other
> >> counters.
> >>
> >> Access to the percpu sheaves is protected by local_lock_irqsave()
> >> operations, each per-NUMA-node barn has a spin_lock.
> >>
> >> A current limitation is that when slub_debug is enabled for a cache with
> >> percpu sheaves, the objects in the array are considered as allocated from
> >> the slub_debug perspective, and the alloc/free debugging hooks occur
> >> when moving the objects between the array and slab pages. This means
> >> that e.g. an use-after-free that occurs for an object cached in the
> >> array is undetected. Collected alloc/free stacktraces might also be less
> >> useful. This limitation could be changed in the future.
> >>
> >> On the other hand, KASAN, kmemcg and other hooks are executed on actual
> >> allocations and frees by kmem_cache users even if those use the array,
> >> so their debugging or accounting accuracy should be unaffected.
> >>
> >> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> >
> > Only one possible issue in __pcs_flush_all_cpu(), all other comments
> > are nits and suggestions.
>
> Thanks.
>
> >> +        * Limitations: when slub_debug is enabled for the cache, all relevant
> >> +        * actions (i.e. poisoning, obtaining stacktraces) and checks happen
> >> +        * when objects move between sheaves and slab pages, which may result in
> >> +        * e.g. not detecting a use-after-free while the object is in the array
> >> +        * cache, and the stacktraces may be less useful.
> >
> > I would also love to see a short comparison of sheaves (when objects
> > are freed using kfree_rcu()) vs SLAB_TYPESAFE_BY_RCU. I think both
> > mechanisms rcu-free objects in bulk but sheaves would not reuse an
> > object before RCU grace period is passed. Is that right?
>
> I don't think that's right. SLAB_TYPESAFE_BY_RCU doesn't rcu-free objects in
> bulk, the objects are freed immediately. It only rcu-delays freeing the slab
> folio once all objects are freed.

Yes, you are right.

>
> >> +struct slub_percpu_sheaves {
> >> +       local_lock_t lock;
> >> +       struct slab_sheaf *main; /* never NULL when unlocked */
> >> +       struct slab_sheaf *spare; /* empty or full, may be NULL */
> >> +       struct slab_sheaf *rcu_free;
> >
> > Would be nice to have a short comment for rcu_free as well. I could
> > guess what main and spare are but for rcu_free had to look further.
>
> Added.
>
> >> +static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
> >> +                                  size_t size, void **p);
> >> +
> >> +
> >> +static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
> >> +                        gfp_t gfp)
> >> +{
> >> +       int to_fill = s->sheaf_capacity - sheaf->size;
> >> +       int filled;
> >> +
> >> +       if (!to_fill)
> >> +               return 0;
> >> +
> >> +       filled = __kmem_cache_alloc_bulk(s, gfp, to_fill,
> >> +                                        &sheaf->objects[sheaf->size]);
> >> +
> >> +       if (!filled)
> >> +               return -ENOMEM;
> >> +
> >> +       sheaf->size = s->sheaf_capacity;
> >
> > nit: __kmem_cache_alloc_bulk() either allocates requested number of
> > objects or returns 0, so the current code is fine but if at some point
> > the implementation changes so that it can return smaller number of
> > objects than requested (filled < to_fill) then the above assignment
> > will become invalid. I think a safer thing here would be to just:
> >
> >        sheaf->size += filled;
> >
> > which also makes logical sense. Alternatively you could add
> > VM_BUG_ON(filled != to_fill) but the increment I think would be
> > better.
>
> It's useful to indicate the refill was not successful, for patch 6. So I'm
> changing this to:
>
>         sheaf->size += filled;
>
>         stat_add(s, SHEAF_REFILL, filled);
>
>         if (filled < to_fill)
>                 return -ENOMEM;
>
>         return 0;

That looks good to me.

>
> >> +
> >> +       stat_add(s, SHEAF_REFILL, filled);
> >> +
> >> +       return 0;
> >> +}
> >> +
> >> +
> >> +static struct slab_sheaf *alloc_full_sheaf(struct kmem_cache *s, gfp_t gfp)
> >> +{
> >> +       struct slab_sheaf *sheaf = alloc_empty_sheaf(s, gfp);
> >> +
> >> +       if (!sheaf)
> >> +               return NULL;
> >> +
> >> +       if (refill_sheaf(s, sheaf, gfp)) {
> >> +               free_empty_sheaf(s, sheaf);
> >> +               return NULL;
> >> +       }
> >> +
> >> +       return sheaf;
> >> +}
> >> +
> >> +/*
> >> + * Maximum number of objects freed during a single flush of main pcs sheaf.
> >> + * Translates directly to an on-stack array size.
> >> + */
> >> +#define PCS_BATCH_MAX  32U
> >> +
> > .> +static void __kmem_cache_free_bulk(struct kmem_cache *s, size_t
> > size, void **p);
> >> +
> >
> > A comment clarifying why you are freeing in PCS_BATCH_MAX batches here
> > would be helpful. My understanding is that you do that to free objects
> > outside of the cpu_sheaves->lock, so you isolate a batch, release the
> > lock and then free the batch.
>
> OK.
>
> >> +static void sheaf_flush_main(struct kmem_cache *s)
> >> +{
> >> +       struct slub_percpu_sheaves *pcs;
> >> +       unsigned int batch, remaining;
> >> +       void *objects[PCS_BATCH_MAX];
> >> +       struct slab_sheaf *sheaf;
> >> +       unsigned long flags;
> >> +
> >> +next_batch:
> >> +       local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> >> +       pcs = this_cpu_ptr(s->cpu_sheaves);
> >> +       sheaf = pcs->main;
> >> +
> >> +       batch = min(PCS_BATCH_MAX, sheaf->size);
> >> +
> >> +       sheaf->size -= batch;
> >> +       memcpy(objects, sheaf->objects + sheaf->size, batch * sizeof(void *));
> >> +
> >> +       remaining = sheaf->size;
> >> +
> >> +       local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> >> +
> >> +       __kmem_cache_free_bulk(s, batch, &objects[0]);
> >> +
> >> +       stat_add(s, SHEAF_FLUSH_MAIN, batch);
> >> +
> >> +       if (remaining)
> >> +               goto next_batch;
> >> +}
> >> +
> >
> > This function seems to be used against either isolated sheaves or in
> > slub_cpu_dead() --> __pcs_flush_all_cpu() path where we hold
> > slab_mutex and I think that guarantees that the sheaf is unused. Maybe
> > a short comment clarifying this requirement or rename the function to
> > reflect that? Something like flush_unused_sheaf()?
>
> It's not slab_mutex, but the fact slub_cpu_dead() is executed in a hotplug
> phase when the given cpu is already not executing anymore and thus cannot be
> manipulating its percpu sheaves, so we are the only one that does.
> So I will clarify and rename to sheaf_flush_unused().

I see. Thanks for explaining.

>
> >> +
> >> +static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
> >> +{
> >> +       struct slub_percpu_sheaves *pcs;
> >> +
> >> +       pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> >> +
> >> +       if (pcs->spare) {
> >> +               sheaf_flush(s, pcs->spare);
> >> +               free_empty_sheaf(s, pcs->spare);
> >> +               pcs->spare = NULL;
> >> +       }
> >> +
> >> +       // TODO: handle rcu_free
> >> +       BUG_ON(pcs->rcu_free);
> >> +
> >> +       sheaf_flush_main(s);
> >
> > Hmm. sheaf_flush_main() always flushes for this_cpu only, so IIUC this
> > call will not necessarily flush the main sheaf for the cpu passed to
> > __pcs_flush_all_cpu().
>
> Thanks, yes I need to call sheaf_flush_unused(pcs->main). It's ok to do
> given my reply above.
>
> >> +/*
> >> + * Free an object to the percpu sheaves.
> >> + * The object is expected to have passed slab_free_hook() already.
> >> + */
> >> +static __fastpath_inline
> >> +void free_to_pcs(struct kmem_cache *s, void *object)
> >> +{
> >> +       struct slub_percpu_sheaves *pcs;
> >> +       unsigned long flags;
> >> +
> >> +restart:
> >> +       local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> >> +       pcs = this_cpu_ptr(s->cpu_sheaves);
> >> +
> >> +       if (unlikely(pcs->main->size == s->sheaf_capacity)) {
> >> +
> >> +               struct slab_sheaf *empty;
> >> +
> >> +               if (!pcs->spare) {
> >> +                       empty = barn_get_empty_sheaf(pcs->barn);
> >> +                       if (empty) {
> >> +                               pcs->spare = pcs->main;
> >> +                               pcs->main = empty;
> >> +                               goto do_free;
> >> +                       }
> >> +                       goto alloc_empty;
> >> +               }
> >> +
> >> +               if (pcs->spare->size < s->sheaf_capacity) {
> >> +                       stat(s, SHEAF_SWAP);
> >> +                       swap(pcs->main, pcs->spare);
> >> +                       goto do_free;
> >> +               }
> >> +
> >> +               empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
> >> +
> >> +               if (!IS_ERR(empty)) {
> >> +                       pcs->main = empty;
> >> +                       goto do_free;
> >> +               }
> >> +
> >> +               if (PTR_ERR(empty) == -E2BIG) {
> >> +                       /* Since we got here, spare exists and is full */
> >> +                       struct slab_sheaf *to_flush = pcs->spare;
> >> +
> >> +                       pcs->spare = NULL;
> >> +                       local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> >> +
> >> +                       sheaf_flush(s, to_flush);
> >> +                       empty = to_flush;
> >> +                       goto got_empty;
> >> +               }
> >> +
> >> +alloc_empty:
> >> +               local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> >> +
> >> +               empty = alloc_empty_sheaf(s, GFP_NOWAIT);
> >> +
> >> +               if (!empty) {
> >> +                       sheaf_flush_main(s);
> >> +                       goto restart;
> >> +               }
> >> +
> >> +got_empty:
> >> +               local_lock_irqsave(&s->cpu_sheaves->lock, flags);
> >> +               pcs = this_cpu_ptr(s->cpu_sheaves);
> >> +
> >> +               /*
> >> +                * if we put any sheaf to barn here, it's because we raced or
> >> +                * have been migrated to a different cpu, which should be rare
> >> +                * enough so just ignore the barn's limits to simplify
> >> +                */
> >> +               if (unlikely(pcs->main->size < s->sheaf_capacity)) {
> >> +                       if (!pcs->spare)
> >> +                               pcs->spare = empty;
> >> +                       else
> >> +                               barn_put_empty_sheaf(pcs->barn, empty, true);
> >> +                       goto do_free;
> >> +               }
> >> +
> >> +               if (!pcs->spare) {
> >> +                       pcs->spare = pcs->main;
> >> +                       pcs->main = empty;
> >> +                       goto do_free;
> >> +               }
> >> +
> >> +               barn_put_full_sheaf(pcs->barn, pcs->main, true);
> >> +               pcs->main = empty;
> >
> > I find the program flow in this function quite complex and hard to
> > follow. I think refactoring the above block starting from "pcs =
> > this_cpu_ptr(s->cpu_sheaves)" would somewhat simplify it. That
> > eliminates the need for the "got_empty" label and makes the
> > locking/unlocking sequence of s->cpu_sheaves->lock a bit more clear.
>
> I'm a bit lost, refactoring how exactly?

I thought moving the code above into a function above starting from
"pcs = this_cpu_ptr(s->cpu_sheaves)" into its own function would
simplify the flow. But as I said, it's a nit. If you try and don't
like that feel free to ignore this suggestion.

>
> >> +       }
> >> +
> >> +do_free:
> >> +       pcs->main->objects[pcs->main->size++] = object;
> >> +
> >> +       local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
> >> +
> >> +       stat(s, FREE_PCS);
> >> +}


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 02/10] slab: add sheaf support for batching kfree_rcu() operations
  2025-02-24  8:40   ` Harry Yoo
@ 2025-03-12 16:16     ` Vlastimil Babka
  0 siblings, 0 replies; 55+ messages in thread
From: Vlastimil Babka @ 2025-03-12 16:16 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki,
	linux-mm, linux-kernel, rcu, maple-tree

On 2/24/25 09:40, Harry Yoo wrote:
>> +static bool kfree_rcu_sheaf(void *obj)
>> +{
>> +	struct kmem_cache *s;
>> +	struct folio *folio;
>> +	struct slab *slab;
>> +
>> +	folio = virt_to_folio(obj);
>> +	if (unlikely(!folio_test_slab(folio)))
>> +		return false;
> 
> Does virt_to_folio() work for vmalloc addresses?

Hm, no. Good catch.

> Probably it should check is_vmalloc_addr() first?

Yes, thanks!

> Otherwise look good to me.
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 02/10] slab: add sheaf support for batching kfree_rcu() operations
  2025-02-22 23:08   ` Suren Baghdasaryan
@ 2025-03-12 16:19     ` Vlastimil Babka
  0 siblings, 0 replies; 55+ messages in thread
From: Vlastimil Babka @ 2025-03-12 16:19 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On 2/23/25 00:08, Suren Baghdasaryan wrote:
> On Fri, Feb 14, 2025 at 8:27 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
>> For caches with sheaves, on each cpu maintain a rcu_free sheaf in
>> addition to main and spare sheaves.
>>
>> kfree_rcu() operations will try to put objects on this sheaf. Once full,
>> the sheaf is detached and submitted to call_rcu() with a handler that
>> will try to put in in the barn, or flush to slab pages using bulk free,
> 
> s/in in/it in
> 
>> when the barn is full. Then a new empty sheaf must be obtained to put
>> more objects there.
>>
>> It's possible that no free sheaves are available to use for a new
>> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
>> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
>> kfree_rcu() machinery.
>>
>> Expected advantages:
>> - batching the kfree_rcu() operations, that could eventually replace the
>>   existing batching
>> - sheaves can be reused for allocations via barn instead of being
>>   flushed to slabs, which is more efficient
>>   - this includes cases where only some cpus are allowed to process rcu
>>     callbacks (Android)
>>
>> Possible disadvantage:
>> - objects might be waiting for more than their grace period (it is
>>   determined by the last object freed into the sheaf), increasing memory
>>   usage - but the existing batching does that too?
>>
>> Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
>> implementation favors smaller memory footprint over performance.
>>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>

Thanks.

>> @@ -2569,6 +2571,24 @@ static void sheaf_flush(struct kmem_cache *s, struct slab_sheaf *sheaf)
>>         sheaf->size = 0;
>>  }
>>
>> +static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
>> +                                    struct slab_sheaf *sheaf);
>> +
>> +static void rcu_free_sheaf_nobarn(struct rcu_head *head)
>> +{
>> +       struct slab_sheaf *sheaf;
>> +       struct kmem_cache *s;
>> +
>> +       sheaf = container_of(head, struct slab_sheaf, rcu_head);
>> +       s = sheaf->cache;
> 
> Ah, that's where you are using sheaf->cache. Maybe you should
> introduce it in this patch?

Yeah. Will also move the addition of rcu_free to struct slub_percpu_sheaves
instead of those TODOs.



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 06/10] slab: sheaf prefilling for guaranteed allocations
  2025-02-23  3:54   ` Suren Baghdasaryan
  2025-02-25  7:30     ` Harry Yoo
@ 2025-03-12 17:09     ` Vlastimil Babka
  1 sibling, 0 replies; 55+ messages in thread
From: Vlastimil Babka @ 2025-03-12 17:09 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On 2/23/25 04:54, Suren Baghdasaryan wrote:
> On Fri, Feb 14, 2025 at 8:27 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> Add functions for efficient guaranteed allocations e.g. in a critical
>> section that cannot sleep, when the exact number of allocations is not
>> known beforehand, but an upper limit can be calculated.
>>
>> kmem_cache_prefill_sheaf() returns a sheaf containing at least given
>> number of objects.
>>
>> kmem_cache_alloc_from_sheaf() will allocate an object from the sheaf
>> and is guaranteed not to fail until depleted.
>>
>> kmem_cache_return_sheaf() is for giving the sheaf back to the slab
>> allocator after the critical section. This will also attempt to refill
>> it to cache's sheaf capacity for better efficiency of sheaves handling,
>> but it's not stricly necessary to succeed.
>>
>> kmem_cache_refill_sheaf() can be used to refill a previously obtained
>> sheaf to requested size. If the current size is sufficient, it does
>> nothing. If the requested size exceeds cache's sheaf_capacity and the
>> sheaf's current capacity, the sheaf will be replaced with a new one,
>> hence the indirect pointer parameter.
>>
>> kmem_cache_sheaf_size() can be used to query the current size.
>>
>> The implementation supports requesting sizes that exceed cache's
>> sheaf_capacity, but it is not efficient - such sheaves are allocated
>> fresh in kmem_cache_prefill_sheaf() and flushed and freed immediately by
>> kmem_cache_return_sheaf(). kmem_cache_refill_sheaf() might be expecially
> 
> s/expecially/especially
> 
>> ineffective when replacing a sheaf with a new one of a larger capacity.
>> It is therefore better to size cache's sheaf_capacity accordingly.
> 
> If support for sizes exceeding sheaf_capacity adds much complexity
> with no performance benefits, I think it would be ok not to support
> them at all. Users know the capacity of a particular kmem_cache, so
> they can use this API only when their needs are within sheaf_capacity,
> otherwise either size the sheaf appropriately or use slab bulk
> allocation.

As Harry explained, the users (e.g. maple tree) would have to implement the
fallback for unusual situations instead, so it's better to implement it just
once here.

>>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>

Thanks.

>> +/*
>> + * Use this to return a sheaf obtained by kmem_cache_prefill_sheaf()
>> + * It tries to refill the sheaf back to the cache's sheaf_capacity
>> + * to avoid handling partially full sheaves.
>> + *
>> + * If the refill fails because gfp is e.g. GFP_NOWAIT, the sheaf is
>> + * instead dissolved
> 
> Refilling the sheaf here assumes that in the future we are more likely
> to allocate than to free objects or shrink the slab. If the reverse is
> true then it would make sense to flush the sheaf and add it as an
> empty one into the barn. The fact that flushing can't fail would be
> another advantage... We don't know the future but should we be
> predicting a more costly case?

What the comment doesn't say is we first try to make the sheaf become
pcs->spare without any refill. This is the ideal scenario if nobody
interrupts us between prefill (we grab the spare) and return (we return the
spare).

Also the refill is only attempted if the barn can accept full sheaf.

I have clarified the comment.

Maybe we could make the decision to flush e.g. if the sheaf is below half of
the capacity, but that can be subject to further performance evaluation.

>> + */
>> +void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
>> +                            struct slab_sheaf *sheaf)
>> +{
>> +       struct slub_percpu_sheaves *pcs;
>> +       bool refill = false;
>> +       struct node_barn *barn;
>> +
>> +       if (unlikely(sheaf->capacity != s->sheaf_capacity)) {
>> +               sheaf_flush(s, sheaf);
>> +               kfree(sheaf);
>> +               return;
>> +       }
>> +
>> +       localtry_lock(&s->cpu_sheaves->lock);
>> +       pcs = this_cpu_ptr(s->cpu_sheaves);
>> +
>> +       if (!pcs->spare) {
>> +               pcs->spare = sheaf;
>> +               sheaf = NULL;
>> +       } else if (pcs->barn->nr_full >= MAX_FULL_SHEAVES) {
>> +               /* racy check */
>> +               barn = pcs->barn;
>> +               refill = true;
>> +       }
>> +
>> +       localtry_unlock(&s->cpu_sheaves->lock);
>> +
>> +       if (!sheaf)
>> +               return;
>> +
>> +       /*
>> +        * if the barn is full of full sheaves or we fail to refill the sheaf,
>> +        * simply flush and free it
>> +        */
>> +       if (!refill || refill_sheaf(s, sheaf, gfp)) {
>> +               sheaf_flush(s, sheaf);
>> +               free_empty_sheaf(s, sheaf);
>> +               return;
>> +       }
>> +
>> +       /* we racily determined the sheaf would fit, so now force it */
>> +       barn_put_full_sheaf(barn, sheaf, true);
>> +}
>> +
>> +/*
>> + * refill a sheaf previously returned by kmem_cache_prefill_sheaf to at least
>> + * the given size
>> + *
>> + * the sheaf might be replaced by a new one when requesting more than
>> + * s->sheaf_capacity objects if such replacement is necessary, but the refill
>> + * fails (with -ENOMEM), the existing sheaf is left intact
>> + */
>> +int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
>> +                           struct slab_sheaf **sheafp, unsigned int size)
>> +{
>> +       struct slab_sheaf *sheaf;
>> +
>> +       /*
>> +        * TODO: do we want to support *sheaf == NULL to be equivalent of
>> +        * kmem_cache_prefill_sheaf() ?
>> +        */
>> +       if (!sheafp || !(*sheafp))
>> +               return -EINVAL;
>> +
>> +       sheaf = *sheafp;
>> +       if (sheaf->size >= size)
>> +               return 0;
>> +
>> +       if (likely(sheaf->capacity >= size)) {
>> +               if (likely(sheaf->capacity == s->sheaf_capacity))
>> +                       return refill_sheaf(s, sheaf, gfp);
>> +
>> +               if (!__kmem_cache_alloc_bulk(s, gfp, sheaf->capacity - sheaf->size,
>> +                                            &sheaf->objects[sheaf->size])) {
>> +                       return -ENOMEM;
>> +               }
>> +               sheaf->size = sheaf->capacity;
>> +
>> +               return 0;
>> +       }
>> +
>> +       /*
>> +        * We had a regular sized sheaf and need an oversize one, or we had an
>> +        * oversize one already but need a larger one now.
>> +        * This should be a very rare path so let's not complicate it.
>> +        */
>> +       sheaf = kmem_cache_prefill_sheaf(s, gfp, size);
> 
> WIth all the above I think you always end up refilling up to
> sheaf->capacity. Not sure if we should mention that in the comment for
> this function because your statement about refilling to at least the
> given size is still correct.

OK mentioned it in the comment.



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 06/10] slab: sheaf prefilling for guaranteed allocations
  2025-02-25  8:00   ` Harry Yoo
@ 2025-03-12 18:16     ` Vlastimil Babka
  0 siblings, 0 replies; 55+ messages in thread
From: Vlastimil Babka @ 2025-03-12 18:16 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki,
	linux-mm, linux-kernel, rcu, maple-tree

On 2/25/25 09:00, Harry Yoo wrote:
> On Fri, Feb 14, 2025 at 05:27:42PM +0100, Vlastimil Babka wrote:
>> Add functions for efficient guaranteed allocations e.g. in a critical
>> section that cannot sleep, when the exact number of allocations is not
>> known beforehand, but an upper limit can be calculated.
>> 
>> kmem_cache_prefill_sheaf() returns a sheaf containing at least given
>> number of objects.
>> 
>> kmem_cache_alloc_from_sheaf() will allocate an object from the sheaf
>> and is guaranteed not to fail until depleted.
>> 
>> kmem_cache_return_sheaf() is for giving the sheaf back to the slab
>> allocator after the critical section. This will also attempt to refill
>> it to cache's sheaf capacity for better efficiency of sheaves handling,
>> but it's not stricly necessary to succeed.
>> 
>> kmem_cache_refill_sheaf() can be used to refill a previously obtained
>> sheaf to requested size. If the current size is sufficient, it does
>> nothing. If the requested size exceeds cache's sheaf_capacity and the
>> sheaf's current capacity, the sheaf will be replaced with a new one,
>> hence the indirect pointer parameter.
>> 
>> kmem_cache_sheaf_size() can be used to query the current size.
>> 
>> The implementation supports requesting sizes that exceed cache's
>> sheaf_capacity, but it is not efficient - such sheaves are allocated
>> fresh in kmem_cache_prefill_sheaf() and flushed and freed immediately by
>> kmem_cache_return_sheaf(). kmem_cache_refill_sheaf() might be expecially
>> ineffective when replacing a sheaf with a new one of a larger capacity.
>> It is therefore better to size cache's sheaf_capacity accordingly.
>> 
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
>>  include/linux/slab.h |  16 ++++
>>  mm/slub.c            | 227 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 243 insertions(+)
> 
> [... snip ... ]
> 
>> @@ -4831,6 +4857,207 @@ void *kmem_cache_alloc_node_noprof(struct kmem_cache *s, gfp_t gfpflags, int nod
>>  }
>>  EXPORT_SYMBOL(kmem_cache_alloc_node_noprof);
>>  
>> +
>> +/*
>> + * returns a sheaf that has least the requested size
>> + * when prefilling is needed, do so with given gfp flags
>> + *
>> + * return NULL if sheaf allocation or prefilling failed
>> + */
>> +struct slab_sheaf *
>> +kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
>> +{
>> +	struct slub_percpu_sheaves *pcs;
>> +	struct slab_sheaf *sheaf = NULL;
>> +
>> +	if (unlikely(size > s->sheaf_capacity)) {
>> +		sheaf = kzalloc(struct_size(sheaf, objects, size), gfp);
>> +		if (!sheaf)
>> +			return NULL;
>> +
>> +		sheaf->cache = s;
>> +		sheaf->capacity = size;
>> +
>> +		if (!__kmem_cache_alloc_bulk(s, gfp, size,
>> +					     &sheaf->objects[0])) {
>> +			kfree(sheaf);
>> +			return NULL;
>> +		}
>> +
>> +		sheaf->size = size;
>> +
>> +		return sheaf;
>> +	}
>> +
>> +	localtry_lock(&s->cpu_sheaves->lock);
>> +	pcs = this_cpu_ptr(s->cpu_sheaves);
>> +
>> +	if (pcs->spare) {
>> +		sheaf = pcs->spare;
>> +		pcs->spare = NULL;
>> +	}
>> +
>> +	if (!sheaf)
>> +		sheaf = barn_get_full_or_empty_sheaf(pcs->barn);
> 
> Can this be outside localtry lock?

Strictly speaking we'd have to save the barn pointer first, otherwise cpu
hotremove could bite us, I think. But not worth the trouble, as localtry
lock is just disabling preemption and taking the barn lock would disable
irqs anyway. So we're not increasing contention by holding the localtry lock
more than strictly necessary.

> 
>> +
>> +	localtry_unlock(&s->cpu_sheaves->lock);
>> +
>> +	if (!sheaf) {
>> +		sheaf = alloc_empty_sheaf(s, gfp);
>> +	}
>> +
>> +	if (sheaf && sheaf->size < size) {
>> +		if (refill_sheaf(s, sheaf, gfp)) {
>> +			sheaf_flush(s, sheaf);
>> +			free_empty_sheaf(s, sheaf);
>> +			sheaf = NULL;
>> +		}
>> +	}
>> +
>> +	if (sheaf)
>> +		sheaf->capacity = s->sheaf_capacity;
>> +
>> +	return sheaf;
>> +}
>> +
>> +/*
>> + * Use this to return a sheaf obtained by kmem_cache_prefill_sheaf()
>> + * It tries to refill the sheaf back to the cache's sheaf_capacity
>> + * to avoid handling partially full sheaves.
>> + *
>> + * If the refill fails because gfp is e.g. GFP_NOWAIT, the sheaf is
>> + * instead dissolved
>> + */
>> +void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
>> +			     struct slab_sheaf *sheaf)
>> +{
>> +	struct slub_percpu_sheaves *pcs;
>> +	bool refill = false;
>> +	struct node_barn *barn;
>> +
>> +	if (unlikely(sheaf->capacity != s->sheaf_capacity)) {
>> +		sheaf_flush(s, sheaf);
>> +		kfree(sheaf);
>> +		return;
>> +	}
>> +
>> +	localtry_lock(&s->cpu_sheaves->lock);
>> +	pcs = this_cpu_ptr(s->cpu_sheaves);
>> +
>> +	if (!pcs->spare) {
>> +		pcs->spare = sheaf;
>> +		sheaf = NULL;
>> +	} else if (pcs->barn->nr_full >= MAX_FULL_SHEAVES) {
> 
> Did you mean (pcs->barn->nr_full < MAX_FULL_SHEAVES)?

Oops yeah, fixing this can potentially improve performance.

> Otherwise looks good to me.

Thanks a lot!


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 07/10] slab: determine barn status racily outside of lock
  2025-02-25  8:54   ` Harry Yoo
@ 2025-03-12 18:23     ` Vlastimil Babka
  0 siblings, 0 replies; 55+ messages in thread
From: Vlastimil Babka @ 2025-03-12 18:23 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki,
	linux-mm, linux-kernel, rcu, maple-tree

On 2/25/25 09:54, Harry Yoo wrote:
> On Fri, Feb 14, 2025 at 05:27:43PM +0100, Vlastimil Babka wrote:
>> The possibility of many barn operations is determined by the current
>> number of full or empty sheaves. Taking the barn->lock just to find out
>> that e.g. there are no empty sheaves results in unnecessary overhead and
>> lock contention. Thus perform these checks outside of the lock with a
>> data_race() annotated variable read and fail quickly without taking the
>> lock.
>> 
>> Checks for sheaf availability that racily succeed have to be obviously
>> repeated under the lock for correctness, but we can skip repeating
>> checks if there are too many sheaves on the given list as the limits
>> don't need to be strict.
>> 
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Looks good to me,
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> 
> in kmem_cache_return_sheaf:
>> if (!pcs->spare) {                                                      
>> 	pcs->spare = sheaf;                                             
>>	sheaf = NULL;                                                   
>> } else if (pcs->barn->nr_full >= MAX_FULL_SHEAVES) {                    
>>	/* racy check */                                                
>>	barn = pcs->barn;                                               
>>	keep = true;                                                    
>> }  
> 
> By the way this code also needs data_race()?

Right, will add, thanks.




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 00/10] SLUB percpu sheaves
  2025-03-04 19:08               ` Liam R. Howlett
@ 2025-03-14 17:10                 ` Suren Baghdasaryan
  2025-03-17 11:08                   ` Vlastimil Babka
  0 siblings, 1 reply; 55+ messages in thread
From: Suren Baghdasaryan @ 2025-03-14 17:10 UTC (permalink / raw)
  To: Liam R. Howlett, Vlastimil Babka, Suren Baghdasaryan,
	Kent Overstreet, Christoph Lameter, David Rientjes,
	Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, Sebastian Andrzej Siewior,
	Alexei Starovoitov, Sidhartha Kumar

On Tue, Mar 4, 2025 at 11:08 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Vlastimil Babka <vbabka@suse.cz> [250304 05:55]:
> > On 2/25/25 21:26, Suren Baghdasaryan wrote:
> > > On Mon, Feb 24, 2025 at 1:12 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >>
> > >> >
> > >> > > The values represent the total time it took to perform mmap syscalls, less is
> > >> > > better.
> > >> > >
> > >> > > (1)                  baseline       control
> > >> > > Little core       7.58327       6.614939 (-12.77%)
> > >> > > Medium core  2.125315     1.428702 (-32.78%)
> > >> > > Big core          0.514673     0.422948 (-17.82%)
> > >> > >
> > >> > > (2)                  baseline      control
> > >> > > Little core       7.58327       5.141478 (-32.20%)
> > >> > > Medium core  2.125315     0.427692 (-79.88%)
> > >> > > Big core          0.514673    0.046642 (-90.94%)
> > >> > >
> > >> > > (3)                   baseline      control
> > >> > > Little core        7.58327      4.779624 (-36.97%)
> > >> > > Medium core   2.125315    0.450368 (-78.81%)
> > >> > > Big core           0.514673    0.037776 (-92.66%)
> > >
> > > (4)                   baseline      control
> > > Little core        7.58327      4.642977 (-38.77%)
> > > Medium core   2.125315    0.373692 (-82.42%)
> > > Big core           0.514673    0.043613 (-91.53%)
> > >
> > > I think the difference between (3) and (4) is noise.
> > > Thanks,
> > > Suren.
> >
> > Hi, as we discussed yesterday, it would be useful to set the baseline to
> > include everything before sheaves as that's already on the way to 6.15, so
> > we can see more clearly what sheaves do relative to that. So at this point
> > it's the vma lock conversion including TYPESAFE_BY_RCU (that's not undone,
> > thus like in scenario (4)), and benchmark the following:
> >
> > - baseline - vma locking conversion with TYPESAFE_BY_RCU
> > - baseline+maple tree node reduction from mm-unstable (Liam might point out
> > which patches?)
>
> Sid's patches [1] are already in mm-unstable.
>
>
> > - the above + this series + sheaves enabled for vm_area_struct cache
> > - the above + full maple node sheaves conversion [1]
> > - the above + the top-most patches from [1] that are optimizations with a
> > tradeoff (not clear win-win) so it would be good to know if they are useful
> >
> > [1] currently the 4 commits here:
> > https://web.git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-percpu-sheaves-v2-maple
> > from "maple_tree: Sheaf conversion" to "maple_tree: Clean up sheaf"
> > but as Liam noted, they won't cherry pick without conflict once maple tree
> > node reduction is backported, but he's working on a rebase
>
> Rebased maple tree sheaves, patches are here [2].

Hi Folks,
Sorry for the delay. I got the numbers last week but they looked a bit
weird, so I reran the test increasing the number of iterations to make
sure noise is not a factor. That took most of this week. Below are the
results. Please note that I had to backport the patchsets to 6.12
because that's the closest stable Android kernel I can use. I measure
cumulative time to execute mmap syscalls, so the smaller the number
the better mmap performance is:

baseline: 6.12 + vm_lock conversion and TYPESAFE_BY_RCU
config1: baseline + Sid's patches [1]
config2: sheaves RFC
config3: config1 + vm_area_struct with sheaves
config4: config2 + maple_tree Sheaf conversion [2]
config5: config3 + 2 last optimization patches from [3]

               config1     config2     config3     config4     config5
Little core    -0.10%      -10.10%     -12.89%     -10.02%     -13.64%
Mid core       -21.05%     -37.31%     -44.97%     -15.81%     -22.15%
Big core       -17.17%     -34.41%     -45.68%     -11.39%     -15.29%

[1] https://lore.kernel.org/linux-mm/20250227204823.758784-1-sidhartha.kumar@oracle.com/
[2] https://www.infradead.org/git/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/sheaves_rebase_20250304
[3] https://web.git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-percpu-sheaves-v2-maple

From the numbers, it looks like config4 regresses the performance and
that's what looked weird to me last week and I wanted to confirm this.
But from sheaves POV, it looks like they provide the benefits I saw
before. Sid's patches which I did not test separately before also look
beneficial.
Thanks,
Suren.

>
>
> >
> >
> ...
>
> Thanks,
> Liam
>
> [1]. https://lore.kernel.org/linux-mm/20250227204823.758784-1-sidhartha.kumar@oracle.com/
> [2]. https://www.infradead.org/git/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/sheaves_rebase_20250304


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 01/10] slab: add opt-in caching layer of percpu sheaves
  2025-03-12 15:14       ` Suren Baghdasaryan
@ 2025-03-17 10:09         ` Vlastimil Babka
  0 siblings, 0 replies; 55+ messages in thread
From: Vlastimil Babka @ 2025-03-17 10:09 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On 3/12/25 16:14, Suren Baghdasaryan wrote:
> On Wed, Mar 12, 2025 at 7:58 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>> > I find the program flow in this function quite complex and hard to
>> > follow. I think refactoring the above block starting from "pcs =
>> > this_cpu_ptr(s->cpu_sheaves)" would somewhat simplify it. That
>> > eliminates the need for the "got_empty" label and makes the
>> > locking/unlocking sequence of s->cpu_sheaves->lock a bit more clear.
>>
>> I'm a bit lost, refactoring how exactly?
> 
> I thought moving the code above into a function above starting from
> "pcs = this_cpu_ptr(s->cpu_sheaves)" into its own function would
> simplify the flow. But as I said, it's a nit. If you try and don't
> like that feel free to ignore this suggestion.

OK did it and although I didn't manage to remove the got_empty label, it's
better and I realized I can handle the cases there in a better order and add
one extra possible fallback in the unlikely cases. Please check the result
when I send v3? thanks

>>
>> >> +       }
>> >> +
>> >> +do_free:
>> >> +       pcs->main->objects[pcs->main->size++] = object;
>> >> +
>> >> +       local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
>> >> +
>> >> +       stat(s, FREE_PCS);
>> >> +}



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 00/10] SLUB percpu sheaves
  2025-03-14 17:10                 ` Suren Baghdasaryan
@ 2025-03-17 11:08                   ` Vlastimil Babka
  2025-03-17 18:56                     ` Suren Baghdasaryan
  0 siblings, 1 reply; 55+ messages in thread
From: Vlastimil Babka @ 2025-03-17 11:08 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Kent Overstreet,
	Christoph Lameter, David Rientjes, Roman Gushchin, Hyeonggon Yoo,
	Uladzislau Rezki, linux-mm, linux-kernel, rcu, maple-tree,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Sidhartha Kumar

On 3/14/25 18:10, Suren Baghdasaryan wrote:
> On Tue, Mar 4, 2025 at 11:08 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>>
>> * Vlastimil Babka <vbabka@suse.cz> [250304 05:55]:
>> > On 2/25/25 21:26, Suren Baghdasaryan wrote:
>> > > On Mon, Feb 24, 2025 at 1:12 PM Suren Baghdasaryan <surenb@google.com> wrote:
>> > >>
>> > >> >
>> > >> > > The values represent the total time it took to perform mmap syscalls, less is
>> > >> > > better.
>> > >> > >
>> > >> > > (1)                  baseline       control
>> > >> > > Little core       7.58327       6.614939 (-12.77%)
>> > >> > > Medium core  2.125315     1.428702 (-32.78%)
>> > >> > > Big core          0.514673     0.422948 (-17.82%)
>> > >> > >
>> > >> > > (2)                  baseline      control
>> > >> > > Little core       7.58327       5.141478 (-32.20%)
>> > >> > > Medium core  2.125315     0.427692 (-79.88%)
>> > >> > > Big core          0.514673    0.046642 (-90.94%)
>> > >> > >
>> > >> > > (3)                   baseline      control
>> > >> > > Little core        7.58327      4.779624 (-36.97%)
>> > >> > > Medium core   2.125315    0.450368 (-78.81%)
>> > >> > > Big core           0.514673    0.037776 (-92.66%)
>> > >
>> > > (4)                   baseline      control
>> > > Little core        7.58327      4.642977 (-38.77%)
>> > > Medium core   2.125315    0.373692 (-82.42%)
>> > > Big core           0.514673    0.043613 (-91.53%)
>> > >
>> > > I think the difference between (3) and (4) is noise.
>> > > Thanks,
>> > > Suren.
>> >
>> > Hi, as we discussed yesterday, it would be useful to set the baseline to
>> > include everything before sheaves as that's already on the way to 6.15, so
>> > we can see more clearly what sheaves do relative to that. So at this point
>> > it's the vma lock conversion including TYPESAFE_BY_RCU (that's not undone,
>> > thus like in scenario (4)), and benchmark the following:
>> >
>> > - baseline - vma locking conversion with TYPESAFE_BY_RCU
>> > - baseline+maple tree node reduction from mm-unstable (Liam might point out
>> > which patches?)
>>
>> Sid's patches [1] are already in mm-unstable.
>>
>>
>> > - the above + this series + sheaves enabled for vm_area_struct cache
>> > - the above + full maple node sheaves conversion [1]
>> > - the above + the top-most patches from [1] that are optimizations with a
>> > tradeoff (not clear win-win) so it would be good to know if they are useful
>> >
>> > [1] currently the 4 commits here:
>> > https://web.git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-percpu-sheaves-v2-maple
>> > from "maple_tree: Sheaf conversion" to "maple_tree: Clean up sheaf"
>> > but as Liam noted, they won't cherry pick without conflict once maple tree
>> > node reduction is backported, but he's working on a rebase
>>
>> Rebased maple tree sheaves, patches are here [2].
> 
> Hi Folks,
> Sorry for the delay. I got the numbers last week but they looked a bit
> weird, so I reran the test increasing the number of iterations to make
> sure noise is not a factor. That took most of this week. Below are the
> results. Please note that I had to backport the patchsets to 6.12
> because that's the closest stable Android kernel I can use. I measure
> cumulative time to execute mmap syscalls, so the smaller the number
> the better mmap performance is:

Is that a particular benchmark doing those syscalls, or you time them within
actual workloads?

> baseline: 6.12 + vm_lock conversion and TYPESAFE_BY_RCU
> config1: baseline + Sid's patches [1]
> config2: sheaves RFC
> config3: config1 + vm_area_struct with sheaves
> config4: config2 + maple_tree Sheaf conversion [2]
> config5: config3 + 2 last optimization patches from [3]
> 
>                config1     config2     config3     config4     config5
> Little core    -0.10%      -10.10%     -12.89%     -10.02%     -13.64%
> Mid core       -21.05%     -37.31%     -44.97%     -15.81%     -22.15%
> Big core       -17.17%     -34.41%     -45.68%     -11.39%     -15.29%

Thanks a lot, Suren.

> [1] https://lore.kernel.org/linux-mm/20250227204823.758784-1-sidhartha.kumar@oracle.com/
> [2] https://www.infradead.org/git/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/sheaves_rebase_20250304
> [3] https://web.git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-percpu-sheaves-v2-maple
> 
> From the numbers, it looks like config4 regresses the performance and
> that's what looked weird to me last week and I wanted to confirm this.
> But from sheaves POV, it looks like they provide the benefits I saw
> before. Sid's patches which I did not test separately before also look
> beneficial.

Indeed, good job, Sid. It's weird that config4 isn't doing well. The problem
can be either in sheaves side (the sheaves preallocation isn't effective) or
maple tree side doing some excessive work. It could be caused by the wrong
condition in kmem_cache_return_sheaf() that Harry pointed out, so v3 might
improve if that was it. Otherwise we'll probably need to fill the gaps in
sheaf-related stats and see what are the differences between config3 and
config4.

> Thanks,
> Suren.
> 
>>
>>
>> >
>> >
>> ...
>>
>> Thanks,
>> Liam
>>
>> [1]. https://lore.kernel.org/linux-mm/20250227204823.758784-1-sidhartha.kumar@oracle.com/
>> [2]. https://www.infradead.org/git/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/sheaves_rebase_20250304



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH RFC v2 00/10] SLUB percpu sheaves
  2025-03-17 11:08                   ` Vlastimil Babka
@ 2025-03-17 18:56                     ` Suren Baghdasaryan
  0 siblings, 0 replies; 55+ messages in thread
From: Suren Baghdasaryan @ 2025-03-17 18:56 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Kent Overstreet, Christoph Lameter,
	David Rientjes, Roman Gushchin, Hyeonggon Yoo, Uladzislau Rezki,
	linux-mm, linux-kernel, rcu, maple-tree,
	Sebastian Andrzej Siewior, Alexei Starovoitov, Sidhartha Kumar

On Mon, Mar 17, 2025 at 4:08 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 3/14/25 18:10, Suren Baghdasaryan wrote:
> > On Tue, Mar 4, 2025 at 11:08 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >>
> >> * Vlastimil Babka <vbabka@suse.cz> [250304 05:55]:
> >> > On 2/25/25 21:26, Suren Baghdasaryan wrote:
> >> > > On Mon, Feb 24, 2025 at 1:12 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >> > >>
> >> > >> >
> >> > >> > > The values represent the total time it took to perform mmap syscalls, less is
> >> > >> > > better.
> >> > >> > >
> >> > >> > > (1)                  baseline       control
> >> > >> > > Little core       7.58327       6.614939 (-12.77%)
> >> > >> > > Medium core  2.125315     1.428702 (-32.78%)
> >> > >> > > Big core          0.514673     0.422948 (-17.82%)
> >> > >> > >
> >> > >> > > (2)                  baseline      control
> >> > >> > > Little core       7.58327       5.141478 (-32.20%)
> >> > >> > > Medium core  2.125315     0.427692 (-79.88%)
> >> > >> > > Big core          0.514673    0.046642 (-90.94%)
> >> > >> > >
> >> > >> > > (3)                   baseline      control
> >> > >> > > Little core        7.58327      4.779624 (-36.97%)
> >> > >> > > Medium core   2.125315    0.450368 (-78.81%)
> >> > >> > > Big core           0.514673    0.037776 (-92.66%)
> >> > >
> >> > > (4)                   baseline      control
> >> > > Little core        7.58327      4.642977 (-38.77%)
> >> > > Medium core   2.125315    0.373692 (-82.42%)
> >> > > Big core           0.514673    0.043613 (-91.53%)
> >> > >
> >> > > I think the difference between (3) and (4) is noise.
> >> > > Thanks,
> >> > > Suren.
> >> >
> >> > Hi, as we discussed yesterday, it would be useful to set the baseline to
> >> > include everything before sheaves as that's already on the way to 6.15, so
> >> > we can see more clearly what sheaves do relative to that. So at this point
> >> > it's the vma lock conversion including TYPESAFE_BY_RCU (that's not undone,
> >> > thus like in scenario (4)), and benchmark the following:
> >> >
> >> > - baseline - vma locking conversion with TYPESAFE_BY_RCU
> >> > - baseline+maple tree node reduction from mm-unstable (Liam might point out
> >> > which patches?)
> >>
> >> Sid's patches [1] are already in mm-unstable.
> >>
> >>
> >> > - the above + this series + sheaves enabled for vm_area_struct cache
> >> > - the above + full maple node sheaves conversion [1]
> >> > - the above + the top-most patches from [1] that are optimizations with a
> >> > tradeoff (not clear win-win) so it would be good to know if they are useful
> >> >
> >> > [1] currently the 4 commits here:
> >> > https://web.git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-percpu-sheaves-v2-maple
> >> > from "maple_tree: Sheaf conversion" to "maple_tree: Clean up sheaf"
> >> > but as Liam noted, they won't cherry pick without conflict once maple tree
> >> > node reduction is backported, but he's working on a rebase
> >>
> >> Rebased maple tree sheaves, patches are here [2].
> >
> > Hi Folks,
> > Sorry for the delay. I got the numbers last week but they looked a bit
> > weird, so I reran the test increasing the number of iterations to make
> > sure noise is not a factor. That took most of this week. Below are the
> > results. Please note that I had to backport the patchsets to 6.12
> > because that's the closest stable Android kernel I can use. I measure
> > cumulative time to execute mmap syscalls, so the smaller the number
> > the better mmap performance is:
>
> Is that a particular benchmark doing those syscalls, or you time them within
> actual workloads?

I time them inside my workload.

>
> > baseline: 6.12 + vm_lock conversion and TYPESAFE_BY_RCU
> > config1: baseline + Sid's patches [1]
> > config2: sheaves RFC
> > config3: config1 + vm_area_struct with sheaves
> > config4: config2 + maple_tree Sheaf conversion [2]
> > config5: config3 + 2 last optimization patches from [3]
> >
> >                config1     config2     config3     config4     config5
> > Little core    -0.10%      -10.10%     -12.89%     -10.02%     -13.64%
> > Mid core       -21.05%     -37.31%     -44.97%     -15.81%     -22.15%
> > Big core       -17.17%     -34.41%     -45.68%     -11.39%     -15.29%
>
> Thanks a lot, Suren.
>
> > [1] https://lore.kernel.org/linux-mm/20250227204823.758784-1-sidhartha.kumar@oracle.com/
> > [2] https://www.infradead.org/git/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/sheaves_rebase_20250304
> > [3] https://web.git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-percpu-sheaves-v2-maple
> >
> > From the numbers, it looks like config4 regresses the performance and
> > that's what looked weird to me last week and I wanted to confirm this.
> > But from sheaves POV, it looks like they provide the benefits I saw
> > before. Sid's patches which I did not test separately before also look
> > beneficial.
>
> Indeed, good job, Sid. It's weird that config4 isn't doing well. The problem
> can be either in sheaves side (the sheaves preallocation isn't effective) or
> maple tree side doing some excessive work. It could be caused by the wrong
> condition in kmem_cache_return_sheaf() that Harry pointed out, so v3 might
> improve if that was it. Otherwise we'll probably need to fill the gaps in
> sheaf-related stats and see what are the differences between config3 and
> config4.
>
> > Thanks,
> > Suren.
> >
> >>
> >>
> >> >
> >> >
> >> ...
> >>
> >> Thanks,
> >> Liam
> >>
> >> [1]. https://lore.kernel.org/linux-mm/20250227204823.758784-1-sidhartha.kumar@oracle.com/
> >> [2]. https://www.infradead.org/git/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/sheaves_rebase_20250304
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2025-03-17 18:56 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-14 16:27 [PATCH RFC v2 00/10] SLUB percpu sheaves Vlastimil Babka
2025-02-14 16:27 ` [PATCH RFC v2 01/10] slab: add opt-in caching layer of " Vlastimil Babka
2025-02-22 22:46   ` Suren Baghdasaryan
2025-02-22 22:56     ` Suren Baghdasaryan
2025-03-12 14:57     ` Vlastimil Babka
2025-03-12 15:14       ` Suren Baghdasaryan
2025-03-17 10:09         ` Vlastimil Babka
2025-02-24  8:04   ` Harry Yoo
2025-03-12 14:59     ` Vlastimil Babka
2025-02-14 16:27 ` [PATCH RFC v2 02/10] slab: add sheaf support for batching kfree_rcu() operations Vlastimil Babka
2025-02-22 23:08   ` Suren Baghdasaryan
2025-03-12 16:19     ` Vlastimil Babka
2025-02-24  8:40   ` Harry Yoo
2025-03-12 16:16     ` Vlastimil Babka
2025-02-14 16:27 ` [PATCH RFC v2 03/10] locking/local_lock: Introduce localtry_lock_t Vlastimil Babka
2025-02-17 14:19   ` Sebastian Andrzej Siewior
2025-02-17 14:35     ` Vlastimil Babka
2025-02-17 15:07       ` Sebastian Andrzej Siewior
2025-02-18 18:41       ` Alexei Starovoitov
2025-02-26 17:00   ` Davidlohr Bueso
2025-02-26 17:15     ` Alexei Starovoitov
2025-02-26 19:28       ` Davidlohr Bueso
2025-02-14 16:27 ` [PATCH RFC v2 04/10] locking/local_lock: add localtry_trylock() Vlastimil Babka
2025-02-14 16:27 ` [PATCH RFC v2 05/10] slab: switch percpu sheaves locking to localtry_lock Vlastimil Babka
2025-02-23  2:33   ` Suren Baghdasaryan
2025-02-24 13:08   ` Harry Yoo
2025-02-14 16:27 ` [PATCH RFC v2 06/10] slab: sheaf prefilling for guaranteed allocations Vlastimil Babka
2025-02-23  3:54   ` Suren Baghdasaryan
2025-02-25  7:30     ` Harry Yoo
2025-03-12 17:09     ` Vlastimil Babka
2025-02-25  8:00   ` Harry Yoo
2025-03-12 18:16     ` Vlastimil Babka
2025-02-14 16:27 ` [PATCH RFC v2 07/10] slab: determine barn status racily outside of lock Vlastimil Babka
2025-02-23  4:00   ` Suren Baghdasaryan
2025-02-25  8:54   ` Harry Yoo
2025-03-12 18:23     ` Vlastimil Babka
2025-02-14 16:27 ` [PATCH RFC v2 08/10] tools: Add testing support for changes to rcu and slab for sheaves Vlastimil Babka
2025-02-23  4:24   ` Suren Baghdasaryan
2025-02-14 16:27 ` [PATCH RFC v2 09/10] tools: Add sheafs support to testing infrastructure Vlastimil Babka
2025-02-14 16:27 ` [PATCH RFC v2 10/10] maple_tree: use percpu sheaves for maple_node_cache Vlastimil Babka
2025-02-23  4:27   ` Suren Baghdasaryan
2025-02-14 18:28 ` [PATCH RFC v2 00/10] SLUB percpu sheaves Christoph Lameter (Ampere)
2025-02-23  0:19 ` Kent Overstreet
2025-02-23  4:44   ` Suren Baghdasaryan
2025-02-24  1:36     ` Suren Baghdasaryan
2025-02-24  1:43       ` Suren Baghdasaryan
2025-02-24 20:53       ` Vlastimil Babka
2025-02-24 21:12         ` Suren Baghdasaryan
2025-02-25 20:26           ` Suren Baghdasaryan
2025-03-04 10:54             ` Vlastimil Babka
2025-03-04 18:35               ` Suren Baghdasaryan
2025-03-04 19:08               ` Liam R. Howlett
2025-03-14 17:10                 ` Suren Baghdasaryan
2025-03-17 11:08                   ` Vlastimil Babka
2025-03-17 18:56                     ` Suren Baghdasaryan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox