linux-mm.kvack.org archive mirror
* [PATCH RFC 0/6] SLUB percpu sheaves
@ 2024-11-12 16:38 Vlastimil Babka
  2024-11-12 16:38 ` [PATCH RFC 1/6] mm/slub: add opt-in caching layer of " Vlastimil Babka
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Vlastimil Babka @ 2024-11-12 16:38 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Pekka Enberg, Joonsoo Kim
  Cc: Roman Gushchin, Hyeonggon Yoo, Paul E. McKenney, Lorenzo Stoakes,
	Matthew Wilcox, Boqun Feng, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, Vlastimil Babka, Mateusz Guzik,
	Jann Horn

Hi,

This is an RFC to add an opt-in percpu array-based caching layer to SLUB.
The name "sheaf" was invented by Matthew so we don't call it a magazine
like the original Bonwick paper does. The per-NUMA-node cache of sheaves
is thus called a "barn".

This may seem similar to the arrays in SLAB, but the main differences
are:

- opt-in, not used for every cache
- does not distinguish NUMA locality, thus no "alien" arrays that would
  need periodical flushing
- improves kfree_rcu() handling
- API for obtaining a preallocated sheaf that can be used for guaranteed
  and efficient allocations in a restricted context, when upper bound is
  known but rarely reached

The motivation comes mainly from the ongoing work related to VMA
scalability and the related maple tree operations. This is why maple
tree node and vma caches are sheaf-enabled in the RFC. Performance benefits
were measured by Suren in preliminary non-public versions.

A sheaf-enabled cache has the following expected advantages:

- Cheaper fast paths. For allocations, instead of a local double cmpxchg,
  with Patch 5 it's preempt_disable() and no atomic operations. Same for
  freeing, which is normally a local double cmpxchg only for short-term
  allocations (so the same slab is still active on the same cpu when
  freeing the object) and a more costly locked double cmpxchg otherwise.
  The downside is the lack of NUMA locality guarantees for the allocated
  objects.

  I hope this scheme will also allow (non-guaranteed) slab allocations
  in contexts where it's impossible today and is instead achieved by
  building caches on top of slab, i.e. the BPF allocator.

- kfree_rcu() batching. kfree_rcu() will put objects into a separate
  percpu sheaf and only submit the whole sheaf to call_rcu() when full.
  After the grace period, the sheaf can be used for allocations, which
  is more efficient than handling individual slab objects (even with the
  batching done by the kfree_rcu() implementation itself). In case only
  some cpus are allowed to handle rcu callbacks, the sheaf can still be
  made available to other cpus on the same node via the shared barn.
  Both maple_node and vma caches can benefit from this.
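The batching above can be sketched as a userspace toy model (not the kernel code; SHEAF_CAP and the function names here are illustrative stand-ins): a freed object is only stashed, and a whole sheaf is handed over at once when it fills up.

```c
#include <assert.h>
#include <stddef.h>

#define SHEAF_CAP 4 /* illustrative capacity, not the kernel's value */

struct sheaf {
	size_t size;
	void *objects[SHEAF_CAP];
};

static int rcu_submissions; /* counts whole-sheaf "call_rcu" submissions */

/* One call_rcu() per full sheaf instead of per-object callbacks */
static void submit_sheaf(struct sheaf *s)
{
	rcu_submissions++;
	s->size = 0; /* after the grace period the sheaf serves allocations */
}

/* Model of kfree_rcu() batching: stash the object, submit only when full */
static void model_kfree_rcu(struct sheaf *s, void *obj)
{
	s->objects[s->size++] = obj;
	if (s->size == SHEAF_CAP)
		submit_sheaf(s);
}
```

Freeing 8 objects with a capacity of 4 results in just two batched submissions instead of 8 individual callbacks.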

- Preallocation support. A prefilled sheaf can be borrowed for a short
  term operation that is not allowed to block and may need to allocate
  some objects. If an upper bound (worst case) for the number of
  allocations is known, but typically much fewer allocations are actually
  needed, borrowing and returning a sheaf is much more efficient than
  a bulk allocation for the worst case followed by a bulk free of the
  many unused objects. Maple tree write operations should benefit from
  this.
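As a rough userspace sketch of that pattern (the names borrow_prefilled(), sheaf_take() and return_sheaf() are hypothetical, not the API added in Patch 6, and slab objects are modeled with plain malloc()):

```c
#include <assert.h>
#include <stdlib.h>

#define CAPACITY 8 /* hypothetical sheaf_capacity, i.e. the worst-case bound */

struct sheaf {
	unsigned int size;
	void *objects[CAPACITY];
};

/* Prefill up to the worst-case bound before entering the restricted
 * (non-blocking) context; modeled here with plain malloc(). */
static struct sheaf *borrow_prefilled(unsigned int count)
{
	struct sheaf *s = malloc(sizeof(*s));

	for (s->size = 0; s->size < count;)
		s->objects[s->size++] = malloc(16);
	return s;
}

/* In the restricted context: guaranteed, allocation-free object grab */
static void *sheaf_take(struct sheaf *s)
{
	return s->size ? s->objects[--s->size] : NULL;
}

/* Returning the sheaf keeps the unused objects cached; report how many
 * avoided the bulk free that a worst-case bulk alloc would require. */
static unsigned int return_sheaf(struct sheaf *s)
{
	return s->size;
}
```

If only 2 of the worst-case 8 objects are consumed, the remaining 6 stay cached in the returned sheaf rather than being bulk freed.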

Patch 1 implements the basic sheaf functionality, using
local_lock_irqsave() for percpu sheaf locking.

Patch 2 adds the kfree_rcu() support.

Patches 3 and 4 enable sheaves for maple tree nodes and vma's.

Patch 5 replaces the local_lock_irqsave() locking with a cheaper scheme
inspired by online conversations with Mateusz Guzik and Jann Horn. In
the past I have tried to copy the scheme from the page allocator's
pcplists, which also avoids disabling irqs by using a trylock for
operations that might be attempted from an irq handler context. But the
spin locks used for pcplists are more costly than a simple flag with
only compiler barriers. On the other hand, it's not possible to take the
lock from a different cpu (except during hotplug handling, when the
actual local cpu cannot race with us), but we don't need such remote
locking for sheaves.
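A minimal userspace sketch of that idea, under the assumption of a single-cpu model where the only contender is an irq arriving on the same cpu, so a plain flag plus a compiler barrier suffices (this is not the actual Patch 5 code):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for a percpu "sheaves in use" flag; no atomics needed because
 * only the local cpu (or an irq interrupting it) can observe the flag. */
static volatile bool pcs_locked;

static bool local_trylock(void)
{
	if (pcs_locked)
		return false; /* e.g. irq hit a critical section; take slowpath */
	pcs_locked = true;
	__asm__ volatile("" ::: "memory"); /* compiler barrier only */
	return true;
}

static void local_unlock(void)
{
	__asm__ volatile("" ::: "memory");
	pcs_locked = false;
}
```

A nested trylock, as an irq handler would attempt, simply fails and falls back to the slower path instead of spinning or disabling irqs.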

Patch 6 implements borrowing of a prefilled sheaf, with maple tree being
the anticipated user once converted to use it by someone more
knowledgeable than myself.

(RFC) LIMITATIONS:

- with slub_debug enabled, objects in sheaves are considered allocated
  so allocation/free stacktraces may become imprecise and checking of
  e.g. redzone violations may be delayed

- kfree_rcu() via sheaf is only hooked to tree rcu, not tiny rcu. Also,
  in case we fail to allocate a sheaf and fall back to the existing
  implementation, it may use kfree_bulk(), where destructors are not
  hooked. It's however possible we won't need the destructor support
  for now at all if vma_lock is moved to vma itself [1] and if it's
  possible to free anon_name and numa balancing tracking immediately
  and not after a grace period.

- in case a prefilled sheaf is requested with more objects than the
  cache's sheaf_capacity, it will fail. This should be possible to
  handle by allocating a bigger sheaf and then freeing it when returned,
  to avoid mixing up different sizes. Inefficient, but acceptable if
  very rare.

[1] https://lore.kernel.org/all/20241111205506.3404479-1-surenb@google.com/

Vlastimil

git branch: https://git.kernel.org/vbabka/l/slub-percpu-sheaves-v1r5

---
Vlastimil Babka (6):
      mm/slub: add opt-in caching layer of percpu sheaves
      mm/slub: add sheaf support for batching kfree_rcu() operations
      maple_tree: use percpu sheaves for maple_node_cache
      mm, vma: use sheaves for vm_area_struct cache
      mm, slub: cheaper locking for percpu sheaves
      mm, slub: sheaf prefilling for guaranteed allocations

 include/linux/slab.h |   60 +++
 kernel/fork.c        |   27 +-
 kernel/rcu/tree.c    |    8 +-
 lib/maple_tree.c     |   11 +-
 mm/slab.h            |   27 +
 mm/slab_common.c     |    8 +-
 mm/slub.c            | 1427 ++++++++++++++++++++++++++++++++++++++++++++++++--
 7 files changed, 1503 insertions(+), 65 deletions(-)
---
base-commit: 2d5404caa8c7bb5c4e0435f94b28834ae5456623
change-id: 20231128-slub-percpu-caches-9441892011d7

Best regards,
-- 
Vlastimil Babka <vbabka@suse.cz>




* [PATCH RFC 1/6] mm/slub: add opt-in caching layer of percpu sheaves
  2024-11-12 16:38 [PATCH RFC 0/6] SLUB percpu sheaves Vlastimil Babka
@ 2024-11-12 16:38 ` Vlastimil Babka
  2024-11-12 16:38 ` [PATCH RFC 2/6] mm/slub: add sheaf support for batching kfree_rcu() operations Vlastimil Babka
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Vlastimil Babka @ 2024-11-12 16:38 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Pekka Enberg, Joonsoo Kim
  Cc: Roman Gushchin, Hyeonggon Yoo, Paul E. McKenney, Lorenzo Stoakes,
	Matthew Wilcox, Boqun Feng, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, Vlastimil Babka

Specifying a non-zero value for the new struct kmem_cache_args field
sheaf_capacity will set up a caching layer of percpu arrays called
sheaves of the given capacity for the created cache.

Allocations from the cache will go through the percpu sheaves (main or
spare) as long as they have no NUMA node preference. Frees will also
refill the sheaves.

When both percpu sheaves are found empty during an allocation, an empty
sheaf may be replaced with a full one from the per-node barn. If none
are available and the allocation is allowed to block, an empty sheaf is
refilled from slab(s) by an internal bulk alloc operation. When both
percpu sheaves are full during freeing, the barn can replace a full one
with an empty one, unless it is over the limit of full sheaves. In that
case a sheaf is flushed to slab(s) by an internal bulk free operation.
Flushing sheaves and barns is also wired to the existing cpu flushing
and cache shrinking operations.
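The free-side decisions described above can be condensed into a userspace toy model (CAP, MAX_FULL and the counters are illustrative stand-ins for sheaf_capacity and the barn limits, not the patch's actual data structures):

```c
#include <assert.h>

#define CAP		4 /* stand-in for sheaf_capacity */
#define MAX_FULL	2 /* stand-in for the barn's full-sheaves limit */

enum action { USE_MAIN, SWAP_SPARE, BARN_REPLACE, FLUSH };

struct pcs { unsigned int main_size, spare_size; };
struct barn { unsigned int nr_full, nr_empty; };

/* On free with a full main sheaf: swap with a non-full spare, else trade
 * the full sheaf to the barn for an empty one, else bulk free to slabs. */
static enum action free_decision(struct pcs *p, struct barn *b)
{
	unsigned int tmp;

	if (p->main_size < CAP)
		return USE_MAIN;
	if (p->spare_size < CAP) {
		tmp = p->main_size;
		p->main_size = p->spare_size;
		p->spare_size = tmp;
		return SWAP_SPARE;
	}
	if (b->nr_full < MAX_FULL && b->nr_empty > 0) {
		b->nr_full++;
		b->nr_empty--;
		p->main_size = 0;
		return BARN_REPLACE;
	}
	p->main_size = 0; /* flush: bulk free the objects to slab pages */
	return FLUSH;
}
```

The flush to slabs is thus the last resort, taken only once both percpu sheaves are full and the barn can no longer accept a full sheaf or supply an empty one.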

The sheaves do not distinguish NUMA locality of the cached objects. If
an allocation is requested with kmem_cache_alloc_node() with a specific
node (not NUMA_NO_NODE), sheaves are bypassed.

The bulk operations exposed to slab users also try to utilize the
sheaves as long as the necessary (full or empty) sheaves are available
on the cpu or in the barn. Once depleted, they will fall back to bulk
alloc/free to slabs directly to avoid double copying.

Sysfs stat counters alloc_cpu_sheaf and free_cpu_sheaf count objects
allocated or freed using the sheaves. Counters sheaf_refill,
sheaf_flush_main and sheaf_flush_other count objects filled or flushed
from or to slab pages, and can be used to assess how effective the
caching is. The refill and flush operations will also count towards the
usual alloc_fastpath/slowpath, free_fastpath/slowpath and other
counters.

Access to the percpu sheaves is protected by local_lock_irqsave()
operations; each per-NUMA-node barn has a spin_lock.

A current limitation is that when slub_debug is enabled for a cache with
percpu sheaves, the objects in the array are considered as allocated from
the slub_debug perspective, and the alloc/free debugging hooks occur
when moving the objects between the array and slab pages. This means
that e.g. a use-after-free that occurs for an object cached in the
array goes undetected. Collected alloc/free stacktraces might also be
less useful. This limitation could be changed in the future.

On the other hand, KASAN, kmemcg and other hooks are executed on actual
allocations and frees by kmem_cache users even if those use the array,
so their debugging or accounting accuracy should be unaffected.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/slab.h |  34 ++
 mm/slab.h            |   2 +
 mm/slab_common.c     |   5 +-
 mm/slub.c            | 982 ++++++++++++++++++++++++++++++++++++++++++++++++---
 4 files changed, 973 insertions(+), 50 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index b35e2db7eb0ecc4933126f56b2c3dbf369cbb44b..b13fb1c1f03c14a5b45bc6a64a2096883aef9f83 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -309,6 +309,40 @@ struct kmem_cache_args {
 	 * %NULL means no constructor.
 	 */
 	void (*ctor)(void *);
+	/**
+	 * @sheaf_capacity: Enable sheaves of given capacity for the cache.
+	 *
+	 * With a non-zero value, allocations from the cache go through caching
+	 * arrays called sheaves. Each cpu has a main sheaf that's always
+	 * present, and a spare sheaf that may not be present. When both become
+	 * empty, there's an attempt to replace an empty sheaf with a full sheaf
+	 * from the per-node barn.
+	 *
+	 * When no full sheaf is available, and gfp flags allow blocking, a
+	 * sheaf is allocated and filled from slab(s) using bulk allocation.
+	 * Otherwise the allocation falls back to the normal operation
+	 * allocating a single object from a slab.
+	 *
+	 * Analogously, when freeing and both percpu sheaves are full, the barn
+	 * may replace it with an empty sheaf, unless it's over capacity. In
+	 * that case a sheaf is bulk freed to slab pages.
+	 *
+	 * The sheaves do not distinguish NUMA placement of objects, so
+	 * allocations via kmem_cache_alloc_node() with a node specified other
+	 * than NUMA_NO_NODE will bypass them.
+	 *
+	 * Bulk allocation and free operations also try to use the cpu sheaves
+	 * and barn, but fallback to using slab pages directly.
+	 *
+	 * Limitations: when slub_debug is enabled for the cache, all relevant
+	 * actions (i.e. poisoning, obtaining stacktraces) and checks happen
+	 * when objects move between sheaves and slab pages, which may result in
+	 * e.g. not detecting a use-after-free while the object is in the array
+	 * cache, and the stacktraces may be less useful.
+	 *
+	 * %0 means no sheaves will be created
+	 */
+	unsigned int sheaf_capacity;
 };
 
 struct kmem_cache *__kmem_cache_create_args(const char *name,
diff --git a/mm/slab.h b/mm/slab.h
index 6c6fe6d630ce3d919c29bafd15b401324618da1a..001e0d55467bb4803b5dff718ba8e0c775f42b3f 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -254,6 +254,7 @@ struct kmem_cache {
 #ifndef CONFIG_SLUB_TINY
 	struct kmem_cache_cpu __percpu *cpu_slab;
 #endif
+	struct slub_percpu_sheaves __percpu *cpu_sheaves;
 	/* Used for retrieving partial slabs, etc. */
 	slab_flags_t flags;
 	unsigned long min_partial;
@@ -267,6 +268,7 @@ struct kmem_cache {
 	/* Number of per cpu partial slabs to keep around */
 	unsigned int cpu_partial_slabs;
 #endif
+	unsigned int sheaf_capacity;
 	struct kmem_cache_order_objects oo;
 
 	/* Allocation and freeing of slabs */
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 893d320599151845973b4eee9c7accc0d807aa72..7939f3f017740e0ac49ffa971c45409d0fbe2f23 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -161,6 +161,9 @@ int slab_unmergeable(struct kmem_cache *s)
 		return 1;
 #endif
 
+	if (s->cpu_sheaves)
+		return 1;
+
 	/*
 	 * We may have set a slab to be unmergeable during bootstrap.
 	 */
@@ -317,7 +320,7 @@ struct kmem_cache *__kmem_cache_create_args(const char *name,
 		    object_size - args->usersize < args->useroffset))
 		args->usersize = args->useroffset = 0;
 
-	if (!args->usersize)
+	if (!args->usersize && !args->sheaf_capacity)
 		s = __kmem_cache_alias(name, object_size, args->align, flags,
 				       args->ctor);
 	if (s)
diff --git a/mm/slub.c b/mm/slub.c
index 5b832512044e3ead8ccde2c02308bd8954246db5..7da08112213b203993b5159eb35a1ea640d706fe 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -347,8 +347,10 @@ static inline void debugfs_slab_add(struct kmem_cache *s) { }
 #endif
 
 enum stat_item {
+	ALLOC_PCS,		/* Allocation from percpu sheaf */
 	ALLOC_FASTPATH,		/* Allocation from cpu slab */
 	ALLOC_SLOWPATH,		/* Allocation by getting a new cpu slab */
+	FREE_PCS,		/* Free to percpu sheaf */
 	FREE_FASTPATH,		/* Free to cpu slab */
 	FREE_SLOWPATH,		/* Freeing not to cpu slab */
 	FREE_FROZEN,		/* Freeing to frozen slab */
@@ -373,6 +375,12 @@ enum stat_item {
 	CPU_PARTIAL_FREE,	/* Refill cpu partial on free */
 	CPU_PARTIAL_NODE,	/* Refill cpu partial from node partial */
 	CPU_PARTIAL_DRAIN,	/* Drain cpu partial to node partial */
+	SHEAF_FLUSH_MAIN,	/* Objects flushed from main percpu sheaf */
+	SHEAF_FLUSH_OTHER,	/* Objects flushed from other sheaves */
+	SHEAF_REFILL,		/* Objects refilled to a sheaf */
+	SHEAF_SWAP,		/* Swapping main and spare sheaf */
+	SHEAF_ALLOC,		/* Allocation of an empty sheaf */
+	SHEAF_FREE,		/* Freeing of an empty sheaf */
 	NR_SLUB_STAT_ITEMS
 };
 
@@ -419,6 +427,35 @@ void stat_add(const struct kmem_cache *s, enum stat_item si, int v)
 #endif
 }
 
+#define MAX_FULL_SHEAVES	10
+#define MAX_EMPTY_SHEAVES	10
+
+struct node_barn {
+	spinlock_t lock;
+	struct list_head sheaves_full;
+	struct list_head sheaves_empty;
+	unsigned int nr_full;
+	unsigned int nr_empty;
+};
+
+struct slab_sheaf {
+	union {
+		struct rcu_head rcu_head;
+		struct list_head barn_list;
+	};
+	struct kmem_cache *cache;
+	unsigned int size;
+	void *objects[];
+};
+
+struct slub_percpu_sheaves {
+	local_lock_t lock;
+	struct slab_sheaf *main; /* never NULL when unlocked */
+	struct slab_sheaf *spare; /* empty or full, may be NULL */
+	struct slab_sheaf *rcu_free;
+	struct node_barn *barn;
+};
+
 /*
  * The slab lists for all objects.
  */
@@ -431,6 +468,7 @@ struct kmem_cache_node {
 	atomic_long_t total_objects;
 	struct list_head full;
 #endif
+	struct node_barn *barn;
 };
 
 static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
@@ -454,12 +492,19 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
  */
 static nodemask_t slab_nodes;
 
-#ifndef CONFIG_SLUB_TINY
 /*
  * Workqueue used for flush_cpu_slab().
  */
 static struct workqueue_struct *flushwq;
-#endif
+
+struct slub_flush_work {
+	struct work_struct work;
+	struct kmem_cache *s;
+	bool skip;
+};
+
+static DEFINE_MUTEX(flush_lock);
+static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
 
 /********************************************************************
  * 			Core slab cache functions
@@ -2398,6 +2443,349 @@ static void *setup_object(struct kmem_cache *s, void *object)
 	return object;
 }
 
+static struct slab_sheaf *alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp)
+{
+	struct slab_sheaf *sheaf = kzalloc(struct_size(sheaf, objects,
+					s->sheaf_capacity), gfp);
+
+	if (unlikely(!sheaf))
+		return NULL;
+
+	sheaf->cache = s;
+
+	stat(s, SHEAF_ALLOC);
+
+	return sheaf;
+}
+
+static void free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf)
+{
+	kfree(sheaf);
+
+	stat(s, SHEAF_FREE);
+}
+
+static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
+				   size_t size, void **p);
+
+
+static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
+			 gfp_t gfp)
+{
+	int to_fill = s->sheaf_capacity - sheaf->size;
+	int filled;
+
+	if (!to_fill)
+		return 0;
+
+	filled = __kmem_cache_alloc_bulk(s, gfp, to_fill,
+					 &sheaf->objects[sheaf->size]);
+
+	if (!filled)
+		return -ENOMEM;
+
+	sheaf->size = s->sheaf_capacity;
+
+	stat_add(s, SHEAF_REFILL, filled);
+
+	return 0;
+}
+
+
+static struct slab_sheaf *alloc_full_sheaf(struct kmem_cache *s, gfp_t gfp)
+{
+	struct slab_sheaf *sheaf = alloc_empty_sheaf(s, gfp);
+
+	if (!sheaf)
+		return NULL;
+
+	if (refill_sheaf(s, sheaf, gfp)) {
+		free_empty_sheaf(s, sheaf);
+		return NULL;
+	}
+
+	return sheaf;
+}
+
+/*
+ * Maximum number of objects freed during a single flush of main pcs sheaf.
+ * Translates directly to an on-stack array size.
+ */
+#define PCS_BATCH_MAX	32U
+
+static void __kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);
+
+static void sheaf_flush_main(struct kmem_cache *s)
+{
+	struct slub_percpu_sheaves *pcs;
+	unsigned int batch, remaining;
+	void *objects[PCS_BATCH_MAX];
+	struct slab_sheaf *sheaf;
+	unsigned long flags;
+
+next_batch:
+	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+	sheaf = pcs->main;
+
+	batch = min(PCS_BATCH_MAX, sheaf->size);
+
+	sheaf->size -= batch;
+	memcpy(objects, sheaf->objects + sheaf->size, batch * sizeof(void *));
+
+	remaining = sheaf->size;
+
+	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+
+	__kmem_cache_free_bulk(s, batch, &objects[0]);
+
+	stat_add(s, SHEAF_FLUSH_MAIN, batch);
+
+	if (remaining)
+		goto next_batch;
+}
+
+static void sheaf_flush(struct kmem_cache *s, struct slab_sheaf *sheaf)
+{
+	if (!sheaf->size)
+		return;
+
+	stat_add(s, SHEAF_FLUSH_OTHER, sheaf->size);
+
+	__kmem_cache_free_bulk(s, sheaf->size, &sheaf->objects[0]);
+
+	sheaf->size = 0;
+}
+
+/*
+ * Caller needs to make sure migration is disabled in order to fully flush
+ * a single cpu's sheaves
+ *
+ * Flushing operations are rare so let's keep it simple and flush to slabs
+ * directly, skipping the barn
+ */
+static void pcs_flush_all(struct kmem_cache *s)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct slab_sheaf *spare, *rcu_free;
+	unsigned long flags;
+
+	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	spare = pcs->spare;
+	pcs->spare = NULL;
+
+	rcu_free = pcs->rcu_free;
+	pcs->rcu_free = NULL;
+
+	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+
+	if (spare) {
+		sheaf_flush(s, spare);
+		free_empty_sheaf(s, spare);
+	}
+
+	// TODO: handle rcu_free
+	BUG_ON(rcu_free);
+
+	sheaf_flush_main(s);
+}
+
+static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
+{
+	struct slub_percpu_sheaves *pcs;
+
+	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
+
+	if (pcs->spare) {
+		sheaf_flush(s, pcs->spare);
+		free_empty_sheaf(s, pcs->spare);
+		pcs->spare = NULL;
+	}
+
+	// TODO: handle rcu_free
+	BUG_ON(pcs->rcu_free);
+
+	sheaf_flush_main(s);
+}
+
+static void pcs_destroy(struct kmem_cache *s)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct slub_percpu_sheaves *pcs;
+
+		pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
+
+		/* can happen when unwinding failed create */
+		if (!pcs->main)
+			continue;
+
+		WARN_ON(pcs->spare);
+		WARN_ON(pcs->rcu_free);
+
+		if (!WARN_ON(pcs->main->size)) {
+			free_empty_sheaf(s, pcs->main);
+			pcs->main = NULL;
+		}
+	}
+
+	free_percpu(s->cpu_sheaves);
+	s->cpu_sheaves = NULL;
+}
+
+static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn)
+{
+	struct slab_sheaf *empty = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	if (barn->nr_empty) {
+		empty = list_first_entry(&barn->sheaves_empty,
+					 struct slab_sheaf, barn_list);
+		list_del(&empty->barn_list);
+		barn->nr_empty--;
+	}
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+
+	return empty;
+}
+
+static int barn_put_empty_sheaf(struct node_barn *barn,
+				struct slab_sheaf *sheaf, bool ignore_limit)
+{
+	unsigned long flags;
+	int ret = 0;
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	if (!ignore_limit && barn->nr_empty >= MAX_EMPTY_SHEAVES) {
+		ret = -E2BIG;
+	} else {
+		list_add(&sheaf->barn_list, &barn->sheaves_empty);
+		barn->nr_empty++;
+	}
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+	return ret;
+}
+
+static int barn_put_full_sheaf(struct node_barn *barn, struct slab_sheaf *sheaf,
+			       bool ignore_limit)
+{
+	unsigned long flags;
+	int ret = 0;
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	if (!ignore_limit && barn->nr_full >= MAX_FULL_SHEAVES) {
+		ret = -E2BIG;
+	} else {
+		list_add(&sheaf->barn_list, &barn->sheaves_full);
+		barn->nr_full++;
+	}
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+	return ret;
+}
+
+/*
+ * If a full sheaf is available, return it and put the supplied empty one to
+ * barn. We ignore the limit on empty sheaves as the number of sheaves doesn't
+ * change.
+ */
+static struct slab_sheaf *
+barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty)
+{
+	struct slab_sheaf *full = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	if (barn->nr_full) {
+		full = list_first_entry(&barn->sheaves_full, struct slab_sheaf,
+					barn_list);
+		list_del(&full->barn_list);
+		list_add(&empty->barn_list, &barn->sheaves_empty);
+		barn->nr_full--;
+		barn->nr_empty++;
+	}
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+
+	return full;
+}
+/*
+ * If an empty sheaf is available, return it and put the supplied full one to
+ * barn. But if there are too many full sheaves, reject this with -E2BIG.
+ */
+static struct slab_sheaf *
+barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
+{
+	struct slab_sheaf *empty;
+	unsigned long flags;
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	if (barn->nr_full >= MAX_FULL_SHEAVES) {
+		empty = ERR_PTR(-E2BIG);
+	} else if (!barn->nr_empty) {
+		empty = ERR_PTR(-ENOMEM);
+	} else {
+		empty = list_first_entry(&barn->sheaves_empty, struct slab_sheaf,
+					 barn_list);
+		list_del(&empty->barn_list);
+		list_add(&full->barn_list, &barn->sheaves_full);
+		barn->nr_empty--;
+		barn->nr_full++;
+	}
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+
+	return empty;
+}
+
+static void barn_init(struct node_barn *barn)
+{
+	spin_lock_init(&barn->lock);
+	INIT_LIST_HEAD(&barn->sheaves_full);
+	INIT_LIST_HEAD(&barn->sheaves_empty);
+	barn->nr_full = 0;
+	barn->nr_empty = 0;
+}
+
+static void barn_shrink(struct kmem_cache *s, struct node_barn *barn)
+{
+	struct list_head empty_list;
+	struct list_head full_list;
+	struct slab_sheaf *sheaf, *sheaf2;
+	unsigned long flags;
+
+	INIT_LIST_HEAD(&empty_list);
+	INIT_LIST_HEAD(&full_list);
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	list_splice_init(&barn->sheaves_full, &full_list);
+	barn->nr_full = 0;
+	list_splice_init(&barn->sheaves_empty, &empty_list);
+	barn->nr_empty = 0;
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+
+	list_for_each_entry_safe(sheaf, sheaf2, &full_list, barn_list) {
+		sheaf_flush(s, sheaf);
+		list_move(&sheaf->barn_list, &empty_list);
+	}
+
+	list_for_each_entry_safe(sheaf, sheaf2, &empty_list, barn_list)
+		free_empty_sheaf(s, sheaf);
+}
+
 /*
  * Slab allocation and freeing
  */
@@ -3271,11 +3659,42 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
 	put_partials_cpu(s, c);
 }
 
-struct slub_flush_work {
-	struct work_struct work;
-	struct kmem_cache *s;
-	bool skip;
-};
+static inline void flush_this_cpu_slab(struct kmem_cache *s)
+{
+	struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
+
+	if (c->slab)
+		flush_slab(s, c);
+
+	put_partials(s);
+}
+
+static bool has_cpu_slab(int cpu, struct kmem_cache *s)
+{
+	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
+
+	return c->slab || slub_percpu_partial(c);
+}
+
+#else /* CONFIG_SLUB_TINY */
+static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu) { }
+static inline bool has_cpu_slab(int cpu, struct kmem_cache *s) { return false; }
+static inline void flush_this_cpu_slab(struct kmem_cache *s) { }
+#endif /* CONFIG_SLUB_TINY */
+
+static bool has_pcs_used(int cpu, struct kmem_cache *s)
+{
+	struct slub_percpu_sheaves *pcs;
+
+	if (!s->cpu_sheaves)
+		return false;
+
+	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
+
+	return (pcs->spare || pcs->rcu_free || pcs->main->size);
+}
+
+static void pcs_flush_all(struct kmem_cache *s);
 
 /*
  * Flush cpu slab.
@@ -3285,30 +3704,18 @@ struct slub_flush_work {
 static void flush_cpu_slab(struct work_struct *w)
 {
 	struct kmem_cache *s;
-	struct kmem_cache_cpu *c;
 	struct slub_flush_work *sfw;
 
 	sfw = container_of(w, struct slub_flush_work, work);
 
 	s = sfw->s;
-	c = this_cpu_ptr(s->cpu_slab);
 
-	if (c->slab)
-		flush_slab(s, c);
-
-	put_partials(s);
-}
-
-static bool has_cpu_slab(int cpu, struct kmem_cache *s)
-{
-	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
+	if (s->cpu_sheaves)
+		pcs_flush_all(s);
 
-	return c->slab || slub_percpu_partial(c);
+	flush_this_cpu_slab(s);
 }
 
-static DEFINE_MUTEX(flush_lock);
-static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
-
 static void flush_all_cpus_locked(struct kmem_cache *s)
 {
 	struct slub_flush_work *sfw;
@@ -3319,7 +3726,7 @@ static void flush_all_cpus_locked(struct kmem_cache *s)
 
 	for_each_online_cpu(cpu) {
 		sfw = &per_cpu(slub_flush, cpu);
-		if (!has_cpu_slab(cpu, s)) {
+		if (!has_cpu_slab(cpu, s) && !has_pcs_used(cpu, s)) {
 			sfw->skip = true;
 			continue;
 		}
@@ -3355,19 +3762,14 @@ static int slub_cpu_dead(unsigned int cpu)
 	struct kmem_cache *s;
 
 	mutex_lock(&slab_mutex);
-	list_for_each_entry(s, &slab_caches, list)
+	list_for_each_entry(s, &slab_caches, list) {
 		__flush_cpu_slab(s, cpu);
+		__pcs_flush_all_cpu(s, cpu);
+	}
 	mutex_unlock(&slab_mutex);
 	return 0;
 }
 
-#else /* CONFIG_SLUB_TINY */
-static inline void flush_all_cpus_locked(struct kmem_cache *s) { }
-static inline void flush_all(struct kmem_cache *s) { }
-static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu) { }
-static inline int slub_cpu_dead(unsigned int cpu) { return 0; }
-#endif /* CONFIG_SLUB_TINY */
-
 /*
  * Check if the objects in a per cpu structure fit numa
  * locality expectations.
@@ -4095,6 +4497,173 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 	return memcg_slab_post_alloc_hook(s, lru, flags, size, p);
 }
 
+static __fastpath_inline
+void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
+{
+	struct slub_percpu_sheaves *pcs;
+	unsigned long flags;
+	void *object;
+
+	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (unlikely(pcs->main->size == 0)) {
+
+		struct slab_sheaf *empty = NULL;
+		struct slab_sheaf *full;
+		bool can_alloc;
+
+		if (pcs->spare && pcs->spare->size > 0) {
+			stat(s, SHEAF_SWAP);
+			swap(pcs->main, pcs->spare);
+			goto do_alloc;
+		}
+
+		full = barn_replace_empty_sheaf(pcs->barn, pcs->main);
+
+		if (full) {
+			pcs->main = full;
+			goto do_alloc;
+		}
+
+		can_alloc = gfpflags_allow_blocking(gfp);
+
+		if (can_alloc) {
+			if (pcs->spare) {
+				empty = pcs->spare;
+				pcs->spare = NULL;
+			} else {
+				empty = barn_get_empty_sheaf(pcs->barn);
+			}
+		}
+
+		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+
+		if (!can_alloc)
+			return NULL;
+
+		if (empty) {
+			if (!refill_sheaf(s, empty, gfp)) {
+				full = empty;
+			} else {
+				/*
+				 * we must be very low on memory so don't bother
+				 * with the barn
+				 */
+				free_empty_sheaf(s, empty);
+			}
+		} else {
+			full = alloc_full_sheaf(s, gfp);
+		}
+
+		if (!full)
+			return NULL;
+
+		local_lock_irqsave(&s->cpu_sheaves->lock, flags);
+		pcs = this_cpu_ptr(s->cpu_sheaves);
+
+		/*
+		 * If we are returning an empty sheaf, we either got it from the
+		 * barn or had to allocate one. If we are returning a full
+		 * sheaf, it's due to racing or being migrated to a different
+		 * cpu. Breaching the barn's sheaf limits should thus be rare
+		 * enough, so just ignore them to simplify the recovery.
+		 */
+
+		if (pcs->main->size == 0) {
+			barn_put_empty_sheaf(pcs->barn, pcs->main, true);
+			pcs->main = full;
+			goto do_alloc;
+		}
+
+		if (!pcs->spare) {
+			pcs->spare = full;
+			goto do_alloc;
+		}
+
+		if (pcs->spare->size == 0) {
+			barn_put_empty_sheaf(pcs->barn, pcs->spare, true);
+			pcs->spare = full;
+			goto do_alloc;
+		}
+
+		barn_put_full_sheaf(pcs->barn, full, true);
+	}
+
+do_alloc:
+	object = pcs->main->objects[--pcs->main->size];
+
+	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+
+	stat(s, ALLOC_PCS);
+
+	return object;
+}
+
+static __fastpath_inline
+unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct slab_sheaf *main;
+	unsigned long flags;
+	unsigned int allocated = 0;
+	unsigned int batch;
+
+next_batch:
+	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (unlikely(pcs->main->size == 0)) {
+
+		struct slab_sheaf *full;
+
+		if (pcs->spare && pcs->spare->size > 0) {
+			stat(s, SHEAF_SWAP);
+			swap(pcs->main, pcs->spare);
+			goto do_alloc;
+		}
+
+		full = barn_replace_empty_sheaf(pcs->barn, pcs->main);
+
+		if (full) {
+			pcs->main = full;
+			goto do_alloc;
+		}
+
+		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+
+		/*
+		 * Once full sheaves in barn are depleted, let the bulk
+		 * allocation continue from slab pages, otherwise we would just
+		 * be copying arrays of pointers twice.
+		 */
+		return allocated;
+	}
+
+do_alloc:
+
+	main = pcs->main;
+	batch = min(size, main->size);
+
+	main->size -= batch;
+	memcpy(p, main->objects + main->size, batch * sizeof(void *));
+
+	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+
+	stat_add(s, ALLOC_PCS, batch);
+
+	allocated += batch;
+
+	if (batch < size) {
+		p += batch;
+		size -= batch;
+		goto next_batch;
+	}
+
+	return allocated;
+}
+
+
 /*
  * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
  * have the fastpath folded into their functions. So no function call
@@ -4119,7 +4688,11 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
 	if (unlikely(object))
 		goto out;
 
-	object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
+	if (s->cpu_sheaves && (node == NUMA_NO_NODE))
+		object = alloc_from_pcs(s, gfpflags);
+
+	if (!object)
+		object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
 
 	maybe_wipe_obj_freeptr(s, object);
 	init = slab_want_init_on_alloc(gfpflags, s);
@@ -4490,6 +5063,196 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
 	discard_slab(s, slab);
 }
 
+/*
+ * Free an object to the percpu sheaves.
+ * The object is expected to have passed slab_free_hook() already.
+ */
+static __fastpath_inline
+void free_to_pcs(struct kmem_cache *s, void *object)
+{
+	struct slub_percpu_sheaves *pcs;
+	unsigned long flags;
+
+restart:
+	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (unlikely(pcs->main->size == s->sheaf_capacity)) {
+
+		struct slab_sheaf *empty;
+
+		if (!pcs->spare) {
+			empty = barn_get_empty_sheaf(pcs->barn);
+			if (empty) {
+				pcs->spare = pcs->main;
+				pcs->main = empty;
+				goto do_free;
+			}
+			goto alloc_empty;
+		}
+
+		if (pcs->spare->size < s->sheaf_capacity) {
+			stat(s, SHEAF_SWAP);
+			swap(pcs->main, pcs->spare);
+			goto do_free;
+		}
+
+		empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
+
+		if (!IS_ERR(empty)) {
+			pcs->main = empty;
+			goto do_free;
+		}
+
+		if (PTR_ERR(empty) == -E2BIG) {
+			/* Since we got here, spare exists and is full */
+			struct slab_sheaf *to_flush = pcs->spare;
+
+			pcs->spare = NULL;
+			local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+
+			sheaf_flush(s, to_flush);
+			empty = to_flush;
+			goto got_empty;
+		}
+
+alloc_empty:
+		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+
+		empty = alloc_empty_sheaf(s, GFP_NOWAIT);
+
+		if (!empty) {
+			sheaf_flush_main(s);
+			goto restart;
+		}
+
+got_empty:
+		local_lock_irqsave(&s->cpu_sheaves->lock, flags);
+		pcs = this_cpu_ptr(s->cpu_sheaves);
+
+		/*
+		 * if we put any sheaf to barn here, it's because we raced or
+		 * have been migrated to a different cpu, which should be rare
+		 * enough so just ignore the barn's limits to simplify
+		 */
+		if (unlikely(pcs->main->size < s->sheaf_capacity)) {
+			if (!pcs->spare)
+				pcs->spare = empty;
+			else
+				barn_put_empty_sheaf(pcs->barn, empty, true);
+			goto do_free;
+		}
+
+		if (!pcs->spare) {
+			pcs->spare = pcs->main;
+			pcs->main = empty;
+			goto do_free;
+		}
+
+		barn_put_full_sheaf(pcs->barn, pcs->main, true);
+		pcs->main = empty;
+	}
+
+do_free:
+	pcs->main->objects[pcs->main->size++] = object;
+
+	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+
+	stat(s, FREE_PCS);
+}
+
+/*
+ * Bulk free objects to the percpu sheaves.
+ * Unlike free_to_pcs() this includes the calls to all necessary hooks
+ * and the fallback to freeing to slab pages.
+ */
+static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct slab_sheaf *main;
+	unsigned long flags;
+	unsigned int batch, i = 0;
+	bool init;
+
+	init = slab_want_init_on_free(s);
+
+	while (i < size) {
+		struct slab *slab = virt_to_slab(p[i]);
+
+		memcg_slab_free_hook(s, slab, p + i, 1);
+		alloc_tagging_slab_free_hook(s, slab, p + i, 1);
+
+		if (unlikely(!slab_free_hook(s, p[i], init, false))) {
+			p[i] = p[--size];
+			if (!size)
+				return;
+			continue;
+		}
+
+		i++;
+	}
+
+next_batch:
+	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (unlikely(pcs->main->size == s->sheaf_capacity)) {
+
+		struct slab_sheaf *empty;
+
+		if (!pcs->spare) {
+			empty = barn_get_empty_sheaf(pcs->barn);
+			if (empty) {
+				pcs->spare = pcs->main;
+				pcs->main = empty;
+				goto do_free;
+			}
+			goto no_empty;
+		}
+
+		if (pcs->spare->size < s->sheaf_capacity) {
+			stat(s, SHEAF_SWAP);
+			swap(pcs->main, pcs->spare);
+			goto do_free;
+		}
+
+		empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
+
+		if (!IS_ERR(empty)) {
+			pcs->main = empty;
+			goto do_free;
+		}
+
+no_empty:
+		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+
+		/*
+		 * if we depleted all empty sheaves in the barn or there are too
+		 * many full sheaves, free the rest to slab pages
+		 */
+
+		__kmem_cache_free_bulk(s, size, p);
+		return;
+	}
+
+do_free:
+	main = pcs->main;
+	batch = min(size, s->sheaf_capacity - main->size);
+
+	memcpy(main->objects + main->size, p, batch * sizeof(void *));
+	main->size += batch;
+
+	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+
+	stat_add(s, FREE_PCS, batch);
+
+	if (batch < size) {
+		p += batch;
+		size -= batch;
+		goto next_batch;
+	}
+}
+
 #ifndef CONFIG_SLUB_TINY
 /*
  * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
@@ -4576,7 +5339,12 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
 	memcg_slab_free_hook(s, slab, &object, 1);
 	alloc_tagging_slab_free_hook(s, slab, &object, 1);
 
-	if (likely(slab_free_hook(s, object, slab_want_init_on_free(s), false)))
+	if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
+		return;
+
+	if (s->cpu_sheaves)
+		free_to_pcs(s, object);
+	else
 		do_slab_free(s, slab, object, object, 1, addr);
 }
 
@@ -4837,6 +5605,15 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 	if (!size)
 		return;
 
+	/*
+	 * freeing to sheaves is incompatible with the detached freelist, so
+	 * once we go that way, we have to do everything differently
+	 */
+	if (s && s->cpu_sheaves) {
+		free_to_pcs_bulk(s, size, p);
+		return;
+	}
+
 	do {
 		struct detached_freelist df;
 
@@ -4955,7 +5732,7 @@ static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
 int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
 				 void **p)
 {
-	int i;
+	unsigned int i = 0;
 
 	if (!size)
 		return 0;
@@ -4964,9 +5741,21 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
 	if (unlikely(!s))
 		return 0;
 
-	i = __kmem_cache_alloc_bulk(s, flags, size, p);
-	if (unlikely(i == 0))
-		return 0;
+	if (s->cpu_sheaves)
+		i = alloc_from_pcs_bulk(s, size, p);
+
+	if (i < size) {
+		unsigned int j = __kmem_cache_alloc_bulk(s, flags, size - i, p + i);
+		/*
+		 * If we ran out of memory, don't bother with freeing back to
+		 * the percpu sheaves, we have bigger problems.
+		 */
+		if (unlikely(j == 0)) {
+			if (i > 0)
+				__kmem_cache_free_bulk(s, i, p);
+			return 0;
+		}
+	}
 
 	/*
 	 * memcg and kmem_cache debug support and memory initialization.
@@ -4976,11 +5765,11 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
 		    slab_want_init_on_alloc(flags, s), s->object_size))) {
 		return 0;
 	}
-	return i;
+
+	return size;
 }
 EXPORT_SYMBOL(kmem_cache_alloc_bulk_noprof);
 
-
 /*
  * Object placement in a slab is made very easy because we always start at
  * offset 0. If we tune the size of the object to the alignment then we can
@@ -5113,8 +5902,8 @@ static inline int calculate_order(unsigned int size)
 	return -ENOSYS;
 }
 
-static void
-init_kmem_cache_node(struct kmem_cache_node *n)
+static bool
+init_kmem_cache_node(struct kmem_cache_node *n, struct node_barn *barn)
 {
 	n->nr_partial = 0;
 	spin_lock_init(&n->list_lock);
@@ -5124,6 +5913,11 @@ init_kmem_cache_node(struct kmem_cache_node *n)
 	atomic_long_set(&n->total_objects, 0);
 	INIT_LIST_HEAD(&n->full);
 #endif
+	n->barn = barn;
+	if (barn)
+		barn_init(barn);
+
+	return true;
 }
 
 #ifndef CONFIG_SLUB_TINY
@@ -5154,6 +5948,30 @@ static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
 }
 #endif /* CONFIG_SLUB_TINY */
 
+static int init_percpu_sheaves(struct kmem_cache *s)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct slub_percpu_sheaves *pcs;
+		int nid;
+
+		pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
+
+		local_lock_init(&pcs->lock);
+
+		nid = cpu_to_mem(cpu);
+
+		pcs->barn = get_node(s, nid)->barn;
+		pcs->main = alloc_empty_sheaf(s, GFP_KERNEL);
+
+		if (!pcs->main)
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
 static struct kmem_cache *kmem_cache_node;
 
 /*
@@ -5189,7 +6007,7 @@ static void early_kmem_cache_node_alloc(int node)
 	slab->freelist = get_freepointer(kmem_cache_node, n);
 	slab->inuse = 1;
 	kmem_cache_node->node[node] = n;
-	init_kmem_cache_node(n);
+	init_kmem_cache_node(n, NULL);
 	inc_slabs_node(kmem_cache_node, node, slab->objects);
 
 	/*
@@ -5205,6 +6023,13 @@ static void free_kmem_cache_nodes(struct kmem_cache *s)
 	struct kmem_cache_node *n;
 
 	for_each_kmem_cache_node(s, node, n) {
+		if (n->barn) {
+			WARN_ON(n->barn->nr_full);
+			WARN_ON(n->barn->nr_empty);
+			kfree(n->barn);
+			n->barn = NULL;
+		}
+
 		s->node[node] = NULL;
 		kmem_cache_free(kmem_cache_node, n);
 	}
@@ -5213,6 +6038,8 @@ static void free_kmem_cache_nodes(struct kmem_cache *s)
 void __kmem_cache_release(struct kmem_cache *s)
 {
 	cache_random_seq_destroy(s);
+	if (s->cpu_sheaves)
+		pcs_destroy(s);
 #ifndef CONFIG_SLUB_TINY
 	free_percpu(s->cpu_slab);
 #endif
@@ -5225,20 +6052,27 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
 
 	for_each_node_mask(node, slab_nodes) {
 		struct kmem_cache_node *n;
+		struct node_barn *barn = NULL;
 
 		if (slab_state == DOWN) {
 			early_kmem_cache_node_alloc(node);
 			continue;
 		}
+
+		if (s->cpu_sheaves) {
+			barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, node);
+
+			if (!barn)
+				return 0;
+		}
+
 		n = kmem_cache_alloc_node(kmem_cache_node,
 						GFP_KERNEL, node);
-
-		if (!n) {
-			free_kmem_cache_nodes(s);
+		if (!n)
 			return 0;
-		}
 
-		init_kmem_cache_node(n);
+		init_kmem_cache_node(n, barn);
+
 		s->node[node] = n;
 	}
 	return 1;
@@ -5494,6 +6328,8 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
 	flush_all_cpus_locked(s);
 	/* Attempt to free all objects */
 	for_each_kmem_cache_node(s, node, n) {
+		if (n->barn)
+			barn_shrink(s, n->barn);
 		free_partial(s, n);
 		if (n->nr_partial || node_nr_slabs(n))
 			return 1;
@@ -5680,6 +6516,9 @@ static int __kmem_cache_do_shrink(struct kmem_cache *s)
 		for (i = 0; i < SHRINK_PROMOTE_MAX; i++)
 			INIT_LIST_HEAD(promote + i);
 
+		if (n->barn)
+			barn_shrink(s, n->barn);
+
 		spin_lock_irqsave(&n->list_lock, flags);
 
 		/*
@@ -5792,12 +6631,24 @@ static int slab_mem_going_online_callback(void *arg)
 	 */
 	mutex_lock(&slab_mutex);
 	list_for_each_entry(s, &slab_caches, list) {
+		struct node_barn *barn = NULL;
+
 		/*
 		 * The structure may already exist if the node was previously
 		 * onlined and offlined.
 		 */
 		if (get_node(s, nid))
 			continue;
+
+		if (s->cpu_sheaves) {
+			barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, nid);
+
+			if (!barn) {
+				ret = -ENOMEM;
+				goto out;
+			}
+		}
+
 		/*
 		 * XXX: kmem_cache_alloc_node will fallback to other nodes
 		 *      since memory is not yet available from the node that
@@ -5808,7 +6659,9 @@ static int slab_mem_going_online_callback(void *arg)
 			ret = -ENOMEM;
 			goto out;
 		}
-		init_kmem_cache_node(n);
+
+		init_kmem_cache_node(n, barn);
+
 		s->node[nid] = n;
 	}
 	/*
@@ -6026,6 +6879,16 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
 
 	set_cpu_partial(s);
 
+	if (args->sheaf_capacity) {
+		s->cpu_sheaves = alloc_percpu(struct slub_percpu_sheaves);
+		if (!s->cpu_sheaves) {
+			err = -ENOMEM;
+			goto out;
+		}
+		// TODO: increase capacity to grow slab_sheaf up to next kmalloc size?
+		s->sheaf_capacity = args->sheaf_capacity;
+	}
+
 #ifdef CONFIG_NUMA
 	s->remote_node_defrag_ratio = 1000;
 #endif
@@ -6042,6 +6905,12 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
 	if (!alloc_kmem_cache_cpus(s))
 		goto out;
 
+	if (s->cpu_sheaves) {
+		err = init_percpu_sheaves(s);
+		if (err)
+			goto out;
+	}
+
 	/* Mutex is not taken during early boot */
 	if (slab_state <= UP) {
 		err = 0;
@@ -6060,7 +6929,6 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
 		__kmem_cache_release(s);
 	return err;
 }
-
 #ifdef SLAB_SUPPORTS_SYSFS
 static int count_inuse(struct slab *slab)
 {
@@ -6838,8 +7706,10 @@ static ssize_t text##_store(struct kmem_cache *s,		\
 }								\
 SLAB_ATTR(text);						\
 
+STAT_ATTR(ALLOC_PCS, alloc_cpu_sheaf);
 STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
 STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
+STAT_ATTR(FREE_PCS, free_cpu_sheaf);
 STAT_ATTR(FREE_FASTPATH, free_fastpath);
 STAT_ATTR(FREE_SLOWPATH, free_slowpath);
 STAT_ATTR(FREE_FROZEN, free_frozen);
@@ -6864,6 +7734,12 @@ STAT_ATTR(CPU_PARTIAL_ALLOC, cpu_partial_alloc);
 STAT_ATTR(CPU_PARTIAL_FREE, cpu_partial_free);
 STAT_ATTR(CPU_PARTIAL_NODE, cpu_partial_node);
 STAT_ATTR(CPU_PARTIAL_DRAIN, cpu_partial_drain);
+STAT_ATTR(SHEAF_FLUSH_MAIN, sheaf_flush_main);
+STAT_ATTR(SHEAF_FLUSH_OTHER, sheaf_flush_other);
+STAT_ATTR(SHEAF_REFILL, sheaf_refill);
+STAT_ATTR(SHEAF_SWAP, sheaf_swap);
+STAT_ATTR(SHEAF_ALLOC, sheaf_alloc);
+STAT_ATTR(SHEAF_FREE, sheaf_free);
 #endif	/* CONFIG_SLUB_STATS */
 
 #ifdef CONFIG_KFENCE
@@ -6925,8 +7801,10 @@ static struct attribute *slab_attrs[] = {
 	&remote_node_defrag_ratio_attr.attr,
 #endif
 #ifdef CONFIG_SLUB_STATS
+	&alloc_cpu_sheaf_attr.attr,
 	&alloc_fastpath_attr.attr,
 	&alloc_slowpath_attr.attr,
+	&free_cpu_sheaf_attr.attr,
 	&free_fastpath_attr.attr,
 	&free_slowpath_attr.attr,
 	&free_frozen_attr.attr,
@@ -6951,6 +7829,12 @@ static struct attribute *slab_attrs[] = {
 	&cpu_partial_free_attr.attr,
 	&cpu_partial_node_attr.attr,
 	&cpu_partial_drain_attr.attr,
+	&sheaf_flush_main_attr.attr,
+	&sheaf_flush_other_attr.attr,
+	&sheaf_refill_attr.attr,
+	&sheaf_swap_attr.attr,
+	&sheaf_alloc_attr.attr,
+	&sheaf_free_attr.attr,
 #endif
 #ifdef CONFIG_FAILSLAB
 	&failslab_attr.attr,

-- 
2.47.0



^ permalink raw reply	[flat|nested] 19+ messages in thread
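The decision ladder in free_to_pcs() from the patch above can be modeled in plain userspace C. This is an illustrative sketch under simplified assumptions: locking, statistics, barn limits, and the empty-sheaf allocation path are omitted, and only the names mirror the patch.

```c
/* Userspace model (illustrative, not kernel code) of the free_to_pcs()
 * decision ladder: when the main sheaf is full, try swapping with a
 * non-full spare; otherwise signal that the caller must fall back
 * (barn interaction / flush in the real code). */
#include <assert.h>
#include <stddef.h>

#define CAP 4 /* tiny capacity so both paths are easy to exercise */

struct sheaf { size_t size; void *objects[CAP]; };

struct pcs {
	struct sheaf *main;
	struct sheaf *spare; /* may be NULL */
};

static int sheaf_full(const struct sheaf *s) { return s->size == CAP; }

/* returns which path handled the free: 0 = main had room,
 * 1 = swapped with a non-full spare, -1 = caller must flush/fall back */
static int free_to_pcs_model(struct pcs *pcs, void *object)
{
	int path = 0;

	if (sheaf_full(pcs->main)) {
		if (pcs->spare && !sheaf_full(pcs->spare)) {
			/* corresponds to stat(s, SHEAF_SWAP); swap() */
			struct sheaf *tmp = pcs->main;

			pcs->main = pcs->spare;
			pcs->spare = tmp;
			path = 1;
		} else {
			return -1;
		}
	}
	pcs->main->objects[pcs->main->size++] = object;
	return path;
}
```

The model only shows the two cheap paths; in the patch, the -1 case expands into the barn_replace_full_sheaf() / alloc_empty_sheaf() ladder.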

* [PATCH RFC 2/6] mm/slub: add sheaf support for batching kfree_rcu() operations
  2024-11-12 16:38 [PATCH RFC 0/6] SLUB percpu sheaves Vlastimil Babka
  2024-11-12 16:38 ` [PATCH RFC 1/6] mm/slub: add opt-in caching layer of " Vlastimil Babka
@ 2024-11-12 16:38 ` Vlastimil Babka
  2024-11-14 16:57   ` Uladzislau Rezki
  2024-11-12 16:38 ` [PATCH RFC 3/6] maple_tree: use percpu sheaves for maple_node_cache Vlastimil Babka
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Vlastimil Babka @ 2024-11-12 16:38 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Pekka Enberg, Joonsoo Kim
  Cc: Roman Gushchin, Hyeonggon Yoo, Paul E. McKenney, Lorenzo Stoakes,
	Matthew Wilcox, Boqun Feng, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, Vlastimil Babka

Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
For caches with sheaves enabled, maintain on each cpu an rcu_free sheaf
in addition to the main and spare sheaves.

kfree_rcu() operations will try to put objects on this sheaf. Once full,
the sheaf is detached and submitted to call_rcu() with a handler that
will try to put it in the barn, or, when the barn is full, flush it to
slab pages using bulk free. A new empty sheaf must then be obtained to
put more objects there.

It's possible that no empty sheaves are available to use for a new
rcu_free sheaf, and the allocation in kfree_rcu() context can only use
GFP_NOWAIT and thus may fail. In that case, fall back to the existing
kfree_rcu() machinery.

Because some intended users will need to perform additional cleanups
after the grace period, and thus have custom call_rcu() callbacks
today, add the possibility to specify a kfree_rcu()-specific
destructor. Because of the fallback possibility, the destructor now
also needs to be invoked from within RCU, so add __kvfree_rcu() that
RCU can use instead of kvfree().

Expected advantages:
- batching the kfree_rcu() operations, which could eventually replace
  the batching done in RCU itself
- sheaves can be reused via the barn instead of being flushed to slabs,
  which is more effective
  - this includes cases where only some cpus are allowed to process rcu
    callbacks (Android)
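The batching advantage listed above can be sketched with a minimal userspace C model. This is illustrative only, not kernel code; the capacity of 32 matches the sheaf_capacity used by the later patches. The point is that one call_rcu()-equivalent handoff happens per full sheaf rather than per object.

```c
/* Userspace model of the rcu_free sheaf batching: frees accumulate in
 * a fixed-capacity array and are handed off wholesale when it fills,
 * amortizing the per-object cost of the deferred-free machinery. */
#include <assert.h>
#include <stddef.h>

#define CAPACITY 32

struct sheaf {
	size_t size;
	void *objects[CAPACITY];
};

static int handoffs; /* number of call_rcu()-equivalent operations */

static void sheaf_hand_off(struct sheaf *s)
{
	/* in the kernel this is one call_rcu() for the whole sheaf */
	handoffs++;
	s->size = 0; /* a fresh empty sheaf takes its place */
}

static void sheaf_free(struct sheaf *s, void *obj)
{
	s->objects[s->size++] = obj;
	if (s->size == CAPACITY)
		sheaf_hand_off(s);
}
```

With this model, 100 kfree_rcu()-style frees cost three handoffs instead of 100 individual callbacks.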

Possible disadvantage:
- objects might be waiting for more than their grace period (it is
  determined by the last object freed into the sheaf), increasing memory
  usage - but that might be true for the batching done by RCU as well?

RFC LIMITATIONS:
- only tree rcu is converted, not tiny
- the rcu fallback might resort to kfree_bulk(), not kvfree(). Instead
  of adding a variant of kfree_bulk() with destructors, is there an easy
  way to disable the kfree_bulk() path in the fallback case?

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/slab.h |  15 +++++
 kernel/rcu/tree.c    |   8 ++-
 mm/slab.h            |  25 +++++++
 mm/slab_common.c     |   3 +
 mm/slub.c            | 182 +++++++++++++++++++++++++++++++++++++++++++++++++--
 5 files changed, 227 insertions(+), 6 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index b13fb1c1f03c14a5b45bc6a64a2096883aef9f83..23904321992ad2eeb9389d0883cf4d5d5d71d896 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -343,6 +343,21 @@ struct kmem_cache_args {
 	 * %0 means no sheaves will be created
 	 */
 	unsigned int sheaf_capacity;
+	/**
+	 * @sheaf_rcu_dtor: A destructor for objects freed by kfree_rcu()
+	 *
+	 * Only valid when non-zero @sheaf_capacity is specified. When freeing
+	 * objects by kfree_rcu() in a cache with sheaves, the objects are put
+	 * to a special percpu sheaf. When that sheaf is full, it's passed to
+	 * call_rcu() and after a grace period the sheaf can be reused for new
+	 * allocations. In case a cleanup is necessary after the grace period
+	 * and before reusal, a pointer to such function can be given as
+	 * @sheaf_rcu_dtor and will be called on each object in the rcu sheaf
+	 * after the grace period passes and before the sheaf's reuse.
+	 *
+	 * %NULL means no destructor is called.
+	 */
+	void (*sheaf_rcu_dtor)(void *obj);
 };
 
 struct kmem_cache *__kmem_cache_create_args(const char *name,
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index b1f883fcd9185a5e22c10102d1024c40688f57fb..42c994fdf9960bfed8d8bd697de90af72c1f4f58 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -65,6 +65,7 @@
 #include <linux/kasan.h>
 #include <linux/context_tracking.h>
 #include "../time/tick-internal.h"
+#include "../../mm/slab.h"
 
 #include "tree.h"
 #include "rcu.h"
@@ -3420,7 +3421,7 @@ kvfree_rcu_list(struct rcu_head *head)
 		trace_rcu_invoke_kvfree_callback(rcu_state.name, head, offset);
 
 		if (!WARN_ON_ONCE(!__is_kvfree_rcu_offset(offset)))
-			kvfree(ptr);
+			__kvfree_rcu(ptr);
 
 		rcu_lock_release(&rcu_callback_map);
 		cond_resched_tasks_rcu_qs();
@@ -3797,6 +3798,9 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
 	if (!head)
 		might_sleep();
 
+	if (kfree_rcu_sheaf(ptr))
+		return;
+
 	// Queue the object but don't yet schedule the batch.
 	if (debug_rcu_head_queue(ptr)) {
 		// Probable double kfree_rcu(), just leak.
@@ -3849,7 +3853,7 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
 	if (!success) {
 		debug_rcu_head_unqueue((struct rcu_head *) ptr);
 		synchronize_rcu();
-		kvfree(ptr);
+		__kvfree_rcu(ptr);
 	}
 }
 EXPORT_SYMBOL_GPL(kvfree_call_rcu);
diff --git a/mm/slab.h b/mm/slab.h
index 001e0d55467bb4803b5dff718ba8e0c775f42b3f..4dc145c14dfd97677c74a767c22f5dd22f5d6451 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -276,6 +276,9 @@ struct kmem_cache {
 	gfp_t allocflags;		/* gfp flags to use on each alloc */
 	int refcount;			/* Refcount for slab cache destroy */
 	void (*ctor)(void *object);	/* Object constructor */
+	void (*rcu_dtor)(void *object);	/* Object destructor to execute after
+					 * kfree_rcu grace period
+					 */
 	unsigned int inuse;		/* Offset to metadata */
 	unsigned int align;		/* Alignment */
 	unsigned int red_left_pad;	/* Left redzone padding size */
@@ -454,6 +457,28 @@ static inline bool is_kmalloc_normal(struct kmem_cache *s)
 	return !(s->flags & (SLAB_CACHE_DMA|SLAB_ACCOUNT|SLAB_RECLAIM_ACCOUNT));
 }
 
+void __kvfree_rcu(void *obj);
+
+bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj);
+
+static inline bool kfree_rcu_sheaf(void *obj)
+{
+	struct kmem_cache *s;
+	struct folio *folio;
+	struct slab *slab;
+
+	folio = virt_to_folio(obj);
+	if (unlikely(!folio_test_slab(folio)))
+		return false;
+
+	slab = folio_slab(folio);
+	s = slab->slab_cache;
+	if (s->cpu_sheaves)
+		return __kfree_rcu_sheaf(s, obj);
+
+	return false;
+}
+
 /* Legal flag mask for kmem_cache_create(), for various configurations */
 #define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | \
 			 SLAB_CACHE_DMA32 | SLAB_PANIC | \
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 7939f3f017740e0ac49ffa971c45409d0fbe2f23..d69ed1e7ea34f9657cb9514fb98a48647f01381b 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -236,6 +236,9 @@ static struct kmem_cache *create_cache(const char *name,
 	     !IS_ALIGNED(args->freeptr_offset, sizeof(freeptr_t))))
 		goto out;
 
+	if (args->sheaf_rcu_dtor && !args->sheaf_capacity)
+		goto out;
+
 	err = -ENOMEM;
 	s = kmem_cache_zalloc(kmem_cache, GFP_KERNEL);
 	if (!s)
diff --git a/mm/slub.c b/mm/slub.c
index 7da08112213b203993b5159eb35a1ea640d706fe..6811d766c0470cd7066c2574ad86e00405c916bb 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -351,6 +351,8 @@ enum stat_item {
 	ALLOC_FASTPATH,		/* Allocation from cpu slab */
 	ALLOC_SLOWPATH,		/* Allocation by getting a new cpu slab */
 	FREE_PCS,		/* Free to percpu sheaf */
+	FREE_RCU_SHEAF,		/* Free to rcu_free sheaf */
+	FREE_RCU_SHEAF_FAIL,	/* Failed to free to a rcu_free sheaf */
 	FREE_FASTPATH,		/* Free to cpu slab */
 	FREE_SLOWPATH,		/* Freeing not to cpu slab */
 	FREE_FROZEN,		/* Freeing to frozen slab */
@@ -2557,6 +2559,24 @@ static void sheaf_flush(struct kmem_cache *s, struct slab_sheaf *sheaf)
 	sheaf->size = 0;
 }
 
+static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
+				     struct slab_sheaf *sheaf);
+
+static void rcu_free_sheaf_nobarn(struct rcu_head *head)
+{
+	struct slab_sheaf *sheaf;
+	struct kmem_cache *s;
+
+	sheaf = container_of(head, struct slab_sheaf, rcu_head);
+	s = sheaf->cache;
+
+	__rcu_free_sheaf_prepare(s, sheaf);
+
+	sheaf_flush(s, sheaf);
+
+	free_empty_sheaf(s, sheaf);
+}
+
 /*
  * Caller needs to make sure migration is disabled in order to fully flush
  * single cpu's sheaves
@@ -2586,8 +2606,8 @@ static void pcs_flush_all(struct kmem_cache *s)
 		free_empty_sheaf(s, spare);
 	}
 
-	// TODO: handle rcu_free
-	BUG_ON(rcu_free);
+	if (rcu_free)
+		call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
 
 	sheaf_flush_main(s);
 }
@@ -2604,8 +2624,10 @@ static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
 		pcs->spare = NULL;
 	}
 
-	// TODO: handle rcu_free
-	BUG_ON(pcs->rcu_free);
+	if (pcs->rcu_free) {
+		call_rcu(&pcs->rcu_free->rcu_head, rcu_free_sheaf_nobarn);
+		pcs->rcu_free = NULL;
+	}
 
 	sheaf_flush_main(s);
 }
@@ -5161,6 +5183,121 @@ void free_to_pcs(struct kmem_cache *s, void *object)
 	stat(s, FREE_PCS);
 }
 
+static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
+				     struct slab_sheaf *sheaf)
+{
+	bool init = slab_want_init_on_free(s);
+	void **p = &sheaf->objects[0];
+	unsigned int i = 0;
+
+	while (i < sheaf->size) {
+		struct slab *slab = virt_to_slab(p[i]);
+
+		if (s->rcu_dtor)
+			s->rcu_dtor(p[i]);
+
+		memcg_slab_free_hook(s, slab, p + i, 1);
+		alloc_tagging_slab_free_hook(s, slab, p + i, 1);
+
+		if (unlikely(!slab_free_hook(s, p[i], init, false))) {
+			p[i] = p[--sheaf->size];
+			continue;
+		}
+
+		i++;
+	}
+}
+
+static void rcu_free_sheaf(struct rcu_head *head)
+{
+	struct slab_sheaf *sheaf;
+	struct node_barn *barn;
+	struct kmem_cache *s;
+
+	sheaf = container_of(head, struct slab_sheaf, rcu_head);
+
+	s = sheaf->cache;
+
+	__rcu_free_sheaf_prepare(s, sheaf);
+
+	barn = get_node(s, numa_mem_id())->barn;
+
+	/* due to slab_free_hook() */
+	if (unlikely(sheaf->size == 0))
+		goto empty;
+
+	if (!barn_put_full_sheaf(barn, sheaf, false))
+		return;
+
+	sheaf_flush(s, sheaf);
+
+empty:
+	if (!barn_put_empty_sheaf(barn, sheaf, false))
+		return;
+
+	free_empty_sheaf(s, sheaf);
+}
+
+bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct slab_sheaf *rcu_sheaf;
+	unsigned long flags;
+
+	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (unlikely(!pcs->rcu_free)) {
+
+		struct slab_sheaf *empty;
+
+		empty = barn_get_empty_sheaf(pcs->barn);
+
+		if (empty) {
+			pcs->rcu_free = empty;
+			goto do_free;
+		}
+
+		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+
+		empty = alloc_empty_sheaf(s, GFP_NOWAIT);
+
+		if (!empty) {
+			stat(s, FREE_RCU_SHEAF_FAIL);
+			return false;
+		}
+
+		local_lock_irqsave(&s->cpu_sheaves->lock, flags);
+		pcs = this_cpu_ptr(s->cpu_sheaves);
+
+		if (unlikely(pcs->rcu_free))
+			barn_put_empty_sheaf(pcs->barn, empty, true);
+		else
+			pcs->rcu_free = empty;
+	}
+
+do_free:
+
+	rcu_sheaf = pcs->rcu_free;
+
+	rcu_sheaf->objects[rcu_sheaf->size++] = obj;
+
+	if (likely(rcu_sheaf->size < s->sheaf_capacity)) {
+		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+		stat(s, FREE_RCU_SHEAF);
+		return true;
+	}
+
+	pcs->rcu_free = NULL;
+	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+
+	call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);
+
+	stat(s, FREE_RCU_SHEAF);
+
+	return true;
+}
+
 /*
  * Bulk free objects to the percpu sheaves.
  * Unlike free_to_pcs() this includes the calls to all necessary hooks
@@ -5466,6 +5603,32 @@ static void free_large_kmalloc(struct folio *folio, void *object)
 	folio_put(folio);
 }
 
+void __kvfree_rcu(void *obj)
+{
+	struct folio *folio;
+	struct slab *slab;
+	struct kmem_cache *s;
+
+	if (is_vmalloc_addr(obj)) {
+		vfree(obj);
+		return;
+	}
+
+	folio = virt_to_folio(obj);
+	if (unlikely(!folio_test_slab(folio))) {
+		free_large_kmalloc(folio, obj);
+		return;
+	}
+
+	slab = folio_slab(folio);
+	s = slab->slab_cache;
+
+	if (s->rcu_dtor)
+		s->rcu_dtor(obj);
+
+	slab_free(s, slab, obj, _RET_IP_);
+}
+
 /**
  * kfree - free previously allocated memory
  * @object: pointer returned by kmalloc() or kmem_cache_alloc()
@@ -6326,6 +6489,11 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
 	struct kmem_cache_node *n;
 
 	flush_all_cpus_locked(s);
+
+	/* we might have rcu sheaves in flight */
+	if (s->cpu_sheaves)
+		rcu_barrier();
+
 	/* Attempt to free all objects */
 	for_each_kmem_cache_node(s, node, n) {
 		if (n->barn)
@@ -6887,6 +7055,8 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
 		}
 		// TODO: increase capacity to grow slab_sheaf up to next kmalloc size?
 		s->sheaf_capacity = args->sheaf_capacity;
+
+		s->rcu_dtor = args->sheaf_rcu_dtor;
 	}
 
 #ifdef CONFIG_NUMA
@@ -7710,6 +7880,8 @@ STAT_ATTR(ALLOC_PCS, alloc_cpu_sheaf);
 STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
 STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
 STAT_ATTR(FREE_PCS, free_cpu_sheaf);
+STAT_ATTR(FREE_RCU_SHEAF, free_rcu_sheaf);
+STAT_ATTR(FREE_RCU_SHEAF_FAIL, free_rcu_sheaf_fail);
 STAT_ATTR(FREE_FASTPATH, free_fastpath);
 STAT_ATTR(FREE_SLOWPATH, free_slowpath);
 STAT_ATTR(FREE_FROZEN, free_frozen);
@@ -7805,6 +7977,8 @@ static struct attribute *slab_attrs[] = {
 	&alloc_fastpath_attr.attr,
 	&alloc_slowpath_attr.attr,
 	&free_cpu_sheaf_attr.attr,
+	&free_rcu_sheaf_attr.attr,
+	&free_rcu_sheaf_fail_attr.attr,
 	&free_fastpath_attr.attr,
 	&free_slowpath_attr.attr,
 	&free_frozen_attr.attr,

-- 
2.47.0




* [PATCH RFC 3/6] maple_tree: use percpu sheaves for maple_node_cache
  2024-11-12 16:38 [PATCH RFC 0/6] SLUB percpu sheaves Vlastimil Babka
  2024-11-12 16:38 ` [PATCH RFC 1/6] mm/slub: add opt-in caching layer of " Vlastimil Babka
  2024-11-12 16:38 ` [PATCH RFC 2/6] mm/slub: add sheaf support for batching kfree_rcu() operations Vlastimil Babka
@ 2024-11-12 16:38 ` Vlastimil Babka
  2024-11-12 16:38 ` [PATCH RFC 4/6] mm, vma: use sheaves for vm_area_struct cache Vlastimil Babka
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Vlastimil Babka @ 2024-11-12 16:38 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Pekka Enberg, Joonsoo Kim
  Cc: Roman Gushchin, Hyeonggon Yoo, Paul E. McKenney, Lorenzo Stoakes,
	Matthew Wilcox, Boqun Feng, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, Vlastimil Babka

Set up the maple_node_cache with percpu sheaves of size 32 to hopefully
improve its performance. Change the single node rcu freeing in
ma_free_rcu() to use kfree_rcu() instead of the custom callback, which
allows the rcu_free sheaf batching to be used. Note there are other
users of mt_free_rcu() where larger parts of the maple tree are
submitted to call_rcu() as a whole, and those cannot use the rcu_free
sheaf, but it's still possible for maple nodes freed this way to be
reused via the barn, even if only some cpus are allowed to process rcu
callbacks.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 lib/maple_tree.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 3619301dda2ebeaaba8a73837389b6ee3c7e1a3f..c69365e17fcbfe963dcedd0de07335fc6bbdfb27 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -194,7 +194,7 @@ static void mt_free_rcu(struct rcu_head *head)
 static void ma_free_rcu(struct maple_node *node)
 {
 	WARN_ON(node->parent != ma_parent_ptr(node));
-	call_rcu(&node->rcu, mt_free_rcu);
+	kfree_rcu(node, rcu);
 }
 
 static void mas_set_height(struct ma_state *mas)
@@ -6299,9 +6299,14 @@ bool mas_nomem(struct ma_state *mas, gfp_t gfp)
 
 void __init maple_tree_init(void)
 {
+	struct kmem_cache_args args = {
+		.align  = sizeof(struct maple_node),
+		.sheaf_capacity = 32,
+	};
+
 	maple_node_cache = kmem_cache_create("maple_node",
-			sizeof(struct maple_node), sizeof(struct maple_node),
-			SLAB_PANIC, NULL);
+			sizeof(struct maple_node), &args,
+			SLAB_PANIC);
 }
 
 /**

-- 
2.47.0




* [PATCH RFC 4/6] mm, vma: use sheaves for vm_area_struct cache
  2024-11-12 16:38 [PATCH RFC 0/6] SLUB percpu sheaves Vlastimil Babka
                   ` (2 preceding siblings ...)
  2024-11-12 16:38 ` [PATCH RFC 3/6] maple_tree: use percpu sheaves for maple_node_cache Vlastimil Babka
@ 2024-11-12 16:38 ` Vlastimil Babka
  2024-11-12 16:38 ` [PATCH RFC 5/6] mm, slub: cheaper locking for percpu sheaves Vlastimil Babka
  2024-11-12 16:38 ` [PATCH RFC 6/6] mm, slub: sheaf prefilling for guaranteed allocations Vlastimil Babka
  5 siblings, 0 replies; 19+ messages in thread
From: Vlastimil Babka @ 2024-11-12 16:38 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Pekka Enberg, Joonsoo Kim
  Cc: Roman Gushchin, Hyeonggon Yoo, Paul E. McKenney, Lorenzo Stoakes,
	Matthew Wilcox, Boqun Feng, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, Vlastimil Babka

Create the vm_area_struct cache with percpu sheaves of size 32 to
hopefully improve its performance. For CONFIG_PER_VMA_LOCK, change the
vma freeing from a custom call_rcu() callback to kfree_rcu(), which
will perform rcu_free sheaf batching. Since there may be additional
structures attached and they are freed only after the grace period,
create a __vma_area_rcu_free_dtor() to do that.

Note I have not investigated whether vma_numab_state_free() or
free_anon_vma_name() really needs to wait for the grace period. For
vma_lock_free() ideally we wouldn't free the lock at all when freeing
the vma to the sheaf (or even a slab page), but that would also require
using a ctor for vmas to allocate the vma lock, and reintroducing dtor
support for deallocating the lock when freeing slab pages containing
the vmas.

The plan is to move vma_lock into vma itself anyway, so if the rest can
be freed immediately, the whole destructor support won't be needed
anymore.
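The destructor contract described above (called once per object, after the grace period, before the memory is reused) can be sketched as a minimal userspace model. This is an assumption-laden illustration, not the kernel implementation; the dtor field corresponds to the sheaf_rcu_dtor argument introduced in patch 2.

```c
/* Userspace model (not kernel code) of the sheaf_rcu_dtor contract:
 * every object batched for deferred freeing has the destructor run
 * exactly once, after the modeled "grace period" (the reclaim call),
 * before the sheaf becomes empty and reusable. */
#include <assert.h>
#include <stddef.h>

typedef void (*rcu_dtor_t)(void *obj);

struct rcu_sheaf {
	rcu_dtor_t dtor;   /* corresponds to kmem_cache_args.sheaf_rcu_dtor */
	size_t size;
	void *objects[32];
};

/* models rcu_free_sheaf(): runs once the grace period has elapsed */
static void rcu_sheaf_reclaim(struct rcu_sheaf *s)
{
	for (size_t i = 0; i < s->size; i++)
		if (s->dtor)
			s->dtor(s->objects[i]);
	s->size = 0; /* sheaf is now empty and reusable */
}
```

For vmas, the destructor would perform the vma_numab_state_free() / free_anon_vma_name() / vma_lock_free() cleanups mentioned above.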

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 kernel/fork.c | 27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 22f43721d031d48fd5be2606e86642334be9735f..9b1ae5aaf6a58fded6c9ac378809296825eba9fa 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -516,22 +516,24 @@ void __vm_area_free(struct vm_area_struct *vma)
 	kmem_cache_free(vm_area_cachep, vma);
 }
 
-#ifdef CONFIG_PER_VMA_LOCK
-static void vm_area_free_rcu_cb(struct rcu_head *head)
+static void __vma_area_rcu_free_dtor(void *ptr)
 {
-	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
-						  vm_rcu);
+	struct vm_area_struct *vma = ptr;
 
 	/* The vma should not be locked while being destroyed. */
+#ifdef CONFIG_PER_VMA_LOCK
 	VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock->lock), vma);
-	__vm_area_free(vma);
-}
 #endif
 
+	vma_numab_state_free(vma);
+	free_anon_vma_name(vma);
+	vma_lock_free(vma);
+}
+
 void vm_area_free(struct vm_area_struct *vma)
 {
 #ifdef CONFIG_PER_VMA_LOCK
-	call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
+	kfree_rcu(vma, vm_rcu);
 #else
 	__vm_area_free(vma);
 #endif
@@ -3155,6 +3157,12 @@ void __init mm_cache_init(void)
 
 void __init proc_caches_init(void)
 {
+	struct kmem_cache_args vm_args = {
+		.align = __alignof__(struct vm_area_struct),
+		.sheaf_capacity = 32,
+		.sheaf_rcu_dtor = __vma_area_rcu_free_dtor,
+	};
+
 	sighand_cachep = kmem_cache_create("sighand_cache",
 			sizeof(struct sighand_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
@@ -3172,7 +3180,10 @@ void __init proc_caches_init(void)
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
 			NULL);
 
-	vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC|SLAB_ACCOUNT);
+	vm_area_cachep = kmem_cache_create("vm_area_struct",
+				sizeof(struct vm_area_struct), &vm_args,
+				SLAB_PANIC|SLAB_ACCOUNT);
+
 #ifdef CONFIG_PER_VMA_LOCK
 	vma_lock_cachep = KMEM_CACHE(vma_lock, SLAB_PANIC|SLAB_ACCOUNT);
 #endif

-- 
2.47.0



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH RFC 5/6] mm, slub: cheaper locking for percpu sheaves
  2024-11-12 16:38 [PATCH RFC 0/6] SLUB percpu sheaves Vlastimil Babka
                   ` (3 preceding siblings ...)
  2024-11-12 16:38 ` [PATCH RFC 4/6] mm, vma: use sheaves for vm_area_struct cache Vlastimil Babka
@ 2024-11-12 16:38 ` Vlastimil Babka
  2024-11-12 16:38 ` [PATCH RFC 6/6] mm, slub: sheaf prefilling for guaranteed allocations Vlastimil Babka
  5 siblings, 0 replies; 19+ messages in thread
From: Vlastimil Babka @ 2024-11-12 16:38 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Pekka Enberg, Joonsoo Kim
  Cc: Roman Gushchin, Hyeonggon Yoo, Paul E. McKenney, Lorenzo Stoakes,
	Matthew Wilcox, Boqun Feng, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, Mateusz Guzik, Jann Horn,
	Vlastimil Babka

Instead of local_lock_irqsave(), use just get_cpu_ptr() (which only
disables preemption) and then set an active flag. If potential callers
include an irq handler, the operation must use a trylock variant that
bails out if the flag is already set to active, because we have
interrupted another operation in progress.

Changing the flag doesn't need to be atomic as the irq is handled on
the same cpu. This should make using percpu sheaves cheaper, with the
downside that some unlucky operations in irq handlers have to fall back
to non-sheaf variants. That should be rare, so there should be a net
benefit.

On PREEMPT_RT we can simply use local_lock() as that does the right
thing without the need to disable irqs.

Thanks to Mateusz Guzik and Jann Horn for suggesting this kind of
locking scheme in online conversations. I initially attempted to fully
copy the page allocator's pcplist locking, but its reliance on
spin_trylock() made it much more costly.

Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slub.c | 230 +++++++++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 174 insertions(+), 56 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 6811d766c0470cd7066c2574ad86e00405c916bb..1900afa6153ca6d88f9df7db3ce84d98629489e7 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -450,14 +450,111 @@ struct slab_sheaf {
 	void *objects[];
 };
 
+struct local_tryirq_lock {
+#ifndef CONFIG_PREEMPT_RT
+	int active;
+#else
+	local_lock_t llock;
+#endif
+};
+
 struct slub_percpu_sheaves {
-	local_lock_t lock;
+	struct local_tryirq_lock lock;
 	struct slab_sheaf *main; /* never NULL when unlocked */
 	struct slab_sheaf *spare; /* empty or full, may be NULL */
 	struct slab_sheaf *rcu_free;
 	struct node_barn *barn;
 };
 
+/*
+ * Generic helper to lookup a per-cpu variable with a lock that allows only
+ * trylock from irq handler context to avoid expensive irq disable or atomic
+ * operations and memory barriers - only compiler barriers are needed.
+ *
+ * On !PREEMPT_RT this is done by get_cpu_ptr(), which disables preemption, and
+ * checking that a variable is not already set to 1. If it is, it means we are
+ * in irq handler that has interrupted the locked operation, and must give up.
+ * Otherwise we set the variable to 1.
+ *
+ * On PREEMPT_RT we can simply use local_lock() as that does the right thing
+ * without actually disabling irqs. Thus the trylock can't actually fail.
+ *
+ */
+#ifndef CONFIG_PREEMPT_RT
+
+#define pcpu_local_tryirq_lock(type, member, ptr)                       \
+({                                                                      \
+	type *_ret;                                                     \
+	lockdep_assert(!irq_count());					\
+	_ret = get_cpu_ptr(ptr);                                        \
+	lockdep_assert(_ret->member.active == 0);			\
+	WRITE_ONCE(_ret->member.active, 1);				\
+	barrier();							\
+	_ret;                                                           \
+})
+
+#define pcpu_local_tryirq_trylock(type, member, ptr)                    \
+({                                                                      \
+	type *_ret;                                                     \
+	_ret = get_cpu_ptr(ptr);                                        \
+	if (unlikely(READ_ONCE(_ret->member.active) == 1)) {		\
+		put_cpu_ptr(ptr);					\
+		_ret = NULL;						\
+	} else {                                                        \
+		WRITE_ONCE(_ret->member.active, 1);			\
+		barrier();						\
+	}								\
+	_ret;                                                           \
+})
+
+#define pcpu_local_tryirq_unlock(member, ptr)                           \
+({                                                                      \
+	lockdep_assert(this_cpu_ptr(ptr)->member.active == 1);		\
+	barrier();							\
+	WRITE_ONCE(this_cpu_ptr(ptr)->member.active, 0);		\
+	put_cpu_ptr(ptr);						\
+})
+
+#define local_tryirq_lock_init(lock)					\
+({									\
+	(lock)->active = 0;						\
+})
+
+#else
+
+#define pcpu_local_tryirq_lock(type, member, ptr)                       \
+({                                                                      \
+	type *_ret;                                                     \
+	local_lock(&ptr->member.llock);					\
+	_ret = this_cpu_ptr(ptr);                                       \
+	_ret;                                                           \
+})
+
+#define pcpu_local_tryirq_trylock(type, member, ptr)                    \
+	pcpu_local_tryirq_lock(type, member, ptr)
+
+#define pcpu_local_tryirq_unlock(member, ptr)                           \
+({                                                                      \
+	local_unlock(&ptr->member.llock);				\
+})
+
+#define local_tryirq_lock_init(lock)					\
+({									\
+	local_lock_init(&(lock)->llock);				\
+})
+
+#endif
+
+/* struct slub_percpu_sheaves specific helpers. */
+#define cpu_sheaves_lock(ptr)                                           \
+	pcpu_local_tryirq_lock(struct slub_percpu_sheaves, lock, ptr)
+
+#define cpu_sheaves_trylock(ptr)                                        \
+	pcpu_local_tryirq_trylock(struct slub_percpu_sheaves, lock, ptr)
+
+#define cpu_sheaves_unlock(ptr)                                         \
+	pcpu_local_tryirq_unlock(lock, ptr)
+
 /*
  * The slab lists for all objects.
  */
@@ -2517,17 +2614,20 @@ static struct slab_sheaf *alloc_full_sheaf(struct kmem_cache *s, gfp_t gfp)
 
 static void __kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);
 
-static void sheaf_flush_main(struct kmem_cache *s)
+/* returns true if at least partially flushed */
+static bool sheaf_flush_main(struct kmem_cache *s)
 {
 	struct slub_percpu_sheaves *pcs;
 	unsigned int batch, remaining;
 	void *objects[PCS_BATCH_MAX];
 	struct slab_sheaf *sheaf;
-	unsigned long flags;
+	bool ret = false;
 
 next_batch:
-	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
-	pcs = this_cpu_ptr(s->cpu_sheaves);
+	pcs = cpu_sheaves_trylock(s->cpu_sheaves);
+	if (!pcs)
+		return ret;
+
 	sheaf = pcs->main;
 
 	batch = min(PCS_BATCH_MAX, sheaf->size);
@@ -2537,14 +2637,18 @@ static void sheaf_flush_main(struct kmem_cache *s)
 
 	remaining = sheaf->size;
 
-	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+	cpu_sheaves_unlock(s->cpu_sheaves);
 
 	__kmem_cache_free_bulk(s, batch, &objects[0]);
 
 	stat_add(s, SHEAF_FLUSH_MAIN, batch);
 
+	ret = true;
+
 	if (remaining)
 		goto next_batch;
+
+	return ret;
 }
 
 static void sheaf_flush(struct kmem_cache *s, struct slab_sheaf *sheaf)
@@ -2581,6 +2685,8 @@ static void rcu_free_sheaf_nobarn(struct rcu_head *head)
  * Caller needs to make sure migration is disabled in order to fully flush
  * single cpu's sheaves
  *
+ * must not be called from an irq
+ *
  * flushing operations are rare so let's keep it simple and flush to slabs
  * directly, skipping the barn
  */
@@ -2588,10 +2694,8 @@ static void pcs_flush_all(struct kmem_cache *s)
 {
 	struct slub_percpu_sheaves *pcs;
 	struct slab_sheaf *spare, *rcu_free;
-	unsigned long flags;
 
-	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
-	pcs = this_cpu_ptr(s->cpu_sheaves);
+	pcs = cpu_sheaves_lock(s->cpu_sheaves);
 
 	spare = pcs->spare;
 	pcs->spare = NULL;
@@ -2599,7 +2703,7 @@ static void pcs_flush_all(struct kmem_cache *s)
 	rcu_free = pcs->rcu_free;
 	pcs->rcu_free = NULL;
 
-	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+	cpu_sheaves_unlock(s->cpu_sheaves);
 
 	if (spare) {
 		sheaf_flush(s, spare);
@@ -4523,11 +4627,11 @@ static __fastpath_inline
 void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
 {
 	struct slub_percpu_sheaves *pcs;
-	unsigned long flags;
 	void *object;
 
-	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
-	pcs = this_cpu_ptr(s->cpu_sheaves);
+	pcs = cpu_sheaves_trylock(s->cpu_sheaves);
+	if (!pcs)
+		return NULL;
 
 	if (unlikely(pcs->main->size == 0)) {
 
@@ -4559,7 +4663,7 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
 			}
 		}
 
-		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+		cpu_sheaves_unlock(s->cpu_sheaves);
 
 		if (!can_alloc)
 			return NULL;
@@ -4581,8 +4685,11 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
 		if (!full)
 			return NULL;
 
-		local_lock_irqsave(&s->cpu_sheaves->lock, flags);
-		pcs = this_cpu_ptr(s->cpu_sheaves);
+		/*
+		 * we can reach here only when gfpflags_allow_blocking
+		 * so this must not be an irq
+		 */
+		pcs = cpu_sheaves_lock(s->cpu_sheaves);
 
 		/*
 		 * If we are returning empty sheaf, we either got it from the
@@ -4615,7 +4722,7 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
 do_alloc:
 	object = pcs->main->objects[--pcs->main->size];
 
-	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+	cpu_sheaves_unlock(s->cpu_sheaves);
 
 	stat(s, ALLOC_PCS);
 
@@ -4627,13 +4734,13 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 {
 	struct slub_percpu_sheaves *pcs;
 	struct slab_sheaf *main;
-	unsigned long flags;
 	unsigned int allocated = 0;
 	unsigned int batch;
 
 next_batch:
-	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
-	pcs = this_cpu_ptr(s->cpu_sheaves);
+	pcs = cpu_sheaves_trylock(s->cpu_sheaves);
+	if (!pcs)
+		return allocated;
 
 	if (unlikely(pcs->main->size == 0)) {
 
@@ -4652,7 +4759,7 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 			goto do_alloc;
 		}
 
-		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+		cpu_sheaves_unlock(s->cpu_sheaves);
 
 		/*
 		 * Once full sheaves in barn are depleted, let the bulk
@@ -4670,7 +4777,7 @@ unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 	main->size -= batch;
 	memcpy(p, main->objects + main->size, batch * sizeof(void *));
 
-	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+	cpu_sheaves_unlock(s->cpu_sheaves);
 
 	stat_add(s, ALLOC_PCS, batch);
 
@@ -5090,14 +5197,14 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
  * The object is expected to have passed slab_free_hook() already.
  */
 static __fastpath_inline
-void free_to_pcs(struct kmem_cache *s, void *object)
+bool free_to_pcs(struct kmem_cache *s, void *object)
 {
 	struct slub_percpu_sheaves *pcs;
-	unsigned long flags;
 
 restart:
-	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
-	pcs = this_cpu_ptr(s->cpu_sheaves);
+	pcs = cpu_sheaves_trylock(s->cpu_sheaves);
+	if (!pcs)
+		return false;
 
 	if (unlikely(pcs->main->size == s->sheaf_capacity)) {
 
@@ -5131,7 +5238,7 @@ void free_to_pcs(struct kmem_cache *s, void *object)
 			struct slab_sheaf *to_flush = pcs->spare;
 
 			pcs->spare = NULL;
-			local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+			cpu_sheaves_unlock(s->cpu_sheaves);
 
 			sheaf_flush(s, to_flush);
 			empty = to_flush;
@@ -5139,18 +5246,27 @@ void free_to_pcs(struct kmem_cache *s, void *object)
 		}
 
 alloc_empty:
-		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+		cpu_sheaves_unlock(s->cpu_sheaves);
 
 		empty = alloc_empty_sheaf(s, GFP_NOWAIT);
 
 		if (!empty) {
-			sheaf_flush_main(s);
-			goto restart;
+			if (sheaf_flush_main(s))
+				goto restart;
+			else
+				return false;
 		}
 
 got_empty:
-		local_lock_irqsave(&s->cpu_sheaves->lock, flags);
-		pcs = this_cpu_ptr(s->cpu_sheaves);
+		pcs = cpu_sheaves_trylock(s->cpu_sheaves);
+		if (!pcs) {
+			struct node_barn *barn;
+
+			barn = get_node(s, numa_mem_id())->barn;
+
+			barn_put_empty_sheaf(barn, empty, true);
+			return false;
+		}
 
 		/*
 		 * if we put any sheaf to barn here, it's because we raced or
@@ -5178,9 +5294,11 @@ void free_to_pcs(struct kmem_cache *s, void *object)
 do_free:
 	pcs->main->objects[pcs->main->size++] = object;
 
-	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+	cpu_sheaves_unlock(s->cpu_sheaves);
 
 	stat(s, FREE_PCS);
+
+	return true;
 }
 
 static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
@@ -5242,10 +5360,10 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
 {
 	struct slub_percpu_sheaves *pcs;
 	struct slab_sheaf *rcu_sheaf;
-	unsigned long flags;
 
-	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
-	pcs = this_cpu_ptr(s->cpu_sheaves);
+	pcs = cpu_sheaves_trylock(s->cpu_sheaves);
+	if (!pcs)
+		goto fail;
 
 	if (unlikely(!pcs->rcu_free)) {
 
@@ -5258,17 +5376,16 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
 			goto do_free;
 		}
 
-		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+		cpu_sheaves_unlock(s->cpu_sheaves);
 
 		empty = alloc_empty_sheaf(s, GFP_NOWAIT);
 
-		if (!empty) {
-			stat(s, FREE_RCU_SHEAF_FAIL);
-			return false;
-		}
+		if (!empty)
+			goto fail;
 
-		local_lock_irqsave(&s->cpu_sheaves->lock, flags);
-		pcs = this_cpu_ptr(s->cpu_sheaves);
+		pcs = cpu_sheaves_trylock(s->cpu_sheaves);
+		if (!pcs)
+			goto fail;
 
 		if (unlikely(pcs->rcu_free))
 			barn_put_empty_sheaf(pcs->barn, empty, true);
@@ -5283,19 +5400,22 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
 	rcu_sheaf->objects[rcu_sheaf->size++] = obj;
 
 	if (likely(rcu_sheaf->size < s->sheaf_capacity)) {
-		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+		cpu_sheaves_unlock(s->cpu_sheaves);
 		stat(s, FREE_RCU_SHEAF);
 		return true;
 	}
 
 	pcs->rcu_free = NULL;
-	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+	cpu_sheaves_unlock(s->cpu_sheaves);
 
 	call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);
 
 	stat(s, FREE_RCU_SHEAF);
-
 	return true;
+
+fail:
+	stat(s, FREE_RCU_SHEAF_FAIL);
+	return false;
 }
 
 /*
@@ -5307,7 +5427,6 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 {
 	struct slub_percpu_sheaves *pcs;
 	struct slab_sheaf *main;
-	unsigned long flags;
 	unsigned int batch, i = 0;
 	bool init;
 
@@ -5330,8 +5449,9 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 	}
 
 next_batch:
-	local_lock_irqsave(&s->cpu_sheaves->lock, flags);
-	pcs = this_cpu_ptr(s->cpu_sheaves);
+	pcs = cpu_sheaves_trylock(s->cpu_sheaves);
+	if (!pcs)
+		goto fallback;
 
 	if (unlikely(pcs->main->size == s->sheaf_capacity)) {
 
@@ -5361,13 +5481,13 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 		}
 
 no_empty:
-		local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+		cpu_sheaves_unlock(s->cpu_sheaves);
 
 		/*
 		 * if we depleted all empty sheaves in the barn or there are too
 		 * many full sheaves, free the rest to slab pages
 		 */
-
+fallback:
 		__kmem_cache_free_bulk(s, size, p);
 		return;
 	}
@@ -5379,7 +5499,7 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 	memcpy(main->objects + main->size, p, batch * sizeof(void *));
 	main->size += batch;
 
-	local_unlock_irqrestore(&s->cpu_sheaves->lock, flags);
+	cpu_sheaves_unlock(s->cpu_sheaves);
 
 	stat_add(s, FREE_PCS, batch);
 
@@ -5479,9 +5599,7 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
 	if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
 		return;
 
-	if (s->cpu_sheaves)
-		free_to_pcs(s, object);
-	else
+	if (!s->cpu_sheaves || !free_to_pcs(s, object))
 		do_slab_free(s, slab, object, object, 1, addr);
 }
 
@@ -6121,7 +6239,7 @@ static int init_percpu_sheaves(struct kmem_cache *s)
 
 		pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
 
-		local_lock_init(&pcs->lock);
+		local_tryirq_lock_init(&pcs->lock);
 
 		nid = cpu_to_mem(cpu);
 

-- 
2.47.0




* [PATCH RFC 6/6] mm, slub: sheaf prefilling for guaranteed allocations
  2024-11-12 16:38 [PATCH RFC 0/6] SLUB percpu sheaves Vlastimil Babka
                   ` (4 preceding siblings ...)
  2024-11-12 16:38 ` [PATCH RFC 5/6] mm, slub: cheaper locking for percpu sheaves Vlastimil Babka
@ 2024-11-12 16:38 ` Vlastimil Babka
  2024-11-18 13:13   ` Hyeonggon Yoo
  5 siblings, 1 reply; 19+ messages in thread
From: Vlastimil Babka @ 2024-11-12 16:38 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Pekka Enberg, Joonsoo Kim
  Cc: Roman Gushchin, Hyeonggon Yoo, Paul E. McKenney, Lorenzo Stoakes,
	Matthew Wilcox, Boqun Feng, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, Vlastimil Babka

Add three functions for efficient guaranteed allocations in a critical
section (that cannot sleep) when the exact number of allocations is not
known beforehand, but an upper limit can be calculated.

kmem_cache_prefill_sheaf() returns a sheaf containing at least the
given number of objects.

kmem_cache_alloc_from_sheaf() will allocate an object from the sheaf
and is guaranteed not to fail until the sheaf is depleted.

kmem_cache_return_sheaf() is for giving the sheaf back to the slab
allocator after the critical section. This will also attempt to refill
the sheaf to the cache's sheaf capacity for more efficient handling of
sheaves, but the refill is not strictly required to succeed.

TODO: the current implementation is limited to cache's sheaf_capacity

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/slab.h |  11 ++++
 mm/slub.c            | 149 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 160 insertions(+)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 23904321992ad2eeb9389d0883cf4d5d5d71d896..a87dc3c6392fe235de2eabe1792df86d40c3bbf9 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -820,6 +820,17 @@ void *kmem_cache_alloc_node_noprof(struct kmem_cache *s, gfp_t flags,
 				   int node) __assume_slab_alignment __malloc;
 #define kmem_cache_alloc_node(...)	alloc_hooks(kmem_cache_alloc_node_noprof(__VA_ARGS__))
 
+struct slab_sheaf *
+kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int count);
+
+void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
+				       struct slab_sheaf *sheaf);
+
+void *kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *cachep, gfp_t gfp,
+			struct slab_sheaf *sheaf) __assume_slab_alignment __malloc;
+#define kmem_cache_alloc_from_sheaf(...)	\
+			alloc_hooks(kmem_cache_alloc_from_sheaf_noprof(__VA_ARGS__))
+
 /*
  * These macros allow declaring a kmem_buckets * parameter alongside size, which
  * can be compiled out with CONFIG_SLAB_BUCKETS=n so that a large number of call
diff --git a/mm/slub.c b/mm/slub.c
index 1900afa6153ca6d88f9df7db3ce84d98629489e7..a0e2cb7dfb5173f39f36bea1eb9760c3c1b99dd7 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -444,6 +444,7 @@ struct slab_sheaf {
 	union {
 		struct rcu_head rcu_head;
 		struct list_head barn_list;
+		bool oversize;
 	};
 	struct kmem_cache *cache;
 	unsigned int size;
@@ -2819,6 +2820,30 @@ static int barn_put_full_sheaf(struct node_barn *barn, struct slab_sheaf *sheaf,
 	return ret;
 }
 
+static struct slab_sheaf *barn_get_full_or_empty_sheaf(struct node_barn *barn)
+{
+	struct slab_sheaf *sheaf = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	if (barn->nr_empty) {
+		sheaf = list_first_entry(&barn->sheaves_empty,
+					 struct slab_sheaf, barn_list);
+		list_del(&sheaf->barn_list);
+		barn->nr_empty--;
+	} else if (barn->nr_full) {
+		sheaf = list_first_entry(&barn->sheaves_full, struct slab_sheaf,
+					barn_list);
+		list_del(&sheaf->barn_list);
+		barn->nr_full--;
+	}
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+
+	return sheaf;
+}
+
 /*
  * If a full sheaf is available, return it and put the supplied empty one to
  * barn. We ignore the limit on empty sheaves as the number of sheaves doesn't
@@ -4893,6 +4918,130 @@ void *kmem_cache_alloc_node_noprof(struct kmem_cache *s, gfp_t gfpflags, int nod
 }
 EXPORT_SYMBOL(kmem_cache_alloc_node_noprof);
 
+
+/*
+ * returns a sheaf that has at least the given count of objects
+ * when prefilling is needed, do so with given gfp flags
+ *
+ * return NULL if prefilling failed, or when the requested count is
+ * above cache's sheaf_capacity (TODO: lift this limitation)
+ */
+struct slab_sheaf *
+kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int count)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct slab_sheaf *sheaf = NULL;
+
+	//TODO: handle via oversize sheaf
+	if (count > s->sheaf_capacity)
+		return NULL;
+
+	pcs = cpu_sheaves_lock(s->cpu_sheaves);
+
+	if (pcs->spare && pcs->spare->size > 0) {
+		sheaf = pcs->spare;
+		pcs->spare = NULL;
+	}
+
+	if (!sheaf)
+		sheaf = barn_get_full_or_empty_sheaf(pcs->barn);
+
+	cpu_sheaves_unlock(s->cpu_sheaves);
+
+	if (!sheaf)
+		sheaf = alloc_empty_sheaf(s, gfp);
+
+	if (sheaf && sheaf->size < count) {
+		if (refill_sheaf(s, sheaf, gfp)) {
+			sheaf_flush(s, sheaf);
+			free_empty_sheaf(s, sheaf);
+			sheaf = NULL;
+		}
+	}
+
+	return sheaf;
+}
+
+/*
+ * Use this to return a sheaf obtained by kmem_cache_prefill_sheaf()
+ * It tries to refill the sheaf back to the cache's sheaf_capacity
+ * to avoid handling partially full sheaves.
+ *
+ * If the refill fails because gfp is e.g. GFP_NOWAIT, the sheaf is
+ * instead dissolved
+ */
+void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
+			     struct slab_sheaf *sheaf)
+{
+	struct slub_percpu_sheaves *pcs;
+	bool refill = false;
+	struct node_barn *barn;
+
+	//TODO: handle oversize sheaf
+
+	pcs = cpu_sheaves_lock(s->cpu_sheaves);
+
+	if (!pcs->spare) {
+		pcs->spare = sheaf;
+		sheaf = NULL;
+	}
+
+	/* racy check */
+	if (!sheaf && pcs->barn->nr_full >= MAX_FULL_SHEAVES) {
+		barn = pcs->barn;
+		refill = true;
+	}
+
+	cpu_sheaves_unlock(s->cpu_sheaves);
+
+	if (!sheaf)
+		return;
+
+	/*
+	 * if the barn is full of full sheaves or we fail to refill the sheaf,
+	 * simply flush and free it
+	 */
+	if (!refill || refill_sheaf(s, sheaf, gfp)) {
+		sheaf_flush(s, sheaf);
+		free_empty_sheaf(s, sheaf);
+		return;
+	}
+
+	/* we racily determined the sheaf would fit, so now force it */
+	barn_put_full_sheaf(barn, sheaf, true);
+}
+
+/*
+ * Allocate from a sheaf obtained by kmem_cache_prefill_sheaf()
+ *
+ * Guaranteed not to fail as many allocations as was the requested count.
+ * After the sheaf is emptied, it fails - no fallback to the slab cache itself.
+ *
+ * The gfp parameter is meant only to specify __GFP_ZERO or __GFP_ACCOUNT.
+ * Memcg charging is forced over the limit if necessary, to avoid failure.
+ */
+void *
+kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *s, gfp_t gfp,
+				   struct slab_sheaf *sheaf)
+{
+	void *ret = NULL;
+	bool init;
+
+	if (sheaf->size == 0)
+		goto out;
+
+	ret = sheaf->objects[--sheaf->size];
+
+	init = slab_want_init_on_alloc(gfp, s);
+
+	/* add __GFP_NOFAIL to force successful memcg charging */
+	slab_post_alloc_hook(s, NULL, gfp | __GFP_NOFAIL, 1, &ret, init, s->object_size);
+out:
+	trace_kmem_cache_alloc(_RET_IP_, ret, s, gfp, NUMA_NO_NODE);
+
+	return ret;
+}
+
 /*
  * To avoid unnecessary overhead, we pass through large allocation requests
  * directly to the page allocator. We use __GFP_COMP, because we will need to

-- 
2.47.0




* Re: [PATCH RFC 2/6] mm/slub: add sheaf support for batching kfree_rcu() operations
  2024-11-12 16:38 ` [PATCH RFC 2/6] mm/slub: add sheaf support for batching kfree_rcu() operations Vlastimil Babka
@ 2024-11-14 16:57   ` Uladzislau Rezki
  2024-11-17 11:01     ` Vlastimil Babka
  0 siblings, 1 reply; 19+ messages in thread
From: Uladzislau Rezki @ 2024-11-14 16:57 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Pekka Enberg, Joonsoo Kim, Roman Gushchin,
	Hyeonggon Yoo, Paul E. McKenney, Lorenzo Stoakes, Matthew Wilcox,
	Boqun Feng, Uladzislau Rezki, linux-mm, linux-kernel, rcu,
	maple-tree

On Tue, Nov 12, 2024 at 05:38:46PM +0100, Vlastimil Babka wrote:
> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
> For caches where sheaves are initialized, on each cpu maintain a rcu_free
> sheaf in addition to main and spare sheaves.
> 
> kfree_rcu() operations will try to put objects on this sheaf. Once full,
> the sheaf is detached and submitted to call_rcu() with a handler that
> will try to put in on the barn, or flush to slab pages using bulk free,
> when the barn is full. Then a new empty sheaf must be obtained to put
> more objects there.
> 
> It's possible that no free sheaves are available to use for a new
> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
> kfree_rcu() machinery.
> 
> Because some intended users will need to perform additional cleanups
> after the grace period and thus have custom call_rcu() callbacks today,
> add the possibility to specify a kfree_rcu() specific destructor.
> Because of the fallback possibility, the destructor now needs to be
> invoked also from within RCU, so add __kvfree_rcu() that RCU can use
> instead of kvfree().
> 
> Expected advantages:
> - batching the kfree_rcu() operations, that could eventually replace the
>   batching done in RCU itself
> - sheaves can be reused via barn instead of being flushed to slabs, which
>   is more effective
>   - this includes cases where only some cpus are allowed to process rcu
>     callbacks (Android)
> 
> Possible disadvantage:
> - objects might be waiting for more than their grace period (it is
>   determined by the last object freed into the sheaf), increasing memory
>   usage - but that might be true for the batching done by RCU as well?
> 
> RFC LIMITATIONS: - only tree rcu is converted, not tiny
> - the rcu fallback might resort to kfree_bulk(), not kvfree(). Instead
>   of adding a variant of kfree_bulk() with destructors, is there an easy
>   way to disable the kfree_bulk() path in the fallback case?
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  include/linux/slab.h |  15 +++++
>  kernel/rcu/tree.c    |   8 ++-
>  mm/slab.h            |  25 +++++++
>  mm/slab_common.c     |   3 +
>  mm/slub.c            | 182 +++++++++++++++++++++++++++++++++++++++++++++++++--
>  5 files changed, 227 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index b13fb1c1f03c14a5b45bc6a64a2096883aef9f83..23904321992ad2eeb9389d0883cf4d5d5d71d896 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -343,6 +343,21 @@ struct kmem_cache_args {
>  	 * %0 means no sheaves will be created
>  	 */
>  	unsigned int sheaf_capacity;
> +	/**
> +	 * @sheaf_rcu_dtor: A destructor for objects freed by kfree_rcu()
> +	 *
> +	 * Only valid when non-zero @sheaf_capacity is specified. When freeing
> +	 * objects by kfree_rcu() in a cache with sheaves, the objects are put
> +	 * to a special percpu sheaf. When that sheaf is full, it's passed to
> +	 * call_rcu() and after a grace period the sheaf can be reused for new
> +	 * allocations. In case a cleanup is necessary after the grace period
> +	 * and before reusal, a pointer to such function can be given as
> +	 * @sheaf_rcu_dtor and will be called on each object in the rcu sheaf
> +	 * after the grace period passes and before the sheaf's reuse.
> +	 *
> +	 * %NULL means no destructor is called.
> +	 */
> +	void (*sheaf_rcu_dtor)(void *obj);
>  };
>  
>  struct kmem_cache *__kmem_cache_create_args(const char *name,
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index b1f883fcd9185a5e22c10102d1024c40688f57fb..42c994fdf9960bfed8d8bd697de90af72c1f4f58 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -65,6 +65,7 @@
>  #include <linux/kasan.h>
>  #include <linux/context_tracking.h>
>  #include "../time/tick-internal.h"
> +#include "../../mm/slab.h"
>  
>  #include "tree.h"
>  #include "rcu.h"
> @@ -3420,7 +3421,7 @@ kvfree_rcu_list(struct rcu_head *head)
>  		trace_rcu_invoke_kvfree_callback(rcu_state.name, head, offset);
>  
>  		if (!WARN_ON_ONCE(!__is_kvfree_rcu_offset(offset)))
> -			kvfree(ptr);
> +			__kvfree_rcu(ptr);
>  
>  		rcu_lock_release(&rcu_callback_map);
>  		cond_resched_tasks_rcu_qs();
> @@ -3797,6 +3798,9 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
>  	if (!head)
>  		might_sleep();
>  
> +	if (kfree_rcu_sheaf(ptr))
> +		return;
> +
>
This change cuts across all the effort which has been done in order to improve kvfree_rcu() :)

For example:
  performance, app launch improvements for Android devices;
  memory consumption optimizations to minimize LMK triggering;
  batching to speed-up offloading;
  etc.

So we have done a lot of work there. We were thinking about moving all
functionality from "kernel/rcu" to "mm/". As a first step I can do that,
i.e. move kvfree_rcu() as is. After that we can switch to the second step.

Does that sound good to you or not?

--
Uladzislau Rezki



* Re: [PATCH RFC 2/6] mm/slub: add sheaf support for batching kfree_rcu() operations
  2024-11-14 16:57   ` Uladzislau Rezki
@ 2024-11-17 11:01     ` Vlastimil Babka
  2024-11-20 12:37       ` Uladzislau Rezki
  0 siblings, 1 reply; 19+ messages in thread
From: Vlastimil Babka @ 2024-11-17 11:01 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Pekka Enberg, Joonsoo Kim, Roman Gushchin,
	Hyeonggon Yoo, Paul E. McKenney, Lorenzo Stoakes, Matthew Wilcox,
	Boqun Feng, linux-mm, linux-kernel, rcu, maple-tree

On 11/14/24 17:57, Uladzislau Rezki wrote:
> On Tue, Nov 12, 2024 at 05:38:46PM +0100, Vlastimil Babka wrote:
>> --- a/kernel/rcu/tree.c
>> +++ b/kernel/rcu/tree.c
>> @@ -65,6 +65,7 @@
>>  #include <linux/kasan.h>
>>  #include <linux/context_tracking.h>
>>  #include "../time/tick-internal.h"
>> +#include "../../mm/slab.h"
>>  
>>  #include "tree.h"
>>  #include "rcu.h"
>> @@ -3420,7 +3421,7 @@ kvfree_rcu_list(struct rcu_head *head)
>>  		trace_rcu_invoke_kvfree_callback(rcu_state.name, head, offset);
>>  
>>  		if (!WARN_ON_ONCE(!__is_kvfree_rcu_offset(offset)))
>> -			kvfree(ptr);
>> +			__kvfree_rcu(ptr);
>>  
>>  		rcu_lock_release(&rcu_callback_map);
>>  		cond_resched_tasks_rcu_qs();
>> @@ -3797,6 +3798,9 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
>>  	if (!head)
>>  		might_sleep();
>>  
>> +	if (kfree_rcu_sheaf(ptr))
>> +		return;
>> +
>>
> This change crosses all effort which has been done in order to improve kvfree_rcu :)

Yeah I know, but it wasn't intended to make it all obsolete as I don't think
every kfree_rcu() user would have a sheaf-enabled cache.

> For example:
>   performance, app launch improvements for Android devices;
>   memory consumption optimizations to minimize LMK triggering;
>   batching to speed-up offloading;
>   etc.

Yes it's a great effort that I appreciate and you did probably all that was
possible to do without changing the slab allocator itself.

> So we have done a lot of work there. We were thinking about moving all
> functionality from "kernel/rcu" to "mm/". As a first step i can do that,
> i.e. move kvfree_rcu() as is. After that we can switch to second step.

Yeah we have discussed that with Paul at LSF/MM as well and I agreed it
makes sense, but didn't get to it yet.

> Sounds good for you or not?

Sounds good, thanks!

> --
> Uladzislau Rezki




* Re: [PATCH RFC 6/6] mm, slub: sheaf prefilling for guaranteed allocations
  2024-11-12 16:38 ` [PATCH RFC 6/6] mm, slub: sheaf prefilling for guaranteed allocations Vlastimil Babka
@ 2024-11-18 13:13   ` Hyeonggon Yoo
  2024-11-18 14:26     ` Vlastimil Babka
  0 siblings, 1 reply; 19+ messages in thread
From: Hyeonggon Yoo @ 2024-11-18 13:13 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Pekka Enberg, Joonsoo Kim, Roman Gushchin,
	Paul E. McKenney, Lorenzo Stoakes, Matthew Wilcox, Boqun Feng,
	Uladzislau Rezki, linux-mm, linux-kernel, rcu, maple-tree

On Wed, Nov 13, 2024 at 1:39 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> Add three functions for efficient guaranteed allocations in a critical
> section (that cannot sleep) when the exact number of allocations is not
> known beforehand, but an upper limit can be calculated.
>
> kmem_cache_prefill_sheaf() returns a sheaf containing at least given
> number of objects.
>
> kmem_cache_alloc_from_sheaf() will allocate an object from the sheaf
> and is guaranteed not to fail until depleted.
>
> kmem_cache_return_sheaf() is for giving the sheaf back to the slab
> allocator after the critical section. This will also attempt to refill
> it to cache's sheaf capacity for better efficiency of sheaves handling,
> but it's not strictly necessary to succeed.
>
> TODO: the current implementation is limited to cache's sheaf_capacity
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  include/linux/slab.h |  11 ++++
>  mm/slub.c            | 149 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 160 insertions(+)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 23904321992ad2eeb9389d0883cf4d5d5d71d896..a87dc3c6392fe235de2eabe1792df86d40c3bbf9 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -820,6 +820,17 @@ void *kmem_cache_alloc_node_noprof(struct kmem_cache *s, gfp_t flags,
>                                    int node) __assume_slab_alignment __malloc;
>  #define kmem_cache_alloc_node(...)     alloc_hooks(kmem_cache_alloc_node_noprof(__VA_ARGS__))
>
> +struct slab_sheaf *
> +kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int count);
> +
> +void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
> +                                      struct slab_sheaf *sheaf);
> +
> +void *kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *cachep, gfp_t gfp,
> +                       struct slab_sheaf *sheaf) __assume_slab_alignment __malloc;
> +#define kmem_cache_alloc_from_sheaf(...)       \
> +                       alloc_hooks(kmem_cache_alloc_from_sheaf_noprof(__VA_ARGS__))
> +
>  /*
>   * These macros allow declaring a kmem_buckets * parameter alongside size, which
>   * can be compiled out with CONFIG_SLAB_BUCKETS=n so that a large number of call
> diff --git a/mm/slub.c b/mm/slub.c
> index 1900afa6153ca6d88f9df7db3ce84d98629489e7..a0e2cb7dfb5173f39f36bea1eb9760c3c1b99dd7 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -444,6 +444,7 @@ struct slab_sheaf {
>         union {
>                 struct rcu_head rcu_head;
>                 struct list_head barn_list;
> +               bool oversize;
>         };
>         struct kmem_cache *cache;
>         unsigned int size;
> @@ -2819,6 +2820,30 @@ static int barn_put_full_sheaf(struct node_barn *barn, struct slab_sheaf *sheaf,
>         return ret;
>  }
>
> +static struct slab_sheaf *barn_get_full_or_empty_sheaf(struct node_barn *barn)
> +{
> +       struct slab_sheaf *sheaf = NULL;
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&barn->lock, flags);
> +
> +       if (barn->nr_empty) {
> +               sheaf = list_first_entry(&barn->sheaves_empty,
> +                                        struct slab_sheaf, barn_list);
> +               list_del(&sheaf->barn_list);
> +               barn->nr_empty--;
> +       } else if (barn->nr_full) {
> +               sheaf = list_first_entry(&barn->sheaves_full, struct slab_sheaf,
> +                                       barn_list);
> +               list_del(&sheaf->barn_list);
> +               barn->nr_full--;
> +       }
> +
> +       spin_unlock_irqrestore(&barn->lock, flags);
> +
> +       return sheaf;
> +}
> +
>  /*
>   * If a full sheaf is available, return it and put the supplied empty one to
>   * barn. We ignore the limit on empty sheaves as the number of sheaves doesn't
> @@ -4893,6 +4918,130 @@ void *kmem_cache_alloc_node_noprof(struct kmem_cache *s, gfp_t gfpflags, int nod
>  }
>  EXPORT_SYMBOL(kmem_cache_alloc_node_noprof);
>
> +
> +/*
> + * returns a sheaf that has at least the given count of objects
> + * when prefilling is needed, do so with given gfp flags
> + *
> + * return NULL if prefilling failed, or when the requested count is
> + * above cache's sheaf_capacity (TODO: lift this limitation)
> + */
> +struct slab_sheaf *
> +kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int count)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +       struct slab_sheaf *sheaf = NULL;
> +
> +       //TODO: handle via oversize sheaf
> +       if (count > s->sheaf_capacity)
> +               return NULL;
> +
> +       pcs = cpu_sheaves_lock(s->cpu_sheaves);
> +
> +       if (pcs->spare && pcs->spare->size > 0) {
> +               sheaf = pcs->spare;
> +               pcs->spare = NULL;
> +       }
> +
> +       if (!sheaf)
> +               sheaf = barn_get_full_or_empty_sheaf(pcs->barn);
> +
> +       cpu_sheaves_unlock(s->cpu_sheaves);
> +
> +       if (!sheaf)
> +               sheaf = alloc_empty_sheaf(s, gfp);
> +
> +       if (sheaf && sheaf->size < count) {
> +               if (refill_sheaf(s, sheaf, gfp)) {
> +                       sheaf_flush(s, sheaf);
> +                       free_empty_sheaf(s, sheaf);
> +                       sheaf = NULL;
> +               }
> +       }
> +
> +       return sheaf;
> +}
> +
> +/*
> + * Use this to return a sheaf obtained by kmem_cache_prefill_sheaf()
> + * It tries to refill the sheaf back to the cache's sheaf_capacity
> + * to avoid handling partially full sheaves.
> + *
> + * If the refill fails because gfp is e.g. GFP_NOWAIT, the sheaf is
> + * instead dissolved
> + */
> +void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
> +                            struct slab_sheaf *sheaf)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +       bool refill = false;
> +       struct node_barn *barn;
> +
> +       //TODO: handle oversize sheaf
> +
> +       pcs = cpu_sheaves_lock(s->cpu_sheaves);
> +
> +       if (!pcs->spare) {
> +               pcs->spare = sheaf;
> +               sheaf = NULL;
> +       }
> +
> +       /* racy check */
> +       if (!sheaf && pcs->barn->nr_full >= MAX_FULL_SHEAVES) {
> +               barn = pcs->barn;
> +               refill = true;
> +       }
> +
> +       cpu_sheaves_unlock(s->cpu_sheaves);
> +
> +       if (!sheaf)
> +               return;
> +
> +       /*
> +        * if the barn is full of full sheaves or we fail to refill the sheaf,
> +        * simply flush and free it
> +        */
> +       if (!refill || refill_sheaf(s, sheaf, gfp)) {
> +               sheaf_flush(s, sheaf);
> +               free_empty_sheaf(s, sheaf);
> +               return;
> +       }
> +
> +       /* we racily determined the sheaf would fit, so now force it */
> +       barn_put_full_sheaf(barn, sheaf, true);
> +}
> +
> +/*
> + * Allocate from a sheaf obtained by kmem_cache_prefill_sheaf()
> + *
> + * Guaranteed not to fail as many allocations as was the requested count.
> + * After the sheaf is emptied, it fails - no fallback to the slab cache itself.
> + *
> + * The gfp parameter is meant only to specify __GFP_ZERO or __GFP_ACCOUNT
> + * memcg charging is forced over limit if necessary, to avoid failure.
> + */
> +void *
> +kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *s, gfp_t gfp,
> +                                  struct slab_sheaf *sheaf)
> +{
> +       void *ret = NULL;
> +       bool init;
> +
> +       if (sheaf->size == 0)
> +               goto out;
> +
> +       ret = sheaf->objects[--sheaf->size];
> +
> +       init = slab_want_init_on_alloc(gfp, s);
> +
> +       /* add __GFP_NOFAIL to force successful memcg charging */
> +       slab_post_alloc_hook(s, NULL, gfp | __GFP_NOFAIL, 1, &ret, init, s->object_size);

Maybe I'm missing something, but how can this be used for non-sleepable contexts
if __GFP_NOFAIL is used? I think we have to charge them when the sheaf
is returned via kmem_cache_prefill_sheaf(), just like users of bulk
alloc/free?

Best,
Hyeonggon

> +out:
> +       trace_kmem_cache_alloc(_RET_IP_, ret, s, gfp, NUMA_NO_NODE);
> +
> +       return ret;
> +}
> +
>  /*
>   * To avoid unnecessary overhead, we pass through large allocation requests
>   * directly to the page allocator. We use __GFP_COMP, because we will need to
>
> --
> 2.47.0
>



* Re: [PATCH RFC 6/6] mm, slub: sheaf prefilling for guaranteed allocations
  2024-11-18 13:13   ` Hyeonggon Yoo
@ 2024-11-18 14:26     ` Vlastimil Babka
  2024-11-19  2:29       ` Hyeonggon Yoo
  0 siblings, 1 reply; 19+ messages in thread
From: Vlastimil Babka @ 2024-11-18 14:26 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Pekka Enberg, Joonsoo Kim, Roman Gushchin,
	Paul E. McKenney, Lorenzo Stoakes, Matthew Wilcox, Boqun Feng,
	Uladzislau Rezki, linux-mm, linux-kernel, rcu, maple-tree

On 11/18/24 14:13, Hyeonggon Yoo wrote:
> On Wed, Nov 13, 2024 at 1:39 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>> +
>> +/*
>> + * Allocate from a sheaf obtained by kmem_cache_prefill_sheaf()
>> + *
>> + * Guaranteed not to fail as many allocations as was the requested count.
>> + * After the sheaf is emptied, it fails - no fallback to the slab cache itself.
>> + *
>> + * The gfp parameter is meant only to specify __GFP_ZERO or __GFP_ACCOUNT
>> + * memcg charging is forced over limit if necessary, to avoid failure.
>> + */
>> +void *
>> +kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *s, gfp_t gfp,
>> +                                  struct slab_sheaf *sheaf)
>> +{
>> +       void *ret = NULL;
>> +       bool init;
>> +
>> +       if (sheaf->size == 0)
>> +               goto out;
>> +
>> +       ret = sheaf->objects[--sheaf->size];
>> +
>> +       init = slab_want_init_on_alloc(gfp, s);
>> +
>> +       /* add __GFP_NOFAIL to force successful memcg charging */
>> +       slab_post_alloc_hook(s, NULL, gfp | __GFP_NOFAIL, 1, &ret, init, s->object_size);
> 
> Maybe I'm missing something, but how can this be used for non-sleepable contexts
> if __GFP_NOFAIL is used? I think we have to charge them when the sheaf

AFAIK it forces memcg to simply charge even if allocated memory goes over
the memcg limit. So there's no issue with a non-sleepable context, there
shouldn't be memcg reclaim happening in that case.

> is returned
> via kmem_cache_prefill_sheaf(), just like users of bulk alloc/free?

That would be very costly to charge/uncharge if most of the objects are not
actually used - it's what we want to avoid here.
Going over the memcgs limit a bit in a very rare case isn't considered such
an issue, for example Linus advocated such approach too in another context.

> Best,
> Hyeonggon
> 
>> +out:
>> +       trace_kmem_cache_alloc(_RET_IP_, ret, s, gfp, NUMA_NO_NODE);
>> +
>> +       return ret;
>> +}
>> +
>>  /*
>>   * To avoid unnecessary overhead, we pass through large allocation requests
>>   * directly to the page allocator. We use __GFP_COMP, because we will need to
>>
>> --
>> 2.47.0
>>




* Re: [PATCH RFC 6/6] mm, slub: sheaf prefilling for guaranteed allocations
  2024-11-18 14:26     ` Vlastimil Babka
@ 2024-11-19  2:29       ` Hyeonggon Yoo
  2024-11-19  8:27         ` Vlastimil Babka
  0 siblings, 1 reply; 19+ messages in thread
From: Hyeonggon Yoo @ 2024-11-19  2:29 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Pekka Enberg, Joonsoo Kim, Roman Gushchin,
	Paul E. McKenney, Lorenzo Stoakes, Matthew Wilcox, Boqun Feng,
	Uladzislau Rezki, linux-mm, linux-kernel, rcu, maple-tree

On Mon, Nov 18, 2024 at 11:26 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 11/18/24 14:13, Hyeonggon Yoo wrote:
> > On Wed, Nov 13, 2024 at 1:39 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >> +
> >> +/*
> >> + * Allocate from a sheaf obtained by kmem_cache_prefill_sheaf()
> >> + *
> >> + * Guaranteed not to fail as many allocations as was the requested count.
> >> + * After the sheaf is emptied, it fails - no fallback to the slab cache itself.
> >> + *
> >> + * The gfp parameter is meant only to specify __GFP_ZERO or __GFP_ACCOUNT
> >> + * memcg charging is forced over limit if necessary, to avoid failure.
> >> + */
> >> +void *
> >> +kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *s, gfp_t gfp,
> >> +                                  struct slab_sheaf *sheaf)
> >> +{
> >> +       void *ret = NULL;
> >> +       bool init;
> >> +
> >> +       if (sheaf->size == 0)
> >> +               goto out;
> >> +
> >> +       ret = sheaf->objects[--sheaf->size];
> >> +
> >> +       init = slab_want_init_on_alloc(gfp, s);
> >> +
> >> +       /* add __GFP_NOFAIL to force successful memcg charging */
> >> +       slab_post_alloc_hook(s, NULL, gfp | __GFP_NOFAIL, 1, &ret, init, s->object_size);
> >
> > Maybe I'm missing something, but how can this be used for non-sleepable contexts
> > if __GFP_NOFAIL is used? I think we have to charge them when the sheaf
>
> AFAIK it forces memcg to simply charge even if allocated memory goes over
> the memcg limit. So there's no issue with a non-sleepable context, there
> shouldn't be memcg reclaim happening in that case.

Ok, but I am still worried about mem alloc profiling/memcg trying to
allocate some memory
with __GFP_NOFAIL flag and eventually passing it to the buddy allocator,
which does not want __GFP_NOFAIL without __GFP_DIRECT_RECLAIM?
e.g.) memcg hook calls
alloc_slab_obj_exts()->kcalloc_node()->....->alloc_pages()

> > is returned
> > via kmem_cache_prefill_sheaf(), just like users of bulk alloc/free?
>
> That would be very costly to charge/uncharge if most of the objects are not
> actually used - it's what we want to avoid here.
> Going over the memcgs limit a bit in a very rare case isn't considered such
> an issue, for example Linus advocated such approach too in another context.

Thanks for the explanation! That was a point I was missing.

> > Best,
> > Hyeonggon
> >
> >> +out:
> >> +       trace_kmem_cache_alloc(_RET_IP_, ret, s, gfp, NUMA_NO_NODE);
> >> +
> >> +       return ret;
> >> +}
> >> +
> >>  /*
> >>   * To avoid unnecessary overhead, we pass through large allocation requests
> >>   * directly to the page allocator. We use __GFP_COMP, because we will need to
> >>
> >> --
> >> 2.47.0
> >>
>



* Re: [PATCH RFC 6/6] mm, slub: sheaf prefilling for guaranteed allocations
  2024-11-19  2:29       ` Hyeonggon Yoo
@ 2024-11-19  8:27         ` Vlastimil Babka
  0 siblings, 0 replies; 19+ messages in thread
From: Vlastimil Babka @ 2024-11-19  8:27 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Pekka Enberg, Joonsoo Kim, Roman Gushchin,
	Paul E. McKenney, Lorenzo Stoakes, Matthew Wilcox, Boqun Feng,
	Uladzislau Rezki, linux-mm, linux-kernel, rcu, maple-tree

On 11/19/24 03:29, Hyeonggon Yoo wrote:
> On Mon, Nov 18, 2024 at 11:26 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 11/18/24 14:13, Hyeonggon Yoo wrote:
>> > On Wed, Nov 13, 2024 at 1:39 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>> >> +
>> >> +/*
>> >> + * Allocate from a sheaf obtained by kmem_cache_prefill_sheaf()
>> >> + *
>> >> + * Guaranteed not to fail as many allocations as was the requested count.
>> >> + * After the sheaf is emptied, it fails - no fallback to the slab cache itself.
>> >> + *
>> >> + * The gfp parameter is meant only to specify __GFP_ZERO or __GFP_ACCOUNT
>> >> + * memcg charging is forced over limit if necessary, to avoid failure.
>> >> + */
>> >> +void *
>> >> +kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *s, gfp_t gfp,
>> >> +                                  struct slab_sheaf *sheaf)
>> >> +{
>> >> +       void *ret = NULL;
>> >> +       bool init;
>> >> +
>> >> +       if (sheaf->size == 0)
>> >> +               goto out;
>> >> +
>> >> +       ret = sheaf->objects[--sheaf->size];
>> >> +
>> >> +       init = slab_want_init_on_alloc(gfp, s);
>> >> +
>> >> +       /* add __GFP_NOFAIL to force successful memcg charging */
>> >> +       slab_post_alloc_hook(s, NULL, gfp | __GFP_NOFAIL, 1, &ret, init, s->object_size);
>> >
>> > Maybe I'm missing something, but how can this be used for non-sleepable contexts
>> > if __GFP_NOFAIL is used? I think we have to charge them when the sheaf
>>
>> AFAIK it forces memcg to simply charge even if allocated memory goes over
>> the memcg limit. So there's no issue with a non-sleepable context, there
>> shouldn't be memcg reclaim happening in that case.
> 
> Ok, but I am still worried about mem alloc profiling/memcg trying to
> allocate some memory
> with __GFP_NOFAIL flag and eventually passing it to the buddy allocator,
> which does not want __GFP_NOFAIL without __GFP_DIRECT_RECLAIM?
> e.g.) memcg hook calls
> alloc_slab_obj_exts()->kcalloc_node()->....->alloc_pages()

alloc_slab_obj_exts() removes __GFP_NOFAIL via OBJCGS_CLEAR_MASK so that's fine.
I think kmemleak_alloc_recursive() is also fine as it ends up in
mem_pool_alloc() and will clear __GFP_NOFAIL via gfp_nested_mask().
Hope I'm not missing something else.

>> > is returned
>> > via kmem_cache_prefill_sheaf(), just like users of bulk alloc/free?
>>
>> That would be very costly to charge/uncharge if most of the objects are not
>> actually used - it's what we want to avoid here.
>> Going over the memcgs limit a bit in a very rare case isn't considered such
>> an issue, for example Linus advocated such approach too in another context.
> 
> Thanks for the explanation! That was a point I was missing.
> 
>> > Best,
>> > Hyeonggon
>> >
>> >> +out:
>> >> +       trace_kmem_cache_alloc(_RET_IP_, ret, s, gfp, NUMA_NO_NODE);
>> >> +
>> >> +       return ret;
>> >> +}
>> >> +
>> >>  /*
>> >>   * To avoid unnecessary overhead, we pass through large allocation requests
>> >>   * directly to the page allocator. We use __GFP_COMP, because we will need to
>> >>
>> >> --
>> >> 2.47.0
>> >>
>>




* Re: [PATCH RFC 2/6] mm/slub: add sheaf support for batching kfree_rcu() operations
  2024-11-17 11:01     ` Vlastimil Babka
@ 2024-11-20 12:37       ` Uladzislau Rezki
  2024-11-25 11:02         ` Vlastimil Babka
  0 siblings, 1 reply; 19+ messages in thread
From: Uladzislau Rezki @ 2024-11-20 12:37 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Uladzislau Rezki, Suren Baghdasaryan, Liam R. Howlett,
	Christoph Lameter, David Rientjes, Pekka Enberg, Joonsoo Kim,
	Roman Gushchin, Hyeonggon Yoo, Paul E. McKenney, Lorenzo Stoakes,
	Matthew Wilcox, Boqun Feng, linux-mm, linux-kernel, rcu,
	maple-tree

On Sun, Nov 17, 2024 at 12:01:01PM +0100, Vlastimil Babka wrote:
> On 11/14/24 17:57, Uladzislau Rezki wrote:
> > On Tue, Nov 12, 2024 at 05:38:46PM +0100, Vlastimil Babka wrote:
> >> --- a/kernel/rcu/tree.c
> >> +++ b/kernel/rcu/tree.c
> >> @@ -65,6 +65,7 @@
> >>  #include <linux/kasan.h>
> >>  #include <linux/context_tracking.h>
> >>  #include "../time/tick-internal.h"
> >> +#include "../../mm/slab.h"
> >>  
> >>  #include "tree.h"
> >>  #include "rcu.h"
> >> @@ -3420,7 +3421,7 @@ kvfree_rcu_list(struct rcu_head *head)
> >>  		trace_rcu_invoke_kvfree_callback(rcu_state.name, head, offset);
> >>  
> >>  		if (!WARN_ON_ONCE(!__is_kvfree_rcu_offset(offset)))
> >> -			kvfree(ptr);
> >> +			__kvfree_rcu(ptr);
> >>  
> >>  		rcu_lock_release(&rcu_callback_map);
> >>  		cond_resched_tasks_rcu_qs();
> >> @@ -3797,6 +3798,9 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
> >>  	if (!head)
> >>  		might_sleep();
> >>  
> >> +	if (kfree_rcu_sheaf(ptr))
> >> +		return;
> >> +
> >>
> > This change crosses all effort which has been done in order to improve kvfree_rcu :)
> 
> Yeah I know, but it wasn't intended to make it all obsolete as I don't think
> every kfree_rcu() user would have a sheaf-enabled cache.
> 
> > For example:
> >   performance, app launch improvements for Android devices;
> >   memory consumption optimizations to minimize LMK triggering;
> >   batching to speed-up offloading;
> >   etc.
> 
> Yes it's a great effort that I appreciate and you did probably all that was
> possible to do without changing the slab allocator itself.
> 
> > So we have done a lot of work there. We were thinking about moving all
> > functionality from "kernel/rcu" to "mm/". As a first step i can do that,
> > i.e. move kvfree_rcu() as is. After that we can switch to second step.
> 
> Yeah we have discussed that with Paul at LSF/MM as well and I agreed it
> makes sense, but didn't get to it yet.
> 
> > Sounds good for you or not?
> 
> Sounds good, thanks!
> 
Thank you. Let me try to start moving it into mm/. I am thinking to place
it to the slab_common.c file. I am not sure if it makes sense to have a
dedicated file name for this purpose.

Anyway, share your view if you want to add something. Otherwise i can
proceed with that process.

--
Uladzislau Rezki



* Re: [PATCH RFC 2/6] mm/slub: add sheaf support for batching kfree_rcu() operations
  2024-11-20 12:37       ` Uladzislau Rezki
@ 2024-11-25 11:02         ` Vlastimil Babka
  2024-11-25 11:18           ` Uladzislau Rezki
  0 siblings, 1 reply; 19+ messages in thread
From: Vlastimil Babka @ 2024-11-25 11:02 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Pekka Enberg, Joonsoo Kim, Roman Gushchin,
	Hyeonggon Yoo, Paul E. McKenney, Lorenzo Stoakes, Matthew Wilcox,
	Boqun Feng, linux-mm, linux-kernel, rcu, maple-tree

On 11/20/24 13:37, Uladzislau Rezki wrote:
> Thank you. Let me try to start moving it into mm/. I am thinking to place
> it to the slab_common.c file. I am not sure if it makes sense to have a
> dedicated file name for this purpose.

Yeah sounds good. slub.c is becoming rather large and this should not
interact with SLUB internals heavily anyway, slab_common.c makes sense.
Thanks!

> Anyway, share your view if you want to add something. Otherwise i can
> proceed with that process.
> 
> --
> Uladzislau Rezki




* Re: [PATCH RFC 2/6] mm/slub: add sheaf support for batching kfree_rcu() operations
  2024-11-25 11:02         ` Vlastimil Babka
@ 2024-11-25 11:18           ` Uladzislau Rezki
  2024-11-28 16:24             ` Uladzislau Rezki
  0 siblings, 1 reply; 19+ messages in thread
From: Uladzislau Rezki @ 2024-11-25 11:18 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Pekka Enberg, Joonsoo Kim, Roman Gushchin,
	Hyeonggon Yoo, Paul E. McKenney, Lorenzo Stoakes, Matthew Wilcox,
	Boqun Feng, linux-mm, linux-kernel, rcu, maple-tree

> On 11/20/24 13:37, Uladzislau Rezki wrote:
> > Thank you. Let me try to start moving it into mm/. I am thinking to place
> > it to the slab_common.c file. I am not sure if it makes sense to have a
> > dedicated file name for this purpose.
>
> Yeah sounds good. slub.c is becoming rather large and this should not
> interact with SLUB internals heavily anyway, slab_common.c makes sense.
> Thanks!
>
Got it :)

Thanks!

-- 
Uladzislau Rezki



* Re: [PATCH RFC 2/6] mm/slub: add sheaf support for batching kfree_rcu() operations
  2024-11-25 11:18           ` Uladzislau Rezki
@ 2024-11-28 16:24             ` Uladzislau Rezki
  2024-11-29 13:54               ` Vlastimil Babka
  0 siblings, 1 reply; 19+ messages in thread
From: Uladzislau Rezki @ 2024-11-28 16:24 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Vlastimil Babka, Suren Baghdasaryan, Liam R. Howlett,
	Christoph Lameter, David Rientjes, Pekka Enberg, Joonsoo Kim,
	Roman Gushchin, Hyeonggon Yoo, Paul E. McKenney, Lorenzo Stoakes,
	Matthew Wilcox, Boqun Feng, linux-mm, linux-kernel, rcu,
	maple-tree

On Mon, Nov 25, 2024 at 12:18:19PM +0100, Uladzislau Rezki wrote:
> > On 11/20/24 13:37, Uladzislau Rezki wrote:
> > > Thank you. Let me try to start moving it into mm/. I am thinking to place
> > > it to the slab_common.c file. I am not sure if it makes sense to have a
> > > dedicated file name for this purpose.
> >
> > Yeah sounds good. slub.c is becoming rather large and this should not
> > interact with SLUB internals heavily anyway, slab_common.c makes sense.
> > Thanks!
> >
> Got it :)
> 
There is one question. Do you think it works if i do a migration as one
big commit? I am asking, because it is easier to go that way.

If it looks ugly for you, we can use another approach which is to split
the work into several patches and deploy it as a series.

--
Uladzislau Rezki



* Re: [PATCH RFC 2/6] mm/slub: add sheaf support for batching kfree_rcu() operations
  2024-11-28 16:24             ` Uladzislau Rezki
@ 2024-11-29 13:54               ` Vlastimil Babka
  2024-11-29 14:20                 ` Uladzislau Rezki
  0 siblings, 1 reply; 19+ messages in thread
From: Vlastimil Babka @ 2024-11-29 13:54 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Pekka Enberg, Joonsoo Kim, Roman Gushchin,
	Hyeonggon Yoo, Paul E. McKenney, Lorenzo Stoakes, Matthew Wilcox,
	Boqun Feng, linux-mm, linux-kernel, rcu, maple-tree

On 11/28/24 5:24 PM, Uladzislau Rezki wrote:
> On Mon, Nov 25, 2024 at 12:18:19PM +0100, Uladzislau Rezki wrote:
>>> On 11/20/24 13:37, Uladzislau Rezki wrote:
>>>> Thank you. Let me try to start moving it into mm/. I am thinking to place
>>>> it to the slab_common.c file. I am not sure if it makes sense to have a
>>>> dedicated file name for this purpose.
>>>
>>> Yeah sounds good. slub.c is becoming rather large and this should not
>>> interact with SLUB internals heavily anyway, slab_common.c makes sense.
>>> Thanks!
>>>
>> Got it :)
>>
> There is one question. Do you think it works if i do a migration as one
> big commit? I am asking, because it is easier to go that way.

Hi, I think one big move is fine. In case there are further adjustments,
those would be better in separate patches. That makes it easy to review
the move with git diff --color-moved.

Thanks,
Vlastimil

> If it looks ugly for you, we can use another approach which is to split
> the work into several patches and deploy it as a series.
> 
> --
> Uladzislau Rezki




* Re: [PATCH RFC 2/6] mm/slub: add sheaf support for batching kfree_rcu() operations
  2024-11-29 13:54               ` Vlastimil Babka
@ 2024-11-29 14:20                 ` Uladzislau Rezki
  0 siblings, 0 replies; 19+ messages in thread
From: Uladzislau Rezki @ 2024-11-29 14:20 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Uladzislau Rezki, Suren Baghdasaryan, Liam R. Howlett,
	Christoph Lameter, David Rientjes, Pekka Enberg, Joonsoo Kim,
	Roman Gushchin, Hyeonggon Yoo, Paul E. McKenney, Lorenzo Stoakes,
	Matthew Wilcox, Boqun Feng, linux-mm, linux-kernel, rcu,
	maple-tree

Hello!

> On 11/28/24 5:24 PM, Uladzislau Rezki wrote:
> > On Mon, Nov 25, 2024 at 12:18:19PM +0100, Uladzislau Rezki wrote:
> >>> On 11/20/24 13:37, Uladzislau Rezki wrote:
> >>>> Thank you. Let me try to start moving it into mm/. I am thinking to place
> >>>> it to the slab_common.c file. I am not sure if it makes sense to have a
> >>>> dedicated file name for this purpose.
> >>>
> >>> Yeah sounds good. slub.c is becoming rather large and this should not
> >>> interact with SLUB internals heavily anyway, slab_common.c makes sense.
> >>> Thanks!
> >>>
> >> Got it :)
> >>
> > There is one question. Do you think it works if i do a migration as one
> > big commit? I am asking, because it is easier to go that way.
> 
> Hi, I think one big move is fine. In case there are further adjustments,
> those would be better in separate patches. That makes it easy to review
> the move with git diff --color-moved.
> 
Sounds good and thank you!

--
Uladzislau Rezki



end of thread, other threads:[~2024-11-29 14:20 UTC | newest]

Thread overview: 19+ messages
2024-11-12 16:38 [PATCH RFC 0/6] SLUB percpu sheaves Vlastimil Babka
2024-11-12 16:38 ` [PATCH RFC 1/6] mm/slub: add opt-in caching layer of " Vlastimil Babka
2024-11-12 16:38 ` [PATCH RFC 2/6] mm/slub: add sheaf support for batching kfree_rcu() operations Vlastimil Babka
2024-11-14 16:57   ` Uladzislau Rezki
2024-11-17 11:01     ` Vlastimil Babka
2024-11-20 12:37       ` Uladzislau Rezki
2024-11-25 11:02         ` Vlastimil Babka
2024-11-25 11:18           ` Uladzislau Rezki
2024-11-28 16:24             ` Uladzislau Rezki
2024-11-29 13:54               ` Vlastimil Babka
2024-11-29 14:20                 ` Uladzislau Rezki
2024-11-12 16:38 ` [PATCH RFC 3/6] maple_tree: use percpu sheaves for maple_node_cache Vlastimil Babka
2024-11-12 16:38 ` [PATCH RFC 4/6] mm, vma: use sheaves for vm_area_struct cache Vlastimil Babka
2024-11-12 16:38 ` [PATCH RFC 5/6] mm, slub: cheaper locking for percpu sheaves Vlastimil Babka
2024-11-12 16:38 ` [PATCH RFC 6/6] mm, slub: sheaf prefilling for guaranteed allocations Vlastimil Babka
2024-11-18 13:13   ` Hyeonggon Yoo
2024-11-18 14:26     ` Vlastimil Babka
2024-11-19  2:29       ` Hyeonggon Yoo
2024-11-19  8:27         ` Vlastimil Babka
