* [PATCH v8 00/23] SLUB percpu sheaves
From: Vlastimil Babka @ 2025-09-10  8:01 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, vbabka,
	Alexei Starovoitov, Sebastian Andrzej Siewior,
	Venkat Rao Bagalkote, Qianfeng Rong, Wei Yang,
	Matthew Wilcox (Oracle),
	Andrew Morton, Lorenzo Stoakes, WangYuli, Jann Horn,
	Liam R. Howlett, Pedro Falcato

Hi,

I'm sending a full v8 because changes in the middle of the series
required fixing up later patches for conflicts (details in the changelog
below).
The v8 will replace and extend the v7 in slab/for-next.

===

This series adds an opt-in percpu array-based caching layer to SLUB.
It has evolved to a state where kmem caches with sheaves are compatible
with all SLUB features (slub_debug, SLUB_TINY, NUMA locality
considerations). My hope is therefore that it can eventually be enabled
for all kmem caches and replace the cpu (partial) slabs.

Note the name "sheaf" was invented by Matthew Wilcox so that we don't
call the arrays "magazines" as in the original Bonwick paper. The
per-NUMA-node cache of sheaves is thus called a "barn".

This caching may seem similar to the arrays in SLAB, but there are some
important differences:

- it does not distinguish NUMA locality, thus there are no per-node
  "shared" arrays (with possible lock contention) and no "alien" arrays
  that would need periodic flushing
  - NUMA-restricted allocations and strict_numa mode are still honoured;
    the percpu sheaves are bypassed for those allocations
  - a later patch (for separate evaluation) makes freeing remote objects
    bypass sheaves, so sheaves contain mostly (not strictly) local objects
- improves kfree_rcu() handling by reusing whole sheaves
- there is an API for obtaining a preallocated sheaf that can be used
  for guaranteed and efficient allocations in a restricted context, when
  the upper bound for needed objects is known but rarely reached
- opt-in, not used for every cache (for now); see the creation sketch
  just below
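
Opting a cache in goes through struct kmem_cache_args (extended in patch
03). A minimal sketch, with a made-up cache name and object type, and
the capacity of 32 that is also benchmarked below:

  struct kmem_cache_args args = {
          .align          = __alignof__(struct my_obj),
          .sheaf_capacity = 32,
  };

  cache = kmem_cache_create("my_obj_cache", sizeof(struct my_obj),
                            &args, SLAB_HWCACHE_ALIGN);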

The motivation comes mainly from the ongoing work related to VMA locking
scalability and the related maple tree operations. This is why the VMA
and maple node caches are sheaf-enabled in the patchset. Since v5 I
include Liam's patches for the full maple tree conversion that uses the
improved preallocation API.

A sheaf-enabled cache has the following expected advantages:

- Cheaper fast paths. For allocations, instead of a local double cmpxchg,
  thanks to local_trylock() it becomes a preempt_disable() with no atomic
  operations. The same goes for freeing, which is otherwise a local double
  cmpxchg only for short-lived allocations (where the same slab is still
  active on the same cpu when the object is freed) and a more costly
  locked double cmpxchg otherwise.

- kfree_rcu() batching and recycling. kfree_rcu() will put objects to a
  separate percpu sheaf and only submit the whole sheaf to call_rcu()
  when full. After the grace period, the sheaf can be used for
  allocations, which is more efficient than freeing and reallocating
  individual slab objects (even with the batching done by the kfree_rcu()
  implementation itself). In case only some cpus are allowed to handle rcu
  callbacks, the sheaf can still be made available to other cpus on the
  same node via the shared barn. The maple_node cache uses kfree_rcu() and
  thus can benefit from this.
  Note: this path is currently limited to !PREEMPT_RT

- Preallocation support. A prefilled sheaf can be privately borrowed to
  perform a short-term operation that is not allowed to block in the
  middle and may need to allocate some objects. If an upper bound (worst
  case) for the number of allocations is known, but far fewer allocations
  are typically needed, borrowing and returning a sheaf is much more
  efficient than a bulk allocation for the worst case followed by a bulk
  free of the many unused objects. Maple tree write operations should
  benefit from this (see the sketch after this list).

- Compatibility with slub_debug. When slub_debug is enabled for a cache,
  we simply don't create the percpu sheaves so that the debugging hooks
  (at the node partial list slowpaths) are reached as before. The same
  thing is done for CONFIG_SLUB_TINY. Sheaf preallocation still works by
  reusing the (ineffective) paths for requests exceeding the cache's
  sheaf_capacity. This is in line with the existing approach where
  debugging bypasses the fast paths and SLUB_TINY prefers memory
  savings over performance.
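
To illustrate the preallocation usage mentioned above, here is a minimal
sketch of the intended pattern, assuming "cache" is a sheaf-enabled
kmem_cache as created earlier. The function names follow the later
"slab: sheaf prefilling for guaranteed allocations" patch (not quoted in
this excerpt), so treat the exact signatures and error handling as
approximate:

  struct slab_sheaf *sheaf;
  void *obj;

  /* worst case: at most 8 objects will be needed; prefill may block */
  sheaf = kmem_cache_prefill_sheaf(cache, GFP_KERNEL, 8);
  if (!sheaf)
          return -ENOMEM;

  /* ... the operation that must not block in the middle ... */
  obj = kmem_cache_alloc_from_sheaf(cache, GFP_NOWAIT, sheaf);
  /* ... on average far fewer than 8 allocations actually happen ... */

  /* unused objects stay in the sheaf and are handed back in one step */
  kmem_cache_return_sheaf(cache, GFP_KERNEL, sheaf);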

GIT TREES:

this series: https://git.kernel.org/vbabka/l/slub-percpu-sheaves-v8r2
It is based on v6.17-rc3.

this series plus a microbenchmark hacked into slub_kunit:
https://git.kernel.org/vbabka/l/slub-percpu-sheaves-v8-benchmarking

It allows evaluating the overhead of the added sheaves code, and the
benefits for single-threaded allocations/frees of varying batch size. I
plan to look into adding multi-threaded scenarios too.

The last commit there also adds sheaves to every cache to allow
measuring effects on caches other than vma and maple node. Note these
measurements should be compared to slab_nomerge boots without sheaves,
as adding sheaves makes caches unmergeable.

RESULTS:

In order to get some numbers that are only due to differences in
implementation, with no cache layout side-effects in users of the slab
objects etc., I have started with an in-kernel microbenchmark that does
allocations and frees from a slab cache with or without sheaves and/or
memcg. It either alternates single object alloc and free, or allocates
10 objects and frees them, then 100, then 1000 - in order to see the
effects of exhausting the percpu sheaves or barn, or (without sheaves)
the percpu slabs. The order of objects to free can also be shuffled
instead of FIFO - to stress the non-sheaf freeing slowpath more.
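
One round of such a benchmark does roughly the following (a simplified
sketch, not the actual code from the -benchmarking branch; bench_iter()
and its parameters are made up for illustration, error handling is
omitted):

  /* needs linux/slab.h, linux/random.h, linux/ktime.h, linux/minmax.h */
  static u64 bench_iter(struct kmem_cache *s, void **objs,
                        unsigned int batch, bool shuffle)
  {
          unsigned int i;
          u64 start = ktime_get_ns();

          for (i = 0; i < batch; i++)
                  objs[i] = kmem_cache_alloc(s, GFP_KERNEL);

          if (shuffle) {
                  /* free in random order to stress the non-sheaf slowpath */
                  for (i = batch - 1; i > 0; i--)
                          swap(objs[i], objs[get_random_u32_below(i + 1)]);
          }

          for (i = 0; i < batch; i++)
                  kmem_cache_free(s, objs[i]);

          return ktime_get_ns() - start;
  }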

Measurements done on Ryzen 7 5700, bare metal.

The first question was how just having the sheaves implementation affects
existing no-sheaf caches due to the extra (unused) code. I have experimented
with changing inlining and adding unlikely() to the sheaves case. The
optimum seems to be what's currently in the implementation - fast-path
sheaves usage is inlined, any handling of the main sheaf being empty on
alloc or full on free is a separate function, and the if (s->sheaf_capacity)
check has neither likely() nor unlikely(). When I added unlikely() it
destroyed the performance of sheaves completely.

So the result is that with batch size 10, there's 2.4% overhead, and the
other cases are all impacted less than this. Hopefully this is acceptable
given the plan that eventually there would be sheaves everywhere and the
current cpu (partial) slabs scheme removed.

As for the benefits of enabling sheaves (capacity=32), see the results
below; it all looks good here. Of course this microbenchmark is not a
complete story, for at least these reasons:

- no kfree_rcu() evaluation
- it doesn't show barn spinlock contention effects. In theory these
  shouldn't be worse than without sheaves, because after exhausting the
  cpu (partial) slabs, the list_lock has to be taken. Sheaf capacity vs
  capacity of partial slabs is a matter of tuning.

 ---------------------------------
 BATCH SIZE: 1 SHUFFLED: NO
 ---------------------------------

 bench: no memcg, no sheaves
 average (excl. iter 0): 115660272
 bench: no memcg, sheaves
 average (excl. iter 0): 95734972
 sheaves better by 17.2%
 bench: memcg, no sheaves
 average (excl. iter 0): 163682964
 bench: memcg, sheaves
 average (excl. iter 0): 144792803
 sheaves better by 11.5%

 ---------------------------------
 BATCH SIZE: 10 SHUFFLED: NO
 ---------------------------------

 bench: no memcg, no sheaves
 average (excl. iter 0): 115496906
 bench: no memcg, sheaves
 average (excl. iter 0): 97781102
 sheaves better by 15.3%
 bench: memcg, no sheaves
 average (excl. iter 0): 162771491
 bench: memcg, sheaves
 average (excl. iter 0): 144746490
 sheaves better by 11.0%

 ---------------------------------
 BATCH SIZE: 100 SHUFFLED: NO
 ---------------------------------

 bench: no memcg, no sheaves
 average (excl. iter 0): 151796052
 bench: no memcg, sheaves
 average (excl. iter 0): 104641753
 sheaves better by 31.0%
 bench: memcg, no sheaves
 average (excl. iter 0): 200733436
 bench: memcg, sheaves
 average (excl. iter 0): 151340989
 sheaves better by 24.6%

 ---------------------------------
 BATCH SIZE: 1000 SHUFFLED: NO
 ---------------------------------

 bench: no memcg, no sheaves
 average (excl. iter 0): 187623118
 bench: no memcg, sheaves
 average (excl. iter 0): 130914624
 sheaves better by 30.2%
 bench: memcg, no sheaves
 average (excl. iter 0): 240239575
 bench: memcg, sheaves
 average (excl. iter 0): 181474462
 sheaves better by 24.4%

 ---------------------------------
 BATCH SIZE: 10 SHUFFLED: YES
 ---------------------------------

 bench: no memcg, no sheaves
 average (excl. iter 0): 115110219
 bench: no memcg, sheaves
 average (excl. iter 0): 100597405
 sheaves better by 12.6%
 bench: memcg, no sheaves
 average (excl. iter 0): 163573377
 bench: memcg, sheaves
 average (excl. iter 0): 144535545
 sheaves better by 11.6%

 ---------------------------------
 BATCH SIZE: 100 SHUFFLED: YES
 ---------------------------------

 bench: no memcg, no sheaves
 average (excl. iter 0): 152457970
 bench: no memcg, sheaves
 average (excl. iter 0): 108720274
 sheaves better by 28.6%
 bench: memcg, no sheaves
 average (excl. iter 0): 203478732
 bench: memcg, sheaves
 average (excl. iter 0): 151241821
 sheaves better by 25.6%

 ---------------------------------
 BATCH SIZE: 1000 SHUFFLED: YES
 ---------------------------------

 bench: no memcg, no sheaves
 average (excl. iter 0): 189950559
 bench: no memcg, sheaves
 average (excl. iter 0): 177934450
 sheaves better by 6.3%
 bench: memcg, no sheaves
 average (excl. iter 0): 242988187
 bench: memcg, sheaves
 average (excl. iter 0): 221609979
 sheaves better by 8.7%

Vlastimil

---
Changes in v8:
- Liam provided a new patch "maple_tree: Drop bulk insert support" as
  the bulk insert was broken in v7 (not used anywhere but broke its
  internal tests). This has changed the context for later patches somewhat.
- Fix issues with "slab: add sheaf support for batching kfree_rcu()
  operations" reported by Ulad:
  - obtaining an empty sheaf from kfree_rcu() is potentially dangerous on
    PREEMPT_RT - disable using rcu_free sheaves there for now
  - kvfree_rcu_barrier() must flush all rcu_free sheaves
  - dropped R-b's from Suren and Harry due to these changes being
    nontrivial
  - Added some R-b's
- Incorporated two patches originally in mm tree to avoid conflicts:
  - "maple_tree: remove redundant __GFP_NOWARN"
  - "tools/testing/vma: clean up stubs in vma_internal.h"
- Link to v7: https://patch.msgid.link/20250903-slub-percpu-caches-v7-0-71c114cdefef@suse.cz

Changes in v7:
- Incorporate Alexei's patch for local_trylock_t to fix
  lockdep_assert_held() on RT, reported by Thorsten.
  Patch: https://lore.kernel.org/all/20250718021646.73353-2-alexei.starovoitov@gmail.com/
- Remove pcs->barn pointer to fix boot failures reported by Venkat:
  https://lore.kernel.org/all/866d7f30-7cde-4c88-87ba-bdad16075433@linux.ibm.com/
  This was because initializing it for all possible cpus can get bogus
  nid values from cpu_to_mem() for non-online cpus.
  Instead introduce get_barn() to obtain it from node, the fast paths
  are not affected anyway. Otherwise cpu hotplug online callback would
  be necessary and I haven't found a cpuhp state that would satisfy all
  the constraints (including being safe to take slab_mutex).
  The draft fix patch from the bug report thread is effectively squashed
  into several of the sheaves patches.
- Incorporate maple tree patches posted separately here:
  https://lore.kernel.org/all/20250901-maple-sheaves-v1-0-d6a1166b53f2@suse.cz/
  Everything is reordered so that vma and maple userspace tests compile
  and pass at every step (fixes a broken vma test report from Lorenzo).
  Add tags from Sid for two of the patches there.
- Link to v6: https://patch.msgid.link/20250827-slub-percpu-caches-v6-0-f0f775a3f73f@suse.cz

Changes in v6:
- Applied feedback and review tags from Suren and Harry (thanks!)
- Separate patch for init_kmem_cache_nodes() error handling change.
- Removed the more involved maple tree conversion to be posted as a
  separate followup series.
- Link to v5: https://patch.msgid.link/20250723-slub-percpu-caches-v5-0-b792cd830f5d@suse.cz

Changes in v5:
- Apply review tags (Harry, Suren) except where changed too much (first
  patch).
- Handle CONFIG_SLUB_TINY by not creating percpu sheaves (Harry)
- Apply review feedback (typos, comments).
- Extract handling sheaf slow paths to separate non-inline functions
  __pcs_handle_empty() and __pcs_handle_full().
- Fix empty sheaf leak in rcu_free_sheaf() (Suren)
- Add "allow NUMA restricted allocations to use percpu sheaves".
- Add Liam's maple tree full sheaf conversion patches for easier
  evaluation.
- Rebase to v6.16-rc1.
- Link to v4: https://patch.msgid.link/20250425-slub-percpu-caches-v4-0-8a636982b4a4@suse.cz

Changes in v4:
- slub_debug disables sheaves for the cache in order to work properly
- strict_numa mode works as intended
- added a separate patch to make freeing remote objects skip sheaves
- various code refactoring suggested by Suren and Harry
- removed less useful stat counters and added missing ones for barn
  and prefilled sheaf events
- Link to v3: https://lore.kernel.org/r/20250317-slub-percpu-caches-v3-0-9d9884d8b643@suse.cz

Changes in v3:
- Squash localtry_lock conversion so it's used immediately.
- Incorporate feedback and add tags from Suren and Harry - thanks!
  - Mostly adding comments and some refactoring.
  - Fixes for kfree_rcu_sheaf() vmalloc handling, cpu hotremove
    flushing.
  - Fix wrong condition in kmem_cache_return_sheaf() that may have
    affected performance negatively.
  - Refactoring of free_to_pcs()
- Link to v2: https://lore.kernel.org/r/20250214-slub-percpu-caches-v2-0-88592ee0966a@suse.cz

Changes in v2:
- Removed kfree_rcu() destructors support as VMAs will not need it
  anymore after [3] is merged.
- Changed to localtry_lock_t borrowed from [2] instead of an own
  implementation of the same idea.
- Many fixes and improvements thanks to Liam's adoption for maple tree
  nodes.
- Userspace Testing stubs by Liam.
- Reduced limitations/todos - hooking to kfree_rcu() is complete,
  prefilled sheaves can exceed cache's sheaf_capacity.
- Link to v1: https://lore.kernel.org/r/20241112-slub-percpu-caches-v1-0-ddc0bdc27e05@suse.cz

---
Alexei Starovoitov (1):
      locking/local_lock: Expose dep_map in local_trylock_t.

Liam R. Howlett (8):
      maple_tree: Drop bulk insert support
      tools/testing/vma: Implement vm_refcnt reset
      tools/testing: Add support for changes to slab for sheaves
      testing/radix-tree/maple: Hack around kfree_rcu not existing
      tools/testing: Add support for prefilled slab sheafs
      maple_tree: Prefilled sheaf conversion and testing
      maple_tree: Add single node allocation support to maple state
      maple_tree: Convert forking to use the sheaf interface

Lorenzo Stoakes (1):
      tools/testing/vma: clean up stubs in vma_internal.h

Pedro Falcato (2):
      maple_tree: Use kfree_rcu in ma_free_rcu
      maple_tree: Replace mt_free_one() with kfree()

Qianfeng Rong (1):
      maple_tree: remove redundant __GFP_NOWARN

Vlastimil Babka (10):
      slab: simplify init_kmem_cache_nodes() error handling
      slab: add opt-in caching layer of percpu sheaves
      slab: add sheaf support for batching kfree_rcu() operations
      slab: sheaf prefilling for guaranteed allocations
      slab: determine barn status racily outside of lock
      slab: skip percpu sheaves for remote object freeing
      slab: allow NUMA restricted allocations to use percpu sheaves
      mm, vma: use percpu sheaves for vm_area_struct cache
      maple_tree: use percpu sheaves for maple_node_cache
      tools/testing: include maple-shim.c in maple.c

 include/linux/local_lock_internal.h |    9 +-
 include/linux/maple_tree.h          |    6 +-
 include/linux/slab.h                |   47 +
 lib/maple_tree.c                    |  667 +++----------
 lib/test_maple_tree.c               |  137 ---
 mm/slab.h                           |    5 +
 mm/slab_common.c                    |   34 +-
 mm/slub.c                           | 1761 +++++++++++++++++++++++++++++++++--
 mm/vma_init.c                       |    1 +
 tools/include/linux/slab.h          |  165 +++-
 tools/testing/radix-tree/maple.c    |  514 +---------
 tools/testing/shared/linux.c        |  120 ++-
 tools/testing/shared/maple-shared.h |   11 +
 tools/testing/shared/maple-shim.c   |    7 +
 tools/testing/vma/vma_internal.h    |  259 ++----
 15 files changed, 2282 insertions(+), 1461 deletions(-)
---
base-commit: 82efd569a8909f2b13140c1b3de88535aea0b051
change-id: 20231128-slub-percpu-caches-9441892011d7

Best regards,
-- 
Vlastimil Babka <vbabka@suse.cz>




* [PATCH v8 01/23] locking/local_lock: Expose dep_map in local_trylock_t.
From: Vlastimil Babka @ 2025-09-10  8:01 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, vbabka,
	Alexei Starovoitov, Sebastian Andrzej Siewior

From: Alexei Starovoitov <ast@kernel.org>

lockdep_is_held() macro assumes that "struct lockdep_map dep_map;"
is a top level field of any lock that participates in LOCKDEP.
Make it so for local_trylock_t.

Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/local_lock_internal.h | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index d80b5306a2c0ccf95a3405b6b947b5f1f9a3bd38..949de37700dbc10feafc06d0b52382cf2e00c694 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -17,7 +17,10 @@ typedef struct {
 
 /* local_trylock() and local_trylock_irqsave() only work with local_trylock_t */
 typedef struct {
-	local_lock_t	llock;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map	dep_map;
+	struct task_struct	*owner;
+#endif
 	u8		acquired;
 } local_trylock_t;
 
@@ -31,7 +34,7 @@ typedef struct {
 	.owner = NULL,
 
 # define LOCAL_TRYLOCK_DEBUG_INIT(lockname)		\
-	.llock = { LOCAL_LOCK_DEBUG_INIT((lockname).llock) },
+	LOCAL_LOCK_DEBUG_INIT(lockname)
 
 static inline void local_lock_acquire(local_lock_t *l)
 {
@@ -81,7 +84,7 @@ do {								\
 	local_lock_debug_init(lock);				\
 } while (0)
 
-#define __local_trylock_init(lock) __local_lock_init(lock.llock)
+#define __local_trylock_init(lock) __local_lock_init((local_lock_t *)lock)
 
 #define __spinlock_nested_bh_init(lock)				\
 do {								\

-- 
2.51.0




* [PATCH v8 02/23] slab: simplify init_kmem_cache_nodes() error handling
From: Vlastimil Babka @ 2025-09-10  8:01 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, vbabka

We don't need to call free_kmem_cache_nodes() immediately when failing
to allocate a kmem_cache_node, because when we return 0,
do_kmem_cache_create() calls __kmem_cache_release() which also performs
free_kmem_cache_nodes().

Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slub.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 30003763d224c2704a4b93082b8b47af12dcffc5..9f671ec76131c4b0b28d5d568aa45842b5efb6d4 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5669,10 +5669,8 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
 		n = kmem_cache_alloc_node(kmem_cache_node,
 						GFP_KERNEL, node);
 
-		if (!n) {
-			free_kmem_cache_nodes(s);
+		if (!n)
 			return 0;
-		}
 
 		init_kmem_cache_node(n);
 		s->node[node] = n;

-- 
2.51.0




* [PATCH v8 03/23] slab: add opt-in caching layer of percpu sheaves
From: Vlastimil Babka @ 2025-09-10  8:01 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, vbabka,
	Venkat Rao Bagalkote

Specifying a non-zero value for a new struct kmem_cache_args field
sheaf_capacity will setup a caching layer of percpu arrays called
sheaves of given capacity for the created cache.

Allocations from the cache will be served from the percpu sheaves (main
or spare) as long as they have no NUMA node preference. Frees will also
put the object back into one of the sheaves.

When both percpu sheaves are found empty during an allocation, an empty
sheaf may be replaced with a full one from the per-node barn. If none
are available and the allocation is allowed to block, an empty sheaf is
refilled from slab(s) by an internal bulk alloc operation. When both
percpu sheaves are full during freeing, the barn can replace a full one
with an empty one, unless it is over the limit of full sheaves. In that case a
sheaf is flushed to slab(s) by an internal bulk free operation. Flushing
sheaves and barns is also wired to the existing cpu flushing and cache
shrinking operations.

The sheaves do not distinguish NUMA locality of the cached objects. If
an allocation is requested with kmem_cache_alloc_node() (or a mempolicy
with strict_numa mode enabled) with a specific node (not NUMA_NO_NODE),
the sheaves are bypassed.

The bulk operations exposed to slab users also try to utilize the
sheaves as long as the necessary (full or empty) sheaves are available
on the cpu or in the barn. Once depleted, they will fall back to bulk
alloc/free to slabs directly to avoid double copying.

The sheaf_capacity value is exported in sysfs for observability.

Sysfs CONFIG_SLUB_STATS counters alloc_cpu_sheaf and free_cpu_sheaf
count objects allocated or freed using the sheaves (and thus not
counting towards the other alloc/free path counters). Counters
sheaf_refill and sheaf_flush count objects filled or flushed from or to
slab pages, and can be used to assess how effective the caching is. The
refill and flush operations will also count towards the usual
alloc_fastpath/slowpath, free_fastpath/slowpath and other counters for
the backing slabs.  For barn operations, barn_get and barn_put count how
many full sheaves were obtained from or put to the barn, and the _fail
variants count how many such requests could not be satisfied, mainly
because the barn was either empty or full. While the barn also holds
empty sheaves to make some operations easier, these are not critical
enough to warrant their own counters.  Finally, there are
sheaf_alloc/sheaf_free counters.

Access to the percpu sheaves is protected by local_trylock() when
potential callers include irq context, and local_lock() otherwise (such
as when we already know the gfp flags allow blocking). The trylock
failures should be rare and we can easily fall back. Each per-NUMA-node
barn has a spin_lock.

When slub_debug is enabled for a cache with sheaf_capacity also
specified, the latter is ignored so that allocations and frees reach the
slow path where debugging hooks are processed. Similarly, we ignore it
with CONFIG_SLUB_TINY which prefers low memory usage to performance.

[boot failure: https://lore.kernel.org/all/583eacf5-c971-451a-9f76-fed0e341b815@linux.ibm.com/ ]

Reported-and-tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/slab.h |   31 ++
 mm/slab.h            |    2 +
 mm/slab_common.c     |    5 +-
 mm/slub.c            | 1164 +++++++++++++++++++++++++++++++++++++++++++++++---
 4 files changed, 1142 insertions(+), 60 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index d5a8ab98035cf3e3d9043e3b038e1bebeff05b52..49acbcdc6696fd120c402adf757b3f41660ad50a 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -335,6 +335,37 @@ struct kmem_cache_args {
 	 * %NULL means no constructor.
 	 */
 	void (*ctor)(void *);
+	/**
+	 * @sheaf_capacity: Enable sheaves of given capacity for the cache.
+	 *
+	 * With a non-zero value, allocations from the cache go through caching
+	 * arrays called sheaves. Each cpu has a main sheaf that's always
+	 * present, and a spare sheaf that may be not present. When both become
+	 * empty, there's an attempt to replace an empty sheaf with a full sheaf
+	 * from the per-node barn.
+	 *
+	 * When no full sheaf is available, and gfp flags allow blocking, a
+	 * sheaf is allocated and filled from slab(s) using bulk allocation.
+	 * Otherwise the allocation falls back to the normal operation
+	 * allocating a single object from a slab.
+	 *
+	 * Analogically when freeing and both percpu sheaves are full, the barn
+	 * may replace it with an empty sheaf, unless it's over capacity. In
+	 * that case a sheaf is bulk freed to slab pages.
+	 *
+	 * The sheaves do not enforce NUMA placement of objects, so allocations
+	 * via kmem_cache_alloc_node() with a node specified other than
+	 * NUMA_NO_NODE will bypass them.
+	 *
+	 * Bulk allocation and free operations also try to use the cpu sheaves
+	 * and barn, but fallback to using slab pages directly.
+	 *
+	 * When slub_debug is enabled for the cache, the sheaf_capacity argument
+	 * is ignored.
+	 *
+	 * %0 means no sheaves will be created.
+	 */
+	unsigned int sheaf_capacity;
 };
 
 struct kmem_cache *__kmem_cache_create_args(const char *name,
diff --git a/mm/slab.h b/mm/slab.h
index 248b34c839b7ca39cf14e139c62d116efb97d30f..206987ce44a4d053ebe3b5e50784d2dd23822cd1 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -235,6 +235,7 @@ struct kmem_cache {
 #ifndef CONFIG_SLUB_TINY
 	struct kmem_cache_cpu __percpu *cpu_slab;
 #endif
+	struct slub_percpu_sheaves __percpu *cpu_sheaves;
 	/* Used for retrieving partial slabs, etc. */
 	slab_flags_t flags;
 	unsigned long min_partial;
@@ -248,6 +249,7 @@ struct kmem_cache {
 	/* Number of per cpu partial slabs to keep around */
 	unsigned int cpu_partial_slabs;
 #endif
+	unsigned int sheaf_capacity;
 	struct kmem_cache_order_objects oo;
 
 	/* Allocation and freeing of slabs */
diff --git a/mm/slab_common.c b/mm/slab_common.c
index bfe7c40eeee1a01c175766935c1e3c0304434a53..e2b197e47866c30acdbd1fee4159f262a751c5a7 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -163,6 +163,9 @@ int slab_unmergeable(struct kmem_cache *s)
 		return 1;
 #endif
 
+	if (s->cpu_sheaves)
+		return 1;
+
 	/*
 	 * We may have set a slab to be unmergeable during bootstrap.
 	 */
@@ -321,7 +324,7 @@ struct kmem_cache *__kmem_cache_create_args(const char *name,
 		    object_size - args->usersize < args->useroffset))
 		args->usersize = args->useroffset = 0;
 
-	if (!args->usersize)
+	if (!args->usersize && !args->sheaf_capacity)
 		s = __kmem_cache_alias(name, object_size, args->align, flags,
 				       args->ctor);
 	if (s)
diff --git a/mm/slub.c b/mm/slub.c
index 9f671ec76131c4b0b28d5d568aa45842b5efb6d4..cba188b7e04ddf86debf9bc27a2f725db1b2056e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -363,8 +363,10 @@ static inline void debugfs_slab_add(struct kmem_cache *s) { }
 #endif
 
 enum stat_item {
+	ALLOC_PCS,		/* Allocation from percpu sheaf */
 	ALLOC_FASTPATH,		/* Allocation from cpu slab */
 	ALLOC_SLOWPATH,		/* Allocation by getting a new cpu slab */
+	FREE_PCS,		/* Free to percpu sheaf */
 	FREE_FASTPATH,		/* Free to cpu slab */
 	FREE_SLOWPATH,		/* Freeing not to cpu slab */
 	FREE_FROZEN,		/* Freeing to frozen slab */
@@ -389,6 +391,14 @@ enum stat_item {
 	CPU_PARTIAL_FREE,	/* Refill cpu partial on free */
 	CPU_PARTIAL_NODE,	/* Refill cpu partial from node partial */
 	CPU_PARTIAL_DRAIN,	/* Drain cpu partial to node partial */
+	SHEAF_FLUSH,		/* Objects flushed from a sheaf */
+	SHEAF_REFILL,		/* Objects refilled to a sheaf */
+	SHEAF_ALLOC,		/* Allocation of an empty sheaf */
+	SHEAF_FREE,		/* Freeing of an empty sheaf */
+	BARN_GET,		/* Got full sheaf from barn */
+	BARN_GET_FAIL,		/* Failed to get full sheaf from barn */
+	BARN_PUT,		/* Put full sheaf to barn */
+	BARN_PUT_FAIL,		/* Failed to put full sheaf to barn */
 	NR_SLUB_STAT_ITEMS
 };
 
@@ -435,6 +445,32 @@ void stat_add(const struct kmem_cache *s, enum stat_item si, int v)
 #endif
 }
 
+#define MAX_FULL_SHEAVES	10
+#define MAX_EMPTY_SHEAVES	10
+
+struct node_barn {
+	spinlock_t lock;
+	struct list_head sheaves_full;
+	struct list_head sheaves_empty;
+	unsigned int nr_full;
+	unsigned int nr_empty;
+};
+
+struct slab_sheaf {
+	union {
+		struct rcu_head rcu_head;
+		struct list_head barn_list;
+	};
+	unsigned int size;
+	void *objects[];
+};
+
+struct slub_percpu_sheaves {
+	local_trylock_t lock;
+	struct slab_sheaf *main; /* never NULL when unlocked */
+	struct slab_sheaf *spare; /* empty or full, may be NULL */
+};
+
 /*
  * The slab lists for all objects.
  */
@@ -447,6 +483,7 @@ struct kmem_cache_node {
 	atomic_long_t total_objects;
 	struct list_head full;
 #endif
+	struct node_barn *barn;
 };
 
 static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
@@ -454,6 +491,12 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
 	return s->node[node];
 }
 
+/* Get the barn of the current cpu's memory node */
+static inline struct node_barn *get_barn(struct kmem_cache *s)
+{
+	return get_node(s, numa_mem_id())->barn;
+}
+
 /*
  * Iterator over all nodes. The body will be executed for each node that has
  * a kmem_cache_node structure allocated (which is true for all online nodes)
@@ -470,12 +513,19 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
  */
 static nodemask_t slab_nodes;
 
-#ifndef CONFIG_SLUB_TINY
 /*
  * Workqueue used for flush_cpu_slab().
  */
 static struct workqueue_struct *flushwq;
-#endif
+
+struct slub_flush_work {
+	struct work_struct work;
+	struct kmem_cache *s;
+	bool skip;
+};
+
+static DEFINE_MUTEX(flush_lock);
+static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
 
 /********************************************************************
  * 			Core slab cache functions
@@ -2473,6 +2523,360 @@ static void *setup_object(struct kmem_cache *s, void *object)
 	return object;
 }
 
+static struct slab_sheaf *alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp)
+{
+	struct slab_sheaf *sheaf = kzalloc(struct_size(sheaf, objects,
+					s->sheaf_capacity), gfp);
+
+	if (unlikely(!sheaf))
+		return NULL;
+
+	stat(s, SHEAF_ALLOC);
+
+	return sheaf;
+}
+
+static void free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf)
+{
+	kfree(sheaf);
+
+	stat(s, SHEAF_FREE);
+}
+
+static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
+				   size_t size, void **p);
+
+
+static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
+			 gfp_t gfp)
+{
+	int to_fill = s->sheaf_capacity - sheaf->size;
+	int filled;
+
+	if (!to_fill)
+		return 0;
+
+	filled = __kmem_cache_alloc_bulk(s, gfp, to_fill,
+					 &sheaf->objects[sheaf->size]);
+
+	sheaf->size += filled;
+
+	stat_add(s, SHEAF_REFILL, filled);
+
+	if (filled < to_fill)
+		return -ENOMEM;
+
+	return 0;
+}
+
+
+static struct slab_sheaf *alloc_full_sheaf(struct kmem_cache *s, gfp_t gfp)
+{
+	struct slab_sheaf *sheaf = alloc_empty_sheaf(s, gfp);
+
+	if (!sheaf)
+		return NULL;
+
+	if (refill_sheaf(s, sheaf, gfp)) {
+		free_empty_sheaf(s, sheaf);
+		return NULL;
+	}
+
+	return sheaf;
+}
+
+/*
+ * Maximum number of objects freed during a single flush of main pcs sheaf.
+ * Translates directly to an on-stack array size.
+ */
+#define PCS_BATCH_MAX	32U
+
+static void __kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);
+
+/*
+ * Free all objects from the main sheaf. In order to perform
+ * __kmem_cache_free_bulk() outside of cpu_sheaves->lock, work in batches where
+ * object pointers are moved to a on-stack array under the lock. To bound the
+ * stack usage, limit each batch to PCS_BATCH_MAX.
+ *
+ * returns true if at least partially flushed
+ */
+static bool sheaf_flush_main(struct kmem_cache *s)
+{
+	struct slub_percpu_sheaves *pcs;
+	unsigned int batch, remaining;
+	void *objects[PCS_BATCH_MAX];
+	struct slab_sheaf *sheaf;
+	bool ret = false;
+
+next_batch:
+	if (!local_trylock(&s->cpu_sheaves->lock))
+		return ret;
+
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+	sheaf = pcs->main;
+
+	batch = min(PCS_BATCH_MAX, sheaf->size);
+
+	sheaf->size -= batch;
+	memcpy(objects, sheaf->objects + sheaf->size, batch * sizeof(void *));
+
+	remaining = sheaf->size;
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+	__kmem_cache_free_bulk(s, batch, &objects[0]);
+
+	stat_add(s, SHEAF_FLUSH, batch);
+
+	ret = true;
+
+	if (remaining)
+		goto next_batch;
+
+	return ret;
+}
+
+/*
+ * Free all objects from a sheaf that's unused, i.e. not linked to any
+ * cpu_sheaves, so we need no locking and batching. The locking is also not
+ * necessary when flushing cpu's sheaves (both spare and main) during cpu
+ * hotremove as the cpu is not executing anymore.
+ */
+static void sheaf_flush_unused(struct kmem_cache *s, struct slab_sheaf *sheaf)
+{
+	if (!sheaf->size)
+		return;
+
+	stat_add(s, SHEAF_FLUSH, sheaf->size);
+
+	__kmem_cache_free_bulk(s, sheaf->size, &sheaf->objects[0]);
+
+	sheaf->size = 0;
+}
+
+/*
+ * Caller needs to make sure migration is disabled in order to fully flush
+ * single cpu's sheaves
+ *
+ * must not be called from an irq
+ *
+ * flushing operations are rare so let's keep it simple and flush to slabs
+ * directly, skipping the barn
+ */
+static void pcs_flush_all(struct kmem_cache *s)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct slab_sheaf *spare;
+
+	local_lock(&s->cpu_sheaves->lock);
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	spare = pcs->spare;
+	pcs->spare = NULL;
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+	if (spare) {
+		sheaf_flush_unused(s, spare);
+		free_empty_sheaf(s, spare);
+	}
+
+	sheaf_flush_main(s);
+}
+
+static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
+{
+	struct slub_percpu_sheaves *pcs;
+
+	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
+
+	/* The cpu is not executing anymore so we don't need pcs->lock */
+	sheaf_flush_unused(s, pcs->main);
+	if (pcs->spare) {
+		sheaf_flush_unused(s, pcs->spare);
+		free_empty_sheaf(s, pcs->spare);
+		pcs->spare = NULL;
+	}
+}
+
+static void pcs_destroy(struct kmem_cache *s)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct slub_percpu_sheaves *pcs;
+
+		pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
+
+		/* can happen when unwinding failed create */
+		if (!pcs->main)
+			continue;
+
+		/*
+		 * We have already passed __kmem_cache_shutdown() so everything
+		 * was flushed and there should be no objects allocated from
+		 * slabs, otherwise kmem_cache_destroy() would have aborted.
+		 * Therefore something would have to be really wrong if the
+		 * warnings here trigger, and we should rather leave objects and
+		 * sheaves to leak in that case.
+		 */
+
+		WARN_ON(pcs->spare);
+
+		if (!WARN_ON(pcs->main->size)) {
+			free_empty_sheaf(s, pcs->main);
+			pcs->main = NULL;
+		}
+	}
+
+	free_percpu(s->cpu_sheaves);
+	s->cpu_sheaves = NULL;
+}
+
+static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn)
+{
+	struct slab_sheaf *empty = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	if (barn->nr_empty) {
+		empty = list_first_entry(&barn->sheaves_empty,
+					 struct slab_sheaf, barn_list);
+		list_del(&empty->barn_list);
+		barn->nr_empty--;
+	}
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+
+	return empty;
+}
+
+/*
+ * The following two functions are used mainly in cases where we have to undo an
+ * intended action due to a race or cpu migration. Thus they do not check the
+ * empty or full sheaf limits for simplicity.
+ */
+
+static void barn_put_empty_sheaf(struct node_barn *barn, struct slab_sheaf *sheaf)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	list_add(&sheaf->barn_list, &barn->sheaves_empty);
+	barn->nr_empty++;
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+}
+
+static void barn_put_full_sheaf(struct node_barn *barn, struct slab_sheaf *sheaf)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	list_add(&sheaf->barn_list, &barn->sheaves_full);
+	barn->nr_full++;
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+}
+
+/*
+ * If a full sheaf is available, return it and put the supplied empty one to
+ * barn. We ignore the limit on empty sheaves as the number of sheaves doesn't
+ * change.
+ */
+static struct slab_sheaf *
+barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty)
+{
+	struct slab_sheaf *full = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	if (barn->nr_full) {
+		full = list_first_entry(&barn->sheaves_full, struct slab_sheaf,
+					barn_list);
+		list_del(&full->barn_list);
+		list_add(&empty->barn_list, &barn->sheaves_empty);
+		barn->nr_full--;
+		barn->nr_empty++;
+	}
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+
+	return full;
+}
+
+/*
+ * If an empty sheaf is available, return it and put the supplied full one to
+ * barn. But if there are too many full sheaves, reject this with -E2BIG.
+ */
+static struct slab_sheaf *
+barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
+{
+	struct slab_sheaf *empty;
+	unsigned long flags;
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	if (barn->nr_full >= MAX_FULL_SHEAVES) {
+		empty = ERR_PTR(-E2BIG);
+	} else if (!barn->nr_empty) {
+		empty = ERR_PTR(-ENOMEM);
+	} else {
+		empty = list_first_entry(&barn->sheaves_empty, struct slab_sheaf,
+					 barn_list);
+		list_del(&empty->barn_list);
+		list_add(&full->barn_list, &barn->sheaves_full);
+		barn->nr_empty--;
+		barn->nr_full++;
+	}
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+
+	return empty;
+}
+
+static void barn_init(struct node_barn *barn)
+{
+	spin_lock_init(&barn->lock);
+	INIT_LIST_HEAD(&barn->sheaves_full);
+	INIT_LIST_HEAD(&barn->sheaves_empty);
+	barn->nr_full = 0;
+	barn->nr_empty = 0;
+}
+
+static void barn_shrink(struct kmem_cache *s, struct node_barn *barn)
+{
+	struct list_head empty_list;
+	struct list_head full_list;
+	struct slab_sheaf *sheaf, *sheaf2;
+	unsigned long flags;
+
+	INIT_LIST_HEAD(&empty_list);
+	INIT_LIST_HEAD(&full_list);
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	list_splice_init(&barn->sheaves_full, &full_list);
+	barn->nr_full = 0;
+	list_splice_init(&barn->sheaves_empty, &empty_list);
+	barn->nr_empty = 0;
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+
+	list_for_each_entry_safe(sheaf, sheaf2, &full_list, barn_list) {
+		sheaf_flush_unused(s, sheaf);
+		free_empty_sheaf(s, sheaf);
+	}
+
+	list_for_each_entry_safe(sheaf, sheaf2, &empty_list, barn_list)
+		free_empty_sheaf(s, sheaf);
+}
+
 /*
  * Slab allocation and freeing
  */
@@ -3344,11 +3748,40 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
 	put_partials_cpu(s, c);
 }
 
-struct slub_flush_work {
-	struct work_struct work;
-	struct kmem_cache *s;
-	bool skip;
-};
+static inline void flush_this_cpu_slab(struct kmem_cache *s)
+{
+	struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
+
+	if (c->slab)
+		flush_slab(s, c);
+
+	put_partials(s);
+}
+
+static bool has_cpu_slab(int cpu, struct kmem_cache *s)
+{
+	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
+
+	return c->slab || slub_percpu_partial(c);
+}
+
+#else /* CONFIG_SLUB_TINY */
+static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu) { }
+static inline bool has_cpu_slab(int cpu, struct kmem_cache *s) { return false; }
+static inline void flush_this_cpu_slab(struct kmem_cache *s) { }
+#endif /* CONFIG_SLUB_TINY */
+
+static bool has_pcs_used(int cpu, struct kmem_cache *s)
+{
+	struct slub_percpu_sheaves *pcs;
+
+	if (!s->cpu_sheaves)
+		return false;
+
+	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
+
+	return (pcs->spare || pcs->main->size);
+}
 
 /*
  * Flush cpu slab.
@@ -3358,30 +3791,18 @@ struct slub_flush_work {
 static void flush_cpu_slab(struct work_struct *w)
 {
 	struct kmem_cache *s;
-	struct kmem_cache_cpu *c;
 	struct slub_flush_work *sfw;
 
 	sfw = container_of(w, struct slub_flush_work, work);
 
 	s = sfw->s;
-	c = this_cpu_ptr(s->cpu_slab);
-
-	if (c->slab)
-		flush_slab(s, c);
-
-	put_partials(s);
-}
 
-static bool has_cpu_slab(int cpu, struct kmem_cache *s)
-{
-	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
+	if (s->cpu_sheaves)
+		pcs_flush_all(s);
 
-	return c->slab || slub_percpu_partial(c);
+	flush_this_cpu_slab(s);
 }
 
-static DEFINE_MUTEX(flush_lock);
-static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
-
 static void flush_all_cpus_locked(struct kmem_cache *s)
 {
 	struct slub_flush_work *sfw;
@@ -3392,7 +3813,7 @@ static void flush_all_cpus_locked(struct kmem_cache *s)
 
 	for_each_online_cpu(cpu) {
 		sfw = &per_cpu(slub_flush, cpu);
-		if (!has_cpu_slab(cpu, s)) {
+		if (!has_cpu_slab(cpu, s) && !has_pcs_used(cpu, s)) {
 			sfw->skip = true;
 			continue;
 		}
@@ -3428,19 +3849,15 @@ static int slub_cpu_dead(unsigned int cpu)
 	struct kmem_cache *s;
 
 	mutex_lock(&slab_mutex);
-	list_for_each_entry(s, &slab_caches, list)
+	list_for_each_entry(s, &slab_caches, list) {
 		__flush_cpu_slab(s, cpu);
+		if (s->cpu_sheaves)
+			__pcs_flush_all_cpu(s, cpu);
+	}
 	mutex_unlock(&slab_mutex);
 	return 0;
 }
 
-#else /* CONFIG_SLUB_TINY */
-static inline void flush_all_cpus_locked(struct kmem_cache *s) { }
-static inline void flush_all(struct kmem_cache *s) { }
-static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu) { }
-static inline int slub_cpu_dead(unsigned int cpu) { return 0; }
-#endif /* CONFIG_SLUB_TINY */
-
 /*
  * Check if the objects in a per cpu structure fit numa
  * locality expectations.
@@ -4191,30 +4608,240 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 }
 
 /*
- * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
- * have the fastpath folded into their functions. So no function call
- * overhead for requests that can be satisfied on the fastpath.
- *
- * The fastpath works by first checking if the lockless freelist can be used.
- * If not then __slab_alloc is called for slow processing.
+ * Replace the empty main sheaf with a (at least partially) full sheaf.
  *
- * Otherwise we can simply pick the next object from the lockless free list.
+ * Must be called with the cpu_sheaves local lock locked. If successful, returns
+ * the pcs pointer and the local lock locked (possibly on a different cpu than
+ * initially called). If not successful, returns NULL and the local lock
+ * unlocked.
  */
-static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list_lru *lru,
-		gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
+static struct slub_percpu_sheaves *
+__pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs, gfp_t gfp)
 {
-	void *object;
-	bool init = false;
+	struct slab_sheaf *empty = NULL;
+	struct slab_sheaf *full;
+	struct node_barn *barn;
+	bool can_alloc;
 
-	s = slab_pre_alloc_hook(s, gfpflags);
-	if (unlikely(!s))
-		return NULL;
+	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
+
+	if (pcs->spare && pcs->spare->size > 0) {
+		swap(pcs->main, pcs->spare);
+		return pcs;
+	}
+
+	barn = get_barn(s);
+
+	full = barn_replace_empty_sheaf(barn, pcs->main);
+
+	if (full) {
+		stat(s, BARN_GET);
+		pcs->main = full;
+		return pcs;
+	}
+
+	stat(s, BARN_GET_FAIL);
+
+	can_alloc = gfpflags_allow_blocking(gfp);
+
+	if (can_alloc) {
+		if (pcs->spare) {
+			empty = pcs->spare;
+			pcs->spare = NULL;
+		} else {
+			empty = barn_get_empty_sheaf(barn);
+		}
+	}
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+	if (!can_alloc)
+		return NULL;
+
+	if (empty) {
+		if (!refill_sheaf(s, empty, gfp)) {
+			full = empty;
+		} else {
+			/*
+			 * we must be very low on memory so don't bother
+			 * with the barn
+			 */
+			free_empty_sheaf(s, empty);
+		}
+	} else {
+		full = alloc_full_sheaf(s, gfp);
+	}
+
+	if (!full)
+		return NULL;
+
+	/*
+	 * we can reach here only when gfpflags_allow_blocking
+	 * so this must not be an irq
+	 */
+	local_lock(&s->cpu_sheaves->lock);
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	/*
+	 * If we are returning empty sheaf, we either got it from the
+	 * barn or had to allocate one. If we are returning a full
+	 * sheaf, it's due to racing or being migrated to a different
+	 * cpu. Breaching the barn's sheaf limits should be thus rare
+	 * enough so just ignore them to simplify the recovery.
+	 */
+
+	if (pcs->main->size == 0) {
+		barn_put_empty_sheaf(barn, pcs->main);
+		pcs->main = full;
+		return pcs;
+	}
+
+	if (!pcs->spare) {
+		pcs->spare = full;
+		return pcs;
+	}
+
+	if (pcs->spare->size == 0) {
+		barn_put_empty_sheaf(barn, pcs->spare);
+		pcs->spare = full;
+		return pcs;
+	}
+
+	barn_put_full_sheaf(barn, full);
+	stat(s, BARN_PUT);
+
+	return pcs;
+}
+
+static __fastpath_inline
+void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
+{
+	struct slub_percpu_sheaves *pcs;
+	void *object;
+
+#ifdef CONFIG_NUMA
+	if (static_branch_unlikely(&strict_numa)) {
+		if (current->mempolicy)
+			return NULL;
+	}
+#endif
+
+	if (!local_trylock(&s->cpu_sheaves->lock))
+		return NULL;
+
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (unlikely(pcs->main->size == 0)) {
+		pcs = __pcs_replace_empty_main(s, pcs, gfp);
+		if (unlikely(!pcs))
+			return NULL;
+	}
+
+	object = pcs->main->objects[--pcs->main->size];
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+	stat(s, ALLOC_PCS);
+
+	return object;
+}
+
+static __fastpath_inline
+unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct slab_sheaf *main;
+	unsigned int allocated = 0;
+	unsigned int batch;
+
+next_batch:
+	if (!local_trylock(&s->cpu_sheaves->lock))
+		return allocated;
+
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (unlikely(pcs->main->size == 0)) {
+
+		struct slab_sheaf *full;
+
+		if (pcs->spare && pcs->spare->size > 0) {
+			swap(pcs->main, pcs->spare);
+			goto do_alloc;
+		}
+
+		full = barn_replace_empty_sheaf(get_barn(s), pcs->main);
+
+		if (full) {
+			stat(s, BARN_GET);
+			pcs->main = full;
+			goto do_alloc;
+		}
+
+		stat(s, BARN_GET_FAIL);
+
+		local_unlock(&s->cpu_sheaves->lock);
+
+		/*
+		 * Once full sheaves in barn are depleted, let the bulk
+		 * allocation continue from slab pages, otherwise we would just
+		 * be copying arrays of pointers twice.
+		 */
+		return allocated;
+	}
+
+do_alloc:
+
+	main = pcs->main;
+	batch = min(size, main->size);
+
+	main->size -= batch;
+	memcpy(p, main->objects + main->size, batch * sizeof(void *));
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+	stat_add(s, ALLOC_PCS, batch);
+
+	allocated += batch;
+
+	if (batch < size) {
+		p += batch;
+		size -= batch;
+		goto next_batch;
+	}
+
+	return allocated;
+}
+
+
+/*
+ * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
+ * have the fastpath folded into their functions. So no function call
+ * overhead for requests that can be satisfied on the fastpath.
+ *
+ * The fastpath works by first checking if the lockless freelist can be used.
+ * If not then __slab_alloc is called for slow processing.
+ *
+ * Otherwise we can simply pick the next object from the lockless free list.
+ */
+static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list_lru *lru,
+		gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
+{
+	void *object;
+	bool init = false;
+
+	s = slab_pre_alloc_hook(s, gfpflags);
+	if (unlikely(!s))
+		return NULL;
 
 	object = kfence_alloc(s, orig_size, gfpflags);
 	if (unlikely(object))
 		goto out;
 
-	object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
+	if (s->cpu_sheaves && node == NUMA_NO_NODE)
+		object = alloc_from_pcs(s, gfpflags);
+
+	if (!object)
+		object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
 
 	maybe_wipe_obj_freeptr(s, object);
 	init = slab_want_init_on_alloc(gfpflags, s);
@@ -4591,6 +5218,295 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
 	discard_slab(s, slab);
 }
 
+/*
+ * pcs is locked. We should have get rid of the spare sheaf and obtained an
+ * empty sheaf, while the main sheaf is full. We want to install the empty sheaf
+ * as a main sheaf, and make the current main sheaf a spare sheaf.
+ *
+ * However due to having relinquished the cpu_sheaves lock when obtaining
+ * the empty sheaf, we need to handle some unlikely but possible cases.
+ *
+ * If we put any sheaf to barn here, it's because we were interrupted or have
+ * been migrated to a different cpu, which should be rare enough so just ignore
+ * the barn's limits to simplify the handling.
+ *
+ * An alternative scenario that gets us here is when we fail
+ * barn_replace_full_sheaf(), because there's no empty sheaf available in the
+ * barn, so we had to allocate it by alloc_empty_sheaf(). But because we saw the
+ * limit on full sheaves was not exceeded, we assume it didn't change and just
+ * put the full sheaf there.
+ */
+static void __pcs_install_empty_sheaf(struct kmem_cache *s,
+		struct slub_percpu_sheaves *pcs, struct slab_sheaf *empty)
+{
+	struct node_barn *barn;
+
+	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
+
+	/* This is what we expect to find if nobody interrupted us. */
+	if (likely(!pcs->spare)) {
+		pcs->spare = pcs->main;
+		pcs->main = empty;
+		return;
+	}
+
+	barn = get_barn(s);
+
+	/*
+	 * Unlikely because if the main sheaf had space, we would have just
+	 * freed to it. Get rid of our empty sheaf.
+	 */
+	if (pcs->main->size < s->sheaf_capacity) {
+		barn_put_empty_sheaf(barn, empty);
+		return;
+	}
+
+	/* Also unlikely for the same reason */
+	if (pcs->spare->size < s->sheaf_capacity) {
+		swap(pcs->main, pcs->spare);
+		barn_put_empty_sheaf(barn, empty);
+		return;
+	}
+
+	/*
+	 * We probably failed barn_replace_full_sheaf() due to no empty sheaf
+	 * available there, but we allocated one, so finish the job.
+	 */
+	barn_put_full_sheaf(barn, pcs->main);
+	stat(s, BARN_PUT);
+	pcs->main = empty;
+}
+
+/*
+ * Replace the full main sheaf with a (at least partially) empty sheaf.
+ *
+ * Must be called with the cpu_sheaves local lock locked. If successful, returns
+ * the pcs pointer and the local lock locked (possibly on a different cpu than
+ * initially called). If not successful, returns NULL and the local lock
+ * unlocked.
+ */
+static struct slub_percpu_sheaves *
+__pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
+{
+	struct slab_sheaf *empty;
+	struct node_barn *barn;
+	bool put_fail;
+
+restart:
+	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
+
+	barn = get_barn(s);
+	put_fail = false;
+
+	if (!pcs->spare) {
+		empty = barn_get_empty_sheaf(barn);
+		if (empty) {
+			pcs->spare = pcs->main;
+			pcs->main = empty;
+			return pcs;
+		}
+		goto alloc_empty;
+	}
+
+	if (pcs->spare->size < s->sheaf_capacity) {
+		swap(pcs->main, pcs->spare);
+		return pcs;
+	}
+
+	empty = barn_replace_full_sheaf(barn, pcs->main);
+
+	if (!IS_ERR(empty)) {
+		stat(s, BARN_PUT);
+		pcs->main = empty;
+		return pcs;
+	}
+
+	if (PTR_ERR(empty) == -E2BIG) {
+		/* Since we got here, spare exists and is full */
+		struct slab_sheaf *to_flush = pcs->spare;
+
+		stat(s, BARN_PUT_FAIL);
+
+		pcs->spare = NULL;
+		local_unlock(&s->cpu_sheaves->lock);
+
+		sheaf_flush_unused(s, to_flush);
+		empty = to_flush;
+		goto got_empty;
+	}
+
+	/*
+	 * We could not replace full sheaf because barn had no empty
+	 * sheaves. We can still allocate it and put the full sheaf in
+	 * __pcs_install_empty_sheaf(), but if we fail to allocate it,
+	 * make sure to count the fail.
+	 */
+	put_fail = true;
+
+alloc_empty:
+	local_unlock(&s->cpu_sheaves->lock);
+
+	empty = alloc_empty_sheaf(s, GFP_NOWAIT);
+	if (empty)
+		goto got_empty;
+
+	if (put_fail)
+		 stat(s, BARN_PUT_FAIL);
+
+	if (!sheaf_flush_main(s))
+		return NULL;
+
+	if (!local_trylock(&s->cpu_sheaves->lock))
+		return NULL;
+
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	/*
+	 * we flushed the main sheaf so it should be empty now,
+	 * but in case we got preempted or migrated, we need to
+	 * check again
+	 */
+	if (pcs->main->size == s->sheaf_capacity)
+		goto restart;
+
+	return pcs;
+
+got_empty:
+	if (!local_trylock(&s->cpu_sheaves->lock)) {
+		barn_put_empty_sheaf(barn, empty);
+		return NULL;
+	}
+
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+	__pcs_install_empty_sheaf(s, pcs, empty);
+
+	return pcs;
+}
+
+/*
+ * Free an object to the percpu sheaves.
+ * The object is expected to have passed slab_free_hook() already.
+ */
+static __fastpath_inline
+bool free_to_pcs(struct kmem_cache *s, void *object)
+{
+	struct slub_percpu_sheaves *pcs;
+
+	if (!local_trylock(&s->cpu_sheaves->lock))
+		return false;
+
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (unlikely(pcs->main->size == s->sheaf_capacity)) {
+
+		pcs = __pcs_replace_full_main(s, pcs);
+		if (unlikely(!pcs))
+			return false;
+	}
+
+	pcs->main->objects[pcs->main->size++] = object;
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+	stat(s, FREE_PCS);
+
+	return true;
+}
+
+/*
+ * Bulk free objects to the percpu sheaves.
+ * Unlike free_to_pcs() this includes the calls to all necessary hooks
+ * and the fallback to freeing to slab pages.
+ */
+static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct slab_sheaf *main, *empty;
+	bool init = slab_want_init_on_free(s);
+	unsigned int batch, i = 0;
+	struct node_barn *barn;
+
+	while (i < size) {
+		struct slab *slab = virt_to_slab(p[i]);
+
+		memcg_slab_free_hook(s, slab, p + i, 1);
+		alloc_tagging_slab_free_hook(s, slab, p + i, 1);
+
+		if (unlikely(!slab_free_hook(s, p[i], init, false))) {
+			p[i] = p[--size];
+			if (!size)
+				return;
+			continue;
+		}
+
+		i++;
+	}
+
+next_batch:
+	if (!local_trylock(&s->cpu_sheaves->lock))
+		goto fallback;
+
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (likely(pcs->main->size < s->sheaf_capacity))
+		goto do_free;
+
+	barn = get_barn(s);
+
+	if (!pcs->spare) {
+		empty = barn_get_empty_sheaf(barn);
+		if (!empty)
+			goto no_empty;
+
+		pcs->spare = pcs->main;
+		pcs->main = empty;
+		goto do_free;
+	}
+
+	if (pcs->spare->size < s->sheaf_capacity) {
+		swap(pcs->main, pcs->spare);
+		goto do_free;
+	}
+
+	empty = barn_replace_full_sheaf(barn, pcs->main);
+	if (IS_ERR(empty)) {
+		stat(s, BARN_PUT_FAIL);
+		goto no_empty;
+	}
+
+	stat(s, BARN_PUT);
+	pcs->main = empty;
+
+do_free:
+	main = pcs->main;
+	batch = min(size, s->sheaf_capacity - main->size);
+
+	memcpy(main->objects + main->size, p, batch * sizeof(void *));
+	main->size += batch;
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+	stat_add(s, FREE_PCS, batch);
+
+	if (batch < size) {
+		p += batch;
+		size -= batch;
+		goto next_batch;
+	}
+
+	return;
+
+no_empty:
+	local_unlock(&s->cpu_sheaves->lock);
+
+	/*
+	 * if we depleted all empty sheaves in the barn or there are too
+	 * many full sheaves, free the rest to slab pages
+	 */
+fallback:
+	__kmem_cache_free_bulk(s, size, p);
+}
+
 #ifndef CONFIG_SLUB_TINY
 /*
  * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
@@ -4677,7 +5593,10 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
 	memcg_slab_free_hook(s, slab, &object, 1);
 	alloc_tagging_slab_free_hook(s, slab, &object, 1);
 
-	if (likely(slab_free_hook(s, object, slab_want_init_on_free(s), false)))
+	if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
+		return;
+
+	if (!s->cpu_sheaves || !free_to_pcs(s, object))
 		do_slab_free(s, slab, object, object, 1, addr);
 }
 
@@ -5273,6 +6192,15 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 	if (!size)
 		return;
 
+	/*
+	 * freeing to sheaves is incompatible with the detached freelist, so
+	 * once we go that way, we have to do everything differently
+	 */
+	if (s && s->cpu_sheaves) {
+		free_to_pcs_bulk(s, size, p);
+		return;
+	}
+
 	do {
 		struct detached_freelist df;
 
@@ -5391,7 +6319,7 @@ static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
 int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
 				 void **p)
 {
-	int i;
+	unsigned int i = 0;
 
 	if (!size)
 		return 0;
@@ -5400,9 +6328,20 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
 	if (unlikely(!s))
 		return 0;
 
-	i = __kmem_cache_alloc_bulk(s, flags, size, p);
-	if (unlikely(i == 0))
-		return 0;
+	if (s->cpu_sheaves)
+		i = alloc_from_pcs_bulk(s, size, p);
+
+	if (i < size) {
+		/*
+		 * If we ran out of memory, don't bother with freeing back to
+		 * the percpu sheaves, we have bigger problems.
+		 */
+		if (unlikely(__kmem_cache_alloc_bulk(s, flags, size - i, p + i) == 0)) {
+			if (i > 0)
+				__kmem_cache_free_bulk(s, i, p);
+			return 0;
+		}
+	}
 
 	/*
 	 * memcg and kmem_cache debug support and memory initialization.
@@ -5412,11 +6351,11 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
 		    slab_want_init_on_alloc(flags, s), s->object_size))) {
 		return 0;
 	}
-	return i;
+
+	return size;
 }
 EXPORT_SYMBOL(kmem_cache_alloc_bulk_noprof);
 
-
 /*
  * Object placement in a slab is made very easy because we always start at
  * offset 0. If we tune the size of the object to the alignment then we can
@@ -5550,7 +6489,7 @@ static inline int calculate_order(unsigned int size)
 }
 
 static void
-init_kmem_cache_node(struct kmem_cache_node *n)
+init_kmem_cache_node(struct kmem_cache_node *n, struct node_barn *barn)
 {
 	n->nr_partial = 0;
 	spin_lock_init(&n->list_lock);
@@ -5560,6 +6499,9 @@ init_kmem_cache_node(struct kmem_cache_node *n)
 	atomic_long_set(&n->total_objects, 0);
 	INIT_LIST_HEAD(&n->full);
 #endif
+	n->barn = barn;
+	if (barn)
+		barn_init(barn);
 }
 
 #ifndef CONFIG_SLUB_TINY
@@ -5590,6 +6532,26 @@ static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
 }
 #endif /* CONFIG_SLUB_TINY */
 
+static int init_percpu_sheaves(struct kmem_cache *s)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct slub_percpu_sheaves *pcs;
+
+		pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
+
+		local_trylock_init(&pcs->lock);
+
+		pcs->main = alloc_empty_sheaf(s, GFP_KERNEL);
+
+		if (!pcs->main)
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
 static struct kmem_cache *kmem_cache_node;
 
 /*
@@ -5625,7 +6587,7 @@ static void early_kmem_cache_node_alloc(int node)
 	slab->freelist = get_freepointer(kmem_cache_node, n);
 	slab->inuse = 1;
 	kmem_cache_node->node[node] = n;
-	init_kmem_cache_node(n);
+	init_kmem_cache_node(n, NULL);
 	inc_slabs_node(kmem_cache_node, node, slab->objects);
 
 	/*
@@ -5641,6 +6603,13 @@ static void free_kmem_cache_nodes(struct kmem_cache *s)
 	struct kmem_cache_node *n;
 
 	for_each_kmem_cache_node(s, node, n) {
+		if (n->barn) {
+			WARN_ON(n->barn->nr_full);
+			WARN_ON(n->barn->nr_empty);
+			kfree(n->barn);
+			n->barn = NULL;
+		}
+
 		s->node[node] = NULL;
 		kmem_cache_free(kmem_cache_node, n);
 	}
@@ -5649,6 +6618,8 @@ static void free_kmem_cache_nodes(struct kmem_cache *s)
 void __kmem_cache_release(struct kmem_cache *s)
 {
 	cache_random_seq_destroy(s);
+	if (s->cpu_sheaves)
+		pcs_destroy(s);
 #ifndef CONFIG_SLUB_TINY
 	free_percpu(s->cpu_slab);
 #endif
@@ -5661,18 +6632,29 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
 
 	for_each_node_mask(node, slab_nodes) {
 		struct kmem_cache_node *n;
+		struct node_barn *barn = NULL;
 
 		if (slab_state == DOWN) {
 			early_kmem_cache_node_alloc(node);
 			continue;
 		}
+
+		if (s->cpu_sheaves) {
+			barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, node);
+
+			if (!barn)
+				return 0;
+		}
+
 		n = kmem_cache_alloc_node(kmem_cache_node,
 						GFP_KERNEL, node);
-
-		if (!n)
+		if (!n) {
+			kfree(barn);
 			return 0;
+		}
+
+		init_kmem_cache_node(n, barn);
 
-		init_kmem_cache_node(n);
 		s->node[node] = n;
 	}
 	return 1;
@@ -5929,6 +6911,8 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
 	flush_all_cpus_locked(s);
 	/* Attempt to free all objects */
 	for_each_kmem_cache_node(s, node, n) {
+		if (n->barn)
+			barn_shrink(s, n->barn);
 		free_partial(s, n);
 		if (n->nr_partial || node_nr_slabs(n))
 			return 1;
@@ -6132,6 +7116,9 @@ static int __kmem_cache_do_shrink(struct kmem_cache *s)
 		for (i = 0; i < SHRINK_PROMOTE_MAX; i++)
 			INIT_LIST_HEAD(promote + i);
 
+		if (n->barn)
+			barn_shrink(s, n->barn);
+
 		spin_lock_irqsave(&n->list_lock, flags);
 
 		/*
@@ -6211,12 +7198,24 @@ static int slab_mem_going_online_callback(int nid)
 	 */
 	mutex_lock(&slab_mutex);
 	list_for_each_entry(s, &slab_caches, list) {
+		struct node_barn *barn = NULL;
+
 		/*
 		 * The structure may already exist if the node was previously
 		 * onlined and offlined.
 		 */
 		if (get_node(s, nid))
 			continue;
+
+		if (s->cpu_sheaves) {
+			barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, nid);
+
+			if (!barn) {
+				ret = -ENOMEM;
+				goto out;
+			}
+		}
+
 		/*
 		 * XXX: kmem_cache_alloc_node will fallback to other nodes
 		 *      since memory is not yet available from the node that
@@ -6224,10 +7223,13 @@ static int slab_mem_going_online_callback(int nid)
 		 */
 		n = kmem_cache_alloc(kmem_cache_node, GFP_KERNEL);
 		if (!n) {
+			kfree(barn);
 			ret = -ENOMEM;
 			goto out;
 		}
-		init_kmem_cache_node(n);
+
+		init_kmem_cache_node(n, barn);
+
 		s->node[nid] = n;
 	}
 	/*
@@ -6440,6 +7442,17 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
 
 	set_cpu_partial(s);
 
+	if (args->sheaf_capacity && !IS_ENABLED(CONFIG_SLUB_TINY)
+					&& !(s->flags & SLAB_DEBUG_FLAGS)) {
+		s->cpu_sheaves = alloc_percpu(struct slub_percpu_sheaves);
+		if (!s->cpu_sheaves) {
+			err = -ENOMEM;
+			goto out;
+		}
+		// TODO: increase capacity to grow slab_sheaf up to next kmalloc size?
+		s->sheaf_capacity = args->sheaf_capacity;
+	}
+
 #ifdef CONFIG_NUMA
 	s->remote_node_defrag_ratio = 1000;
 #endif
@@ -6456,6 +7469,12 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
 	if (!alloc_kmem_cache_cpus(s))
 		goto out;
 
+	if (s->cpu_sheaves) {
+		err = init_percpu_sheaves(s);
+		if (err)
+			goto out;
+	}
+
 	err = 0;
 
 	/* Mutex is not taken during early boot */
@@ -6908,6 +7927,12 @@ static ssize_t order_show(struct kmem_cache *s, char *buf)
 }
 SLAB_ATTR_RO(order);
 
+static ssize_t sheaf_capacity_show(struct kmem_cache *s, char *buf)
+{
+	return sysfs_emit(buf, "%u\n", s->sheaf_capacity);
+}
+SLAB_ATTR_RO(sheaf_capacity);
+
 static ssize_t min_partial_show(struct kmem_cache *s, char *buf)
 {
 	return sysfs_emit(buf, "%lu\n", s->min_partial);
@@ -7255,8 +8280,10 @@ static ssize_t text##_store(struct kmem_cache *s,		\
 }								\
 SLAB_ATTR(text);						\
 
+STAT_ATTR(ALLOC_PCS, alloc_cpu_sheaf);
 STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
 STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
+STAT_ATTR(FREE_PCS, free_cpu_sheaf);
 STAT_ATTR(FREE_FASTPATH, free_fastpath);
 STAT_ATTR(FREE_SLOWPATH, free_slowpath);
 STAT_ATTR(FREE_FROZEN, free_frozen);
@@ -7281,6 +8308,14 @@ STAT_ATTR(CPU_PARTIAL_ALLOC, cpu_partial_alloc);
 STAT_ATTR(CPU_PARTIAL_FREE, cpu_partial_free);
 STAT_ATTR(CPU_PARTIAL_NODE, cpu_partial_node);
 STAT_ATTR(CPU_PARTIAL_DRAIN, cpu_partial_drain);
+STAT_ATTR(SHEAF_FLUSH, sheaf_flush);
+STAT_ATTR(SHEAF_REFILL, sheaf_refill);
+STAT_ATTR(SHEAF_ALLOC, sheaf_alloc);
+STAT_ATTR(SHEAF_FREE, sheaf_free);
+STAT_ATTR(BARN_GET, barn_get);
+STAT_ATTR(BARN_GET_FAIL, barn_get_fail);
+STAT_ATTR(BARN_PUT, barn_put);
+STAT_ATTR(BARN_PUT_FAIL, barn_put_fail);
 #endif	/* CONFIG_SLUB_STATS */
 
 #ifdef CONFIG_KFENCE
@@ -7311,6 +8346,7 @@ static struct attribute *slab_attrs[] = {
 	&object_size_attr.attr,
 	&objs_per_slab_attr.attr,
 	&order_attr.attr,
+	&sheaf_capacity_attr.attr,
 	&min_partial_attr.attr,
 	&cpu_partial_attr.attr,
 	&objects_partial_attr.attr,
@@ -7342,8 +8378,10 @@ static struct attribute *slab_attrs[] = {
 	&remote_node_defrag_ratio_attr.attr,
 #endif
 #ifdef CONFIG_SLUB_STATS
+	&alloc_cpu_sheaf_attr.attr,
 	&alloc_fastpath_attr.attr,
 	&alloc_slowpath_attr.attr,
+	&free_cpu_sheaf_attr.attr,
 	&free_fastpath_attr.attr,
 	&free_slowpath_attr.attr,
 	&free_frozen_attr.attr,
@@ -7368,6 +8406,14 @@ static struct attribute *slab_attrs[] = {
 	&cpu_partial_free_attr.attr,
 	&cpu_partial_node_attr.attr,
 	&cpu_partial_drain_attr.attr,
+	&sheaf_flush_attr.attr,
+	&sheaf_refill_attr.attr,
+	&sheaf_alloc_attr.attr,
+	&sheaf_free_attr.attr,
+	&barn_get_attr.attr,
+	&barn_get_fail_attr.attr,
+	&barn_put_attr.attr,
+	&barn_put_fail_attr.attr,
 #endif
 #ifdef CONFIG_FAILSLAB
 	&failslab_attr.attr,

-- 
2.51.0



^ permalink raw reply	[flat|nested] 95+ messages in thread

* [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-09-10  8:01 [PATCH v8 00/23] SLUB percpu sheaves Vlastimil Babka
                   ` (2 preceding siblings ...)
  2025-09-10  8:01 ` [PATCH v8 03/23] slab: add opt-in caching layer of percpu sheaves Vlastimil Babka
@ 2025-09-10  8:01 ` Vlastimil Babka
  2025-09-12  0:38   ` Sergey Senozhatsky
                     ` (2 more replies)
  2025-09-10  8:01 ` [PATCH v8 05/23] slab: sheaf prefilling for guaranteed allocations Vlastimil Babka
                   ` (19 subsequent siblings)
  23 siblings, 3 replies; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-10  8:01 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, vbabka

Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
For caches with sheaves, on each cpu maintain a rcu_free sheaf in
addition to main and spare sheaves.

kfree_rcu() operations will try to put objects on this sheaf. Once full,
the sheaf is detached and submitted to call_rcu() with a handler that
will try to put it in the barn, or flush to slab pages using bulk free,
when the barn is full. Then a new empty sheaf must be obtained to put
more objects there.

It's possible that no free sheaves are available to use for a new
rcu_free sheaf, and the allocation in kfree_rcu() context can only use
GFP_NOWAIT and thus may fail. In that case, fall back to the existing
kfree_rcu() implementation.
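
From a caller's perspective nothing changes: a regular kfree_rcu() call is
transparently routed to the rcu_free sheaf when possible. A minimal sketch,
with illustrative struct and field names not taken from this series:

/* the object's kmem_cache is assumed to have sheaves enabled */
struct foo {
	int data;
	struct rcu_head rcu;
};

static void foo_release(struct foo *f)
{
	/* may now be batched into a percpu rcu_free sheaf */
	kfree_rcu(f, rcu);
}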

Expected advantages:
- batching the kfree_rcu() operations, that could eventually replace the
  existing batching
- sheaves can be reused for allocations via barn instead of being
  flushed to slabs, which is more efficient
  - this includes cases where only some cpus are allowed to process rcu
    callbacks (Android)

Possible disadvantage:
- objects might be waiting for more than their grace period (it is
  determined by the last object freed into the sheaf), increasing memory
  usage - but the existing batching does that too.

Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
implementation favors smaller memory footprint over performance.

Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
contexts where kfree_rcu() is called might not be compatible with taking
a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
spinlock - the current kfree_rcu() implementation avoids doing that.

Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
that have them. This is not a cheap operation, but the barrier usage is
rare - currently kmem_cache_destroy() or on module unload.

Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
count how many kfree_rcu() used the rcu_free sheaf successfully and how
many had to fall back to the existing implementation.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slab.h        |   3 +
 mm/slab_common.c |  26 ++++++
 mm/slub.c        | 266 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 293 insertions(+), 2 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index 206987ce44a4d053ebe3b5e50784d2dd23822cd1..e82e51c44bd00042d433ac8b46c2b4bbbdded9b1 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -435,6 +435,9 @@ static inline bool is_kmalloc_normal(struct kmem_cache *s)
 	return !(s->flags & (SLAB_CACHE_DMA|SLAB_ACCOUNT|SLAB_RECLAIM_ACCOUNT));
 }
 
+bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj);
+void flush_all_rcu_sheaves(void);
+
 #define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | \
 			 SLAB_CACHE_DMA32 | SLAB_PANIC | \
 			 SLAB_TYPESAFE_BY_RCU | SLAB_DEBUG_OBJECTS | \
diff --git a/mm/slab_common.c b/mm/slab_common.c
index e2b197e47866c30acdbd1fee4159f262a751c5a7..005a4319c06a01d2b616a75396fcc43766a62ddb 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1608,6 +1608,27 @@ static void kfree_rcu_work(struct work_struct *work)
 		kvfree_rcu_list(head);
 }
 
+static bool kfree_rcu_sheaf(void *obj)
+{
+	struct kmem_cache *s;
+	struct folio *folio;
+	struct slab *slab;
+
+	if (is_vmalloc_addr(obj))
+		return false;
+
+	folio = virt_to_folio(obj);
+	if (unlikely(!folio_test_slab(folio)))
+		return false;
+
+	slab = folio_slab(folio);
+	s = slab->slab_cache;
+	if (s->cpu_sheaves)
+		return __kfree_rcu_sheaf(s, obj);
+
+	return false;
+}
+
 static bool
 need_offload_krc(struct kfree_rcu_cpu *krcp)
 {
@@ -1952,6 +1973,9 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
 	if (!head)
 		might_sleep();
 
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT) && kfree_rcu_sheaf(ptr))
+		return;
+
 	// Queue the object but don't yet schedule the batch.
 	if (debug_rcu_head_queue(ptr)) {
 		// Probable double kfree_rcu(), just leak.
@@ -2026,6 +2050,8 @@ void kvfree_rcu_barrier(void)
 	bool queued;
 	int i, cpu;
 
+	flush_all_rcu_sheaves();
+
 	/*
 	 * Firstly we detach objects and queue them over an RCU-batch
 	 * for all CPUs. Finally queued works are flushed for each CPU.
diff --git a/mm/slub.c b/mm/slub.c
index cba188b7e04ddf86debf9bc27a2f725db1b2056e..19cd8444ae5d210c77ae767912ca1ff3fc69c2a8 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -367,6 +367,8 @@ enum stat_item {
 	ALLOC_FASTPATH,		/* Allocation from cpu slab */
 	ALLOC_SLOWPATH,		/* Allocation by getting a new cpu slab */
 	FREE_PCS,		/* Free to percpu sheaf */
+	FREE_RCU_SHEAF,		/* Free to rcu_free sheaf */
+	FREE_RCU_SHEAF_FAIL,	/* Failed to free to a rcu_free sheaf */
 	FREE_FASTPATH,		/* Free to cpu slab */
 	FREE_SLOWPATH,		/* Freeing not to cpu slab */
 	FREE_FROZEN,		/* Freeing to frozen slab */
@@ -461,6 +463,7 @@ struct slab_sheaf {
 		struct rcu_head rcu_head;
 		struct list_head barn_list;
 	};
+	struct kmem_cache *cache;
 	unsigned int size;
 	void *objects[];
 };
@@ -469,6 +472,7 @@ struct slub_percpu_sheaves {
 	local_trylock_t lock;
 	struct slab_sheaf *main; /* never NULL when unlocked */
 	struct slab_sheaf *spare; /* empty or full, may be NULL */
+	struct slab_sheaf *rcu_free; /* for batching kfree_rcu() */
 };
 
 /*
@@ -2531,6 +2535,8 @@ static struct slab_sheaf *alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp)
 	if (unlikely(!sheaf))
 		return NULL;
 
+	sheaf->cache = s;
+
 	stat(s, SHEAF_ALLOC);
 
 	return sheaf;
@@ -2655,6 +2661,43 @@ static void sheaf_flush_unused(struct kmem_cache *s, struct slab_sheaf *sheaf)
 	sheaf->size = 0;
 }
 
+static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
+				     struct slab_sheaf *sheaf)
+{
+	bool init = slab_want_init_on_free(s);
+	void **p = &sheaf->objects[0];
+	unsigned int i = 0;
+
+	while (i < sheaf->size) {
+		struct slab *slab = virt_to_slab(p[i]);
+
+		memcg_slab_free_hook(s, slab, p + i, 1);
+		alloc_tagging_slab_free_hook(s, slab, p + i, 1);
+
+		if (unlikely(!slab_free_hook(s, p[i], init, true))) {
+			p[i] = p[--sheaf->size];
+			continue;
+		}
+
+		i++;
+	}
+}
+
+static void rcu_free_sheaf_nobarn(struct rcu_head *head)
+{
+	struct slab_sheaf *sheaf;
+	struct kmem_cache *s;
+
+	sheaf = container_of(head, struct slab_sheaf, rcu_head);
+	s = sheaf->cache;
+
+	__rcu_free_sheaf_prepare(s, sheaf);
+
+	sheaf_flush_unused(s, sheaf);
+
+	free_empty_sheaf(s, sheaf);
+}
+
 /*
  * Caller needs to make sure migration is disabled in order to fully flush
  * single cpu's sheaves
@@ -2667,7 +2710,7 @@ static void sheaf_flush_unused(struct kmem_cache *s, struct slab_sheaf *sheaf)
 static void pcs_flush_all(struct kmem_cache *s)
 {
 	struct slub_percpu_sheaves *pcs;
-	struct slab_sheaf *spare;
+	struct slab_sheaf *spare, *rcu_free;
 
 	local_lock(&s->cpu_sheaves->lock);
 	pcs = this_cpu_ptr(s->cpu_sheaves);
@@ -2675,6 +2718,9 @@ static void pcs_flush_all(struct kmem_cache *s)
 	spare = pcs->spare;
 	pcs->spare = NULL;
 
+	rcu_free = pcs->rcu_free;
+	pcs->rcu_free = NULL;
+
 	local_unlock(&s->cpu_sheaves->lock);
 
 	if (spare) {
@@ -2682,6 +2728,9 @@ static void pcs_flush_all(struct kmem_cache *s)
 		free_empty_sheaf(s, spare);
 	}
 
+	if (rcu_free)
+		call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
+
 	sheaf_flush_main(s);
 }
 
@@ -2698,6 +2747,11 @@ static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
 		free_empty_sheaf(s, pcs->spare);
 		pcs->spare = NULL;
 	}
+
+	if (pcs->rcu_free) {
+		call_rcu(&pcs->rcu_free->rcu_head, rcu_free_sheaf_nobarn);
+		pcs->rcu_free = NULL;
+	}
 }
 
 static void pcs_destroy(struct kmem_cache *s)
@@ -2723,6 +2777,7 @@ static void pcs_destroy(struct kmem_cache *s)
 		 */
 
 		WARN_ON(pcs->spare);
+		WARN_ON(pcs->rcu_free);
 
 		if (!WARN_ON(pcs->main->size)) {
 			free_empty_sheaf(s, pcs->main);
@@ -3780,7 +3835,7 @@ static bool has_pcs_used(int cpu, struct kmem_cache *s)
 
 	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
 
-	return (pcs->spare || pcs->main->size);
+	return (pcs->spare || pcs->rcu_free || pcs->main->size);
 }
 
 /*
@@ -3840,6 +3895,80 @@ static void flush_all(struct kmem_cache *s)
 	cpus_read_unlock();
 }
 
+static void flush_rcu_sheaf(struct work_struct *w)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct slab_sheaf *rcu_free;
+	struct slub_flush_work *sfw;
+	struct kmem_cache *s;
+
+	sfw = container_of(w, struct slub_flush_work, work);
+	s = sfw->s;
+
+	local_lock(&s->cpu_sheaves->lock);
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	rcu_free = pcs->rcu_free;
+	pcs->rcu_free = NULL;
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+	if (rcu_free)
+		call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
+}
+
+
+/* needed for kvfree_rcu_barrier() */
+void flush_all_rcu_sheaves(void)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct slub_flush_work *sfw;
+	struct kmem_cache *s;
+	bool flushed = false;
+	unsigned int cpu;
+
+	cpus_read_lock();
+	mutex_lock(&slab_mutex);
+
+	list_for_each_entry(s, &slab_caches, list) {
+		if (!s->cpu_sheaves)
+			continue;
+
+		mutex_lock(&flush_lock);
+
+		for_each_online_cpu(cpu) {
+			sfw = &per_cpu(slub_flush, cpu);
+			pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
+
+			if (!pcs->rcu_free || !pcs->rcu_free->size) {
+				sfw->skip = true;
+				continue;
+			}
+
+			INIT_WORK(&sfw->work, flush_rcu_sheaf);
+			sfw->skip = false;
+			sfw->s = s;
+			queue_work_on(cpu, flushwq, &sfw->work);
+			flushed = true;
+		}
+
+		for_each_online_cpu(cpu) {
+			sfw = &per_cpu(slub_flush, cpu);
+			if (sfw->skip)
+				continue;
+			flush_work(&sfw->work);
+		}
+
+		mutex_unlock(&flush_lock);
+	}
+
+	mutex_unlock(&slab_mutex);
+	cpus_read_unlock();
+
+	if (flushed)
+		rcu_barrier();
+}
+
 /*
  * Use the cpu notifier to insure that the cpu slabs are flushed when
  * necessary.
@@ -5413,6 +5542,130 @@ bool free_to_pcs(struct kmem_cache *s, void *object)
 	return true;
 }
 
+static void rcu_free_sheaf(struct rcu_head *head)
+{
+	struct slab_sheaf *sheaf;
+	struct node_barn *barn;
+	struct kmem_cache *s;
+
+	sheaf = container_of(head, struct slab_sheaf, rcu_head);
+
+	s = sheaf->cache;
+
+	/*
+	 * This may remove some objects due to slab_free_hook() returning false,
+	 * so that the sheaf might no longer be completely full. But it's easier
+	 * to handle it as full (unless it became completely empty), as the code
+	 * handles it fine. The only downside is that the sheaf will serve fewer
+	 * allocations when reused. This only happens due to debugging, which is a
+	 * performance hit anyway.
+	 */
+	__rcu_free_sheaf_prepare(s, sheaf);
+
+	barn = get_node(s, numa_mem_id())->barn;
+
+	/* due to slab_free_hook() */
+	if (unlikely(sheaf->size == 0))
+		goto empty;
+
+	/*
+	 * Checking nr_full/nr_empty outside lock avoids contention in case the
+	 * barn is at the respective limit. Due to the race we might go over the
+	 * limit but that should be rare and harmless.
+	 */
+
+	if (data_race(barn->nr_full) < MAX_FULL_SHEAVES) {
+		stat(s, BARN_PUT);
+		barn_put_full_sheaf(barn, sheaf);
+		return;
+	}
+
+	stat(s, BARN_PUT_FAIL);
+	sheaf_flush_unused(s, sheaf);
+
+empty:
+	if (data_race(barn->nr_empty) < MAX_EMPTY_SHEAVES) {
+		barn_put_empty_sheaf(barn, sheaf);
+		return;
+	}
+
+	free_empty_sheaf(s, sheaf);
+}
+
+bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct slab_sheaf *rcu_sheaf;
+
+	if (!local_trylock(&s->cpu_sheaves->lock))
+		goto fail;
+
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (unlikely(!pcs->rcu_free)) {
+
+		struct slab_sheaf *empty;
+		struct node_barn *barn;
+
+		if (pcs->spare && pcs->spare->size == 0) {
+			pcs->rcu_free = pcs->spare;
+			pcs->spare = NULL;
+			goto do_free;
+		}
+
+		barn = get_barn(s);
+
+		empty = barn_get_empty_sheaf(barn);
+
+		if (empty) {
+			pcs->rcu_free = empty;
+			goto do_free;
+		}
+
+		local_unlock(&s->cpu_sheaves->lock);
+
+		empty = alloc_empty_sheaf(s, GFP_NOWAIT);
+
+		if (!empty)
+			goto fail;
+
+		if (!local_trylock(&s->cpu_sheaves->lock)) {
+			barn_put_empty_sheaf(barn, empty);
+			goto fail;
+		}
+
+		pcs = this_cpu_ptr(s->cpu_sheaves);
+
+		if (unlikely(pcs->rcu_free))
+			barn_put_empty_sheaf(barn, empty);
+		else
+			pcs->rcu_free = empty;
+	}
+
+do_free:
+
+	rcu_sheaf = pcs->rcu_free;
+
+	rcu_sheaf->objects[rcu_sheaf->size++] = obj;
+
+	if (likely(rcu_sheaf->size < s->sheaf_capacity))
+		rcu_sheaf = NULL;
+	else
+		pcs->rcu_free = NULL;
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+	if (rcu_sheaf)
+		call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);
+
+	stat(s, FREE_RCU_SHEAF);
+	return true;
+
+fail:
+	stat(s, FREE_RCU_SHEAF_FAIL);
+	return false;
+}
+
 /*
  * Bulk free objects to the percpu sheaves.
  * Unlike free_to_pcs() this includes the calls to all necessary hooks
@@ -6909,6 +7162,11 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
 	struct kmem_cache_node *n;
 
 	flush_all_cpus_locked(s);
+
+	/* we might have rcu sheaves in flight */
+	if (s->cpu_sheaves)
+		rcu_barrier();
+
 	/* Attempt to free all objects */
 	for_each_kmem_cache_node(s, node, n) {
 		if (n->barn)
@@ -8284,6 +8542,8 @@ STAT_ATTR(ALLOC_PCS, alloc_cpu_sheaf);
 STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
 STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
 STAT_ATTR(FREE_PCS, free_cpu_sheaf);
+STAT_ATTR(FREE_RCU_SHEAF, free_rcu_sheaf);
+STAT_ATTR(FREE_RCU_SHEAF_FAIL, free_rcu_sheaf_fail);
 STAT_ATTR(FREE_FASTPATH, free_fastpath);
 STAT_ATTR(FREE_SLOWPATH, free_slowpath);
 STAT_ATTR(FREE_FROZEN, free_frozen);
@@ -8382,6 +8642,8 @@ static struct attribute *slab_attrs[] = {
 	&alloc_fastpath_attr.attr,
 	&alloc_slowpath_attr.attr,
 	&free_cpu_sheaf_attr.attr,
+	&free_rcu_sheaf_attr.attr,
+	&free_rcu_sheaf_fail_attr.attr,
 	&free_fastpath_attr.attr,
 	&free_slowpath_attr.attr,
 	&free_frozen_attr.attr,

-- 
2.51.0



^ permalink raw reply	[flat|nested] 95+ messages in thread

* [PATCH v8 05/23] slab: sheaf prefilling for guaranteed allocations
  2025-09-10  8:01 [PATCH v8 00/23] SLUB percpu sheaves Vlastimil Babka
                   ` (3 preceding siblings ...)
  2025-09-10  8:01 ` [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations Vlastimil Babka
@ 2025-09-10  8:01 ` Vlastimil Babka
  2025-09-10  8:01 ` [PATCH v8 06/23] slab: determine barn status racily outside of lock Vlastimil Babka
                   ` (18 subsequent siblings)
  23 siblings, 0 replies; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-10  8:01 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, vbabka

Add functions for efficient guaranteed allocations e.g. in a critical
section that cannot sleep, when the exact number of allocations is not
known beforehand, but an upper limit can be calculated.

kmem_cache_prefill_sheaf() returns a sheaf containing at least given
number of objects.

kmem_cache_alloc_from_sheaf() will allocate an object from the sheaf
and is guaranteed not to fail until depleted.

kmem_cache_return_sheaf() is for giving the sheaf back to the slab
allocator after the critical section. This will also attempt to refill
it to the cache's sheaf capacity for better efficiency of sheaves handling,
but it's not strictly necessary for the refill to succeed.

kmem_cache_refill_sheaf() can be used to refill a previously obtained
sheaf to the requested size. If the current size is sufficient, it does
nothing. If the requested size exceeds the cache's sheaf_capacity and the
sheaf's current capacity, the sheaf will be replaced with a new one,
hence the indirect pointer parameter.

kmem_cache_sheaf_size() can be used to query the current size.
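
A minimal usage sketch of the above API; the cache pointer, the upper bound
of 8 objects and the surrounding error handling are illustrative and assume
the cache was created with a nonzero sheaf_capacity:

static int foo_prepare(struct kmem_cache *foo_cache)
{
	struct slab_sheaf *sheaf;
	void *obj;

	/* outside the restricted context: may sleep and may fail */
	sheaf = kmem_cache_prefill_sheaf(foo_cache, GFP_KERNEL, 8);
	if (!sheaf)
		return -ENOMEM;

	/*
	 * inside the restricted context: gfp only conveys __GFP_ZERO or
	 * __GFP_ACCOUNT, the allocation cannot fail until all 8 prefilled
	 * objects have been taken
	 */
	obj = kmem_cache_alloc_from_sheaf(foo_cache, GFP_KERNEL, sheaf);

	/* ... use obj ... */

	/* afterwards hand the sheaf back; it may be refilled or flushed */
	kmem_cache_return_sheaf(foo_cache, GFP_KERNEL, sheaf);

	return 0;
}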

The implementation supports requesting sizes that exceed cache's
sheaf_capacity, but it is not efficient - such "oversize" sheaves are
allocated fresh in kmem_cache_prefill_sheaf() and flushed and freed
immediately by kmem_cache_return_sheaf(). kmem_cache_refill_sheaf()
might be especially inefficient when replacing a sheaf with a new one of
a larger capacity. It is therefore better to size the cache's
sheaf_capacity accordingly to make oversize sheaves exceptional.

CONFIG_SLUB_STATS counters are added for sheaf prefill and return
operations. A prefill or return is considered _fast when it is able to
grab or return a percpu spare sheaf (even if the sheaf needs a refill to
satisfy the request, as those should amortize over time), and _slow
otherwise (when the barn or even sheaf allocation/freeing has to be
involved). sheaf_prefill_oversize is provided to determine how many
prefills were oversize (counter for oversize returns is not necessary as
all oversize refills result in oversize returns).

When slub_debug is enabled for a cache with sheaves, no percpu sheaves
exist for it, but the prefill functionality is still provided simply by
all prefilled sheaves becoming oversize. If percpu sheaves are not
created for a cache due to not passing the sheaf_capacity argument on
cache creation, the prefills also work through oversize sheaves, but
there's a WARN_ON_ONCE() to indicate the omission.

Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/slab.h |  16 ++++
 mm/slub.c            | 263 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 279 insertions(+)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 49acbcdc6696fd120c402adf757b3f41660ad50a..680193356ac7a22f9df5cd9b71ff8b81e26404ad 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -829,6 +829,22 @@ void *kmem_cache_alloc_node_noprof(struct kmem_cache *s, gfp_t flags,
 				   int node) __assume_slab_alignment __malloc;
 #define kmem_cache_alloc_node(...)	alloc_hooks(kmem_cache_alloc_node_noprof(__VA_ARGS__))
 
+struct slab_sheaf *
+kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size);
+
+int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
+		struct slab_sheaf **sheafp, unsigned int size);
+
+void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
+				       struct slab_sheaf *sheaf);
+
+void *kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *cachep, gfp_t gfp,
+			struct slab_sheaf *sheaf) __assume_slab_alignment __malloc;
+#define kmem_cache_alloc_from_sheaf(...)	\
+			alloc_hooks(kmem_cache_alloc_from_sheaf_noprof(__VA_ARGS__))
+
+unsigned int kmem_cache_sheaf_size(struct slab_sheaf *sheaf);
+
 /*
  * These macros allow declaring a kmem_buckets * parameter alongside size, which
  * can be compiled out with CONFIG_SLAB_BUCKETS=n so that a large number of call
diff --git a/mm/slub.c b/mm/slub.c
index 19cd8444ae5d210c77ae767912ca1ff3fc69c2a8..38f5b865d3093556171e0f6530d395718b438099 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -401,6 +401,11 @@ enum stat_item {
 	BARN_GET_FAIL,		/* Failed to get full sheaf from barn */
 	BARN_PUT,		/* Put full sheaf to barn */
 	BARN_PUT_FAIL,		/* Failed to put full sheaf to barn */
+	SHEAF_PREFILL_FAST,	/* Sheaf prefill grabbed the spare sheaf */
+	SHEAF_PREFILL_SLOW,	/* Sheaf prefill found no spare sheaf */
+	SHEAF_PREFILL_OVERSIZE,	/* Allocation of oversize sheaf for prefill */
+	SHEAF_RETURN_FAST,	/* Sheaf return reattached spare sheaf */
+	SHEAF_RETURN_SLOW,	/* Sheaf return could not reattach spare */
 	NR_SLUB_STAT_ITEMS
 };
 
@@ -462,6 +467,8 @@ struct slab_sheaf {
 	union {
 		struct rcu_head rcu_head;
 		struct list_head barn_list;
+		/* only used for prefilled sheafs */
+		unsigned int capacity;
 	};
 	struct kmem_cache *cache;
 	unsigned int size;
@@ -2838,6 +2845,30 @@ static void barn_put_full_sheaf(struct node_barn *barn, struct slab_sheaf *sheaf
 	spin_unlock_irqrestore(&barn->lock, flags);
 }
 
+static struct slab_sheaf *barn_get_full_or_empty_sheaf(struct node_barn *barn)
+{
+	struct slab_sheaf *sheaf = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	if (barn->nr_full) {
+		sheaf = list_first_entry(&barn->sheaves_full, struct slab_sheaf,
+					barn_list);
+		list_del(&sheaf->barn_list);
+		barn->nr_full--;
+	} else if (barn->nr_empty) {
+		sheaf = list_first_entry(&barn->sheaves_empty,
+					 struct slab_sheaf, barn_list);
+		list_del(&sheaf->barn_list);
+		barn->nr_empty--;
+	}
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+
+	return sheaf;
+}
+
 /*
  * If a full sheaf is available, return it and put the supplied empty one to
  * barn. We ignore the limit on empty sheaves as the number of sheaves doesn't
@@ -5042,6 +5073,228 @@ void *kmem_cache_alloc_node_noprof(struct kmem_cache *s, gfp_t gfpflags, int nod
 }
 EXPORT_SYMBOL(kmem_cache_alloc_node_noprof);
 
+/*
+ * returns a sheaf that has at least the requested size
+ * when prefilling is needed, do so with given gfp flags
+ *
+ * return NULL if sheaf allocation or prefilling failed
+ */
+struct slab_sheaf *
+kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct slab_sheaf *sheaf = NULL;
+
+	if (unlikely(size > s->sheaf_capacity)) {
+
+		/*
+		 * slab_debug disables cpu sheaves intentionally so all
+		 * prefilled sheaves become "oversize" and we give up on
+		 * performance for the debugging. Same with SLUB_TINY.
+		 * Creating a cache without sheaves and then requesting a
+		 * prefilled sheaf is however not expected, so warn.
+		 */
+		WARN_ON_ONCE(s->sheaf_capacity == 0 &&
+			     !IS_ENABLED(CONFIG_SLUB_TINY) &&
+			     !(s->flags & SLAB_DEBUG_FLAGS));
+
+		sheaf = kzalloc(struct_size(sheaf, objects, size), gfp);
+		if (!sheaf)
+			return NULL;
+
+		stat(s, SHEAF_PREFILL_OVERSIZE);
+		sheaf->cache = s;
+		sheaf->capacity = size;
+
+		if (!__kmem_cache_alloc_bulk(s, gfp, size,
+					     &sheaf->objects[0])) {
+			kfree(sheaf);
+			return NULL;
+		}
+
+		sheaf->size = size;
+
+		return sheaf;
+	}
+
+	local_lock(&s->cpu_sheaves->lock);
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (pcs->spare) {
+		sheaf = pcs->spare;
+		pcs->spare = NULL;
+		stat(s, SHEAF_PREFILL_FAST);
+	} else {
+		stat(s, SHEAF_PREFILL_SLOW);
+		sheaf = barn_get_full_or_empty_sheaf(get_barn(s));
+		if (sheaf && sheaf->size)
+			stat(s, BARN_GET);
+		else
+			stat(s, BARN_GET_FAIL);
+	}
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+
+	if (!sheaf)
+		sheaf = alloc_empty_sheaf(s, gfp);
+
+	if (sheaf && sheaf->size < size) {
+		if (refill_sheaf(s, sheaf, gfp)) {
+			sheaf_flush_unused(s, sheaf);
+			free_empty_sheaf(s, sheaf);
+			sheaf = NULL;
+		}
+	}
+
+	if (sheaf)
+		sheaf->capacity = s->sheaf_capacity;
+
+	return sheaf;
+}
+
+/*
+ * Use this to return a sheaf obtained by kmem_cache_prefill_sheaf()
+ *
+ * If the sheaf cannot simply become the percpu spare sheaf, but there's space
+ * for a full sheaf in the barn, we try to refill the sheaf back to the cache's
+ * sheaf_capacity to avoid handling partially full sheaves.
+ *
+ * If the refill fails because gfp is e.g. GFP_NOWAIT, or the barn is full, the
+ * sheaf is instead flushed and freed.
+ */
+void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
+			     struct slab_sheaf *sheaf)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct node_barn *barn;
+
+	if (unlikely(sheaf->capacity != s->sheaf_capacity)) {
+		sheaf_flush_unused(s, sheaf);
+		kfree(sheaf);
+		return;
+	}
+
+	local_lock(&s->cpu_sheaves->lock);
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+	barn = get_barn(s);
+
+	if (!pcs->spare) {
+		pcs->spare = sheaf;
+		sheaf = NULL;
+		stat(s, SHEAF_RETURN_FAST);
+	}
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+	if (!sheaf)
+		return;
+
+	stat(s, SHEAF_RETURN_SLOW);
+
+	/*
+	 * If the barn has too many full sheaves or we fail to refill the sheaf,
+	 * simply flush and free it.
+	 */
+	if (data_race(barn->nr_full) >= MAX_FULL_SHEAVES ||
+	    refill_sheaf(s, sheaf, gfp)) {
+		sheaf_flush_unused(s, sheaf);
+		free_empty_sheaf(s, sheaf);
+		return;
+	}
+
+	barn_put_full_sheaf(barn, sheaf);
+	stat(s, BARN_PUT);
+}
+
+/*
+ * refill a sheaf previously returned by kmem_cache_prefill_sheaf to at least
+ * the given size
+ *
+ * the sheaf might be replaced by a new one when requesting more than
+ * s->sheaf_capacity objects; if such replacement is necessary but the refill
+ * fails (returning -ENOMEM), the existing sheaf is left intact
+ *
+ * In practice we always refill to full sheaf's capacity.
+ */
+int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
+			    struct slab_sheaf **sheafp, unsigned int size)
+{
+	struct slab_sheaf *sheaf;
+
+	/*
+	 * TODO: do we want to support *sheaf == NULL to be equivalent of
+	 * kmem_cache_prefill_sheaf() ?
+	 */
+	if (!sheafp || !(*sheafp))
+		return -EINVAL;
+
+	sheaf = *sheafp;
+	if (sheaf->size >= size)
+		return 0;
+
+	if (likely(sheaf->capacity >= size)) {
+		if (likely(sheaf->capacity == s->sheaf_capacity))
+			return refill_sheaf(s, sheaf, gfp);
+
+		if (!__kmem_cache_alloc_bulk(s, gfp, sheaf->capacity - sheaf->size,
+					     &sheaf->objects[sheaf->size])) {
+			return -ENOMEM;
+		}
+		sheaf->size = sheaf->capacity;
+
+		return 0;
+	}
+
+	/*
+	 * We had a regular sized sheaf and need an oversize one, or we had an
+	 * oversize one already but need a larger one now.
+	 * This should be a very rare path so let's not complicate it.
+	 */
+	sheaf = kmem_cache_prefill_sheaf(s, gfp, size);
+	if (!sheaf)
+		return -ENOMEM;
+
+	kmem_cache_return_sheaf(s, gfp, *sheafp);
+	*sheafp = sheaf;
+	return 0;
+}
+
+/*
+ * Allocate from a sheaf obtained by kmem_cache_prefill_sheaf()
+ *
+ * Guaranteed not to fail for as many allocations as the requested size.
+ * After the sheaf is emptied, it fails - no fallback to the slab cache itself.
+ *
+ * The gfp parameter is meant only to specify __GFP_ZERO or __GFP_ACCOUNT;
+ * memcg charging is forced over the limit if necessary, to avoid failure.
+ */
+void *
+kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *s, gfp_t gfp,
+				   struct slab_sheaf *sheaf)
+{
+	void *ret = NULL;
+	bool init;
+
+	if (sheaf->size == 0)
+		goto out;
+
+	ret = sheaf->objects[--sheaf->size];
+
+	init = slab_want_init_on_alloc(gfp, s);
+
+	/* add __GFP_NOFAIL to force successful memcg charging */
+	slab_post_alloc_hook(s, NULL, gfp | __GFP_NOFAIL, 1, &ret, init, s->object_size);
+out:
+	trace_kmem_cache_alloc(_RET_IP_, ret, s, gfp, NUMA_NO_NODE);
+
+	return ret;
+}
+
+unsigned int kmem_cache_sheaf_size(struct slab_sheaf *sheaf)
+{
+	return sheaf->size;
+}
 /*
  * To avoid unnecessary overhead, we pass through large allocation requests
  * directly to the page allocator. We use __GFP_COMP, because we will need to
@@ -8576,6 +8829,11 @@ STAT_ATTR(BARN_GET, barn_get);
 STAT_ATTR(BARN_GET_FAIL, barn_get_fail);
 STAT_ATTR(BARN_PUT, barn_put);
 STAT_ATTR(BARN_PUT_FAIL, barn_put_fail);
+STAT_ATTR(SHEAF_PREFILL_FAST, sheaf_prefill_fast);
+STAT_ATTR(SHEAF_PREFILL_SLOW, sheaf_prefill_slow);
+STAT_ATTR(SHEAF_PREFILL_OVERSIZE, sheaf_prefill_oversize);
+STAT_ATTR(SHEAF_RETURN_FAST, sheaf_return_fast);
+STAT_ATTR(SHEAF_RETURN_SLOW, sheaf_return_slow);
 #endif	/* CONFIG_SLUB_STATS */
 
 #ifdef CONFIG_KFENCE
@@ -8676,6 +8934,11 @@ static struct attribute *slab_attrs[] = {
 	&barn_get_fail_attr.attr,
 	&barn_put_attr.attr,
 	&barn_put_fail_attr.attr,
+	&sheaf_prefill_fast_attr.attr,
+	&sheaf_prefill_slow_attr.attr,
+	&sheaf_prefill_oversize_attr.attr,
+	&sheaf_return_fast_attr.attr,
+	&sheaf_return_slow_attr.attr,
 #endif
 #ifdef CONFIG_FAILSLAB
 	&failslab_attr.attr,

-- 
2.51.0



^ permalink raw reply	[flat|nested] 95+ messages in thread

* [PATCH v8 06/23] slab: determine barn status racily outside of lock
  2025-09-10  8:01 [PATCH v8 00/23] SLUB percpu sheaves Vlastimil Babka
                   ` (4 preceding siblings ...)
  2025-09-10  8:01 ` [PATCH v8 05/23] slab: sheaf prefilling for guaranteed allocations Vlastimil Babka
@ 2025-09-10  8:01 ` Vlastimil Babka
  2025-09-10  8:01 ` [PATCH v8 07/23] slab: skip percpu sheaves for remote object freeing Vlastimil Babka
                   ` (17 subsequent siblings)
  23 siblings, 0 replies; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-10  8:01 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, vbabka

The possibility of many barn operations is determined by the current
number of full or empty sheaves. Taking the barn->lock just to find out
that e.g. there are no empty sheaves results in unnecessary overhead and
lock contention. Thus perform these checks outside of the lock with a
data_race() annotated variable read and fail quickly without taking the
lock.

Checks for sheaf availability that racily succeed obviously have to be
repeated under the lock for correctness, but we can skip repeating
checks if there are too many sheaves on the given list as the limits
don't need to be strict.
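
For reference, data_race() is the KCSAN annotation that marks such an
intentionally racy read so it is not reported as a data race. A generic
illustration (the structure below is made up, not code from this patch):

struct my_counter {
	spinlock_t lock;
	unsigned int count;	/* normally updated under lock */
};

static bool my_counter_maybe_nonzero(struct my_counter *c)
{
	/* lockless hint only; callers re-check under c->lock if it matters */
	return data_race(c->count) != 0;
}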

Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slub.c | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 38f5b865d3093556171e0f6530d395718b438099..35274ce4e709c9da7ac8f9006c824f28709e923d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2801,9 +2801,12 @@ static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn)
 	struct slab_sheaf *empty = NULL;
 	unsigned long flags;
 
+	if (!data_race(barn->nr_empty))
+		return NULL;
+
 	spin_lock_irqsave(&barn->lock, flags);
 
-	if (barn->nr_empty) {
+	if (likely(barn->nr_empty)) {
 		empty = list_first_entry(&barn->sheaves_empty,
 					 struct slab_sheaf, barn_list);
 		list_del(&empty->barn_list);
@@ -2850,6 +2853,9 @@ static struct slab_sheaf *barn_get_full_or_empty_sheaf(struct node_barn *barn)
 	struct slab_sheaf *sheaf = NULL;
 	unsigned long flags;
 
+	if (!data_race(barn->nr_full) && !data_race(barn->nr_empty))
+		return NULL;
+
 	spin_lock_irqsave(&barn->lock, flags);
 
 	if (barn->nr_full) {
@@ -2880,9 +2886,12 @@ barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty)
 	struct slab_sheaf *full = NULL;
 	unsigned long flags;
 
+	if (!data_race(barn->nr_full))
+		return NULL;
+
 	spin_lock_irqsave(&barn->lock, flags);
 
-	if (barn->nr_full) {
+	if (likely(barn->nr_full)) {
 		full = list_first_entry(&barn->sheaves_full, struct slab_sheaf,
 					barn_list);
 		list_del(&full->barn_list);
@@ -2906,19 +2915,23 @@ barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
 	struct slab_sheaf *empty;
 	unsigned long flags;
 
+	/* we don't repeat this check under barn->lock as it's not critical */
+	if (data_race(barn->nr_full) >= MAX_FULL_SHEAVES)
+		return ERR_PTR(-E2BIG);
+	if (!data_race(barn->nr_empty))
+		return ERR_PTR(-ENOMEM);
+
 	spin_lock_irqsave(&barn->lock, flags);
 
-	if (barn->nr_full >= MAX_FULL_SHEAVES) {
-		empty = ERR_PTR(-E2BIG);
-	} else if (!barn->nr_empty) {
-		empty = ERR_PTR(-ENOMEM);
-	} else {
+	if (likely(barn->nr_empty)) {
 		empty = list_first_entry(&barn->sheaves_empty, struct slab_sheaf,
 					 barn_list);
 		list_del(&empty->barn_list);
 		list_add(&full->barn_list, &barn->sheaves_full);
 		barn->nr_empty--;
 		barn->nr_full++;
+	} else {
+		empty = ERR_PTR(-ENOMEM);
 	}
 
 	spin_unlock_irqrestore(&barn->lock, flags);

-- 
2.51.0



^ permalink raw reply	[flat|nested] 95+ messages in thread

* [PATCH v8 07/23] slab: skip percpu sheaves for remote object freeing
  2025-09-10  8:01 [PATCH v8 00/23] SLUB percpu sheaves Vlastimil Babka
                   ` (5 preceding siblings ...)
  2025-09-10  8:01 ` [PATCH v8 06/23] slab: determine barn status racily outside of lock Vlastimil Babka
@ 2025-09-10  8:01 ` Vlastimil Babka
  2025-09-25 16:14   ` Suren Baghdasaryan
  2025-09-10  8:01 ` [PATCH v8 08/23] slab: allow NUMA restricted allocations to use percpu sheaves Vlastimil Babka
                   ` (16 subsequent siblings)
  23 siblings, 1 reply; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-10  8:01 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, vbabka

Since we don't control the NUMA locality of objects in percpu sheaves,
allocations with node restrictions bypass them. Allocations without
restrictions may however still expect to get local objects with high
probability, and the introduction of sheaves can decrease it due to
freed object from a remote node ending up in percpu sheaves.

The fraction of such remote frees seems low (5% on an 8-node machine)
but it can be expected that some cache or workload specific corner cases
exist. We can either conclude that this is not a problem due to the low
fraction, or we can make remote frees bypass percpu sheaves and go
directly to their slabs. This will make the remote frees more expensive,
but if it's only a small fraction, most frees will still benefit from
the lower overhead of percpu sheaves.

This patch thus makes remote object freeing bypass percpu sheaves,
including bulk freeing, and kfree_rcu() via the rcu_free sheaf. However
it's not intended to be a 100% guarantee that percpu sheaves will only
contain local objects. The refill from slabs does not provide that
guarantee in the first place, and there might be cpu migrations
happening when we need to unlock the local_lock. Avoiding all that could
be possible but complicated so we can leave it for later investigation
whether it would be worth it. It can be expected that the more selective
freeing will itself prevent accumulation of remote objects in percpu
sheaves so any such violations would have only short-term effects.

Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slab_common.c |  7 +++++--
 mm/slub.c        | 42 ++++++++++++++++++++++++++++++++++++------
 2 files changed, 41 insertions(+), 8 deletions(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 005a4319c06a01d2b616a75396fcc43766a62ddb..b6601e0fe598e24bd8d456dce4fc82c65b342bfd 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1623,8 +1623,11 @@ static bool kfree_rcu_sheaf(void *obj)
 
 	slab = folio_slab(folio);
 	s = slab->slab_cache;
-	if (s->cpu_sheaves)
-		return __kfree_rcu_sheaf(s, obj);
+	if (s->cpu_sheaves) {
+		if (likely(!IS_ENABLED(CONFIG_NUMA) ||
+			   slab_nid(slab) == numa_mem_id()))
+			return __kfree_rcu_sheaf(s, obj);
+	}
 
 	return false;
 }
diff --git a/mm/slub.c b/mm/slub.c
index 35274ce4e709c9da7ac8f9006c824f28709e923d..9699d048b2cd08ee75c4cc3d1e460868704520b1 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -472,6 +472,7 @@ struct slab_sheaf {
 	};
 	struct kmem_cache *cache;
 	unsigned int size;
+	int node; /* only used for rcu_sheaf */
 	void *objects[];
 };
 
@@ -5828,7 +5829,7 @@ static void rcu_free_sheaf(struct rcu_head *head)
 	 */
 	__rcu_free_sheaf_prepare(s, sheaf);
 
-	barn = get_node(s, numa_mem_id())->barn;
+	barn = get_node(s, sheaf->node)->barn;
 
 	/* due to slab_free_hook() */
 	if (unlikely(sheaf->size == 0))
@@ -5914,10 +5915,12 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
 
 	rcu_sheaf->objects[rcu_sheaf->size++] = obj;
 
-	if (likely(rcu_sheaf->size < s->sheaf_capacity))
+	if (likely(rcu_sheaf->size < s->sheaf_capacity)) {
 		rcu_sheaf = NULL;
-	else
+	} else {
 		pcs->rcu_free = NULL;
+		rcu_sheaf->node = numa_mem_id();
+	}
 
 	local_unlock(&s->cpu_sheaves->lock);
 
@@ -5944,7 +5947,11 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 	bool init = slab_want_init_on_free(s);
 	unsigned int batch, i = 0;
 	struct node_barn *barn;
+	void *remote_objects[PCS_BATCH_MAX];
+	unsigned int remote_nr = 0;
+	int node = numa_mem_id();
 
+next_remote_batch:
 	while (i < size) {
 		struct slab *slab = virt_to_slab(p[i]);
 
@@ -5954,7 +5961,15 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 		if (unlikely(!slab_free_hook(s, p[i], init, false))) {
 			p[i] = p[--size];
 			if (!size)
-				return;
+				goto flush_remote;
+			continue;
+		}
+
+		if (unlikely(IS_ENABLED(CONFIG_NUMA) && slab_nid(slab) != node)) {
+			remote_objects[remote_nr] = p[i];
+			p[i] = p[--size];
+			if (++remote_nr >= PCS_BATCH_MAX)
+				goto flush_remote;
 			continue;
 		}
 
@@ -6024,6 +6039,15 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 	 */
 fallback:
 	__kmem_cache_free_bulk(s, size, p);
+
+flush_remote:
+	if (remote_nr) {
+		__kmem_cache_free_bulk(s, remote_nr, &remote_objects[0]);
+		if (i < size) {
+			remote_nr = 0;
+			goto next_remote_batch;
+		}
+	}
 }
 
 #ifndef CONFIG_SLUB_TINY
@@ -6115,8 +6139,14 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
 	if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
 		return;
 
-	if (!s->cpu_sheaves || !free_to_pcs(s, object))
-		do_slab_free(s, slab, object, object, 1, addr);
+	if (s->cpu_sheaves && likely(!IS_ENABLED(CONFIG_NUMA) ||
+				     slab_nid(slab) == numa_mem_id())) {
+		if (likely(free_to_pcs(s, object))) {
+			return;
+		}
+	}
+
+	do_slab_free(s, slab, object, object, 1, addr);
 }
 
 #ifdef CONFIG_MEMCG

-- 
2.51.0



^ permalink raw reply	[flat|nested] 95+ messages in thread

* [PATCH v8 08/23] slab: allow NUMA restricted allocations to use percpu sheaves
  2025-09-10  8:01 [PATCH v8 00/23] SLUB percpu sheaves Vlastimil Babka
                   ` (6 preceding siblings ...)
  2025-09-10  8:01 ` [PATCH v8 07/23] slab: skip percpu sheaves for remote object freeing Vlastimil Babka
@ 2025-09-10  8:01 ` Vlastimil Babka
  2025-09-25 16:27   ` Suren Baghdasaryan
  2025-09-10  8:01 ` [PATCH v8 09/23] maple_tree: remove redundant __GFP_NOWARN Vlastimil Babka
                   ` (15 subsequent siblings)
  23 siblings, 1 reply; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-10  8:01 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, vbabka

Currently allocations asking for a specific node explicitly or via
mempolicy in strict_numa mode bypass percpu sheaves. Since sheaves
contain mostly local objects, we can try allocating from them if the
local node happens to be the requested node or allowed by the mempolicy.
If we find the object from percpu sheaves is not from the expected node,
we skip the sheaves - this should be rare.

Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slub.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 46 insertions(+), 7 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 9699d048b2cd08ee75c4cc3d1e460868704520b1..3746c0229cc2f9658a589416c63c21fbf2850c44 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4888,18 +4888,43 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
 }
 
 static __fastpath_inline
-void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
+void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp, int node)
 {
 	struct slub_percpu_sheaves *pcs;
+	bool node_requested;
 	void *object;
 
 #ifdef CONFIG_NUMA
-	if (static_branch_unlikely(&strict_numa)) {
-		if (current->mempolicy)
-			return NULL;
+	if (static_branch_unlikely(&strict_numa) &&
+			 node == NUMA_NO_NODE) {
+
+		struct mempolicy *mpol = current->mempolicy;
+
+		if (mpol) {
+			/*
+			 * Special BIND rule support. If the local node
+			 * is in permitted set then do not redirect
+			 * to a particular node.
+			 * Otherwise we apply the memory policy to get
+			 * the node we need to allocate on.
+			 */
+			if (mpol->mode != MPOL_BIND ||
+					!node_isset(numa_mem_id(), mpol->nodes))
+
+				node = mempolicy_slab_node();
+		}
 	}
 #endif
 
+	node_requested = IS_ENABLED(CONFIG_NUMA) && node != NUMA_NO_NODE;
+
+	/*
+	 * We assume the percpu sheaves contain only local objects although it's
+	 * not completely guaranteed, so we verify later.
+	 */
+	if (unlikely(node_requested && node != numa_mem_id()))
+		return NULL;
+
 	if (!local_trylock(&s->cpu_sheaves->lock))
 		return NULL;
 
@@ -4911,7 +4936,21 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
 			return NULL;
 	}
 
-	object = pcs->main->objects[--pcs->main->size];
+	object = pcs->main->objects[pcs->main->size - 1];
+
+	if (unlikely(node_requested)) {
+		/*
+		 * Verify that the object was from the node we want. This could
+		 * be false because of cpu migration during an unlocked part of
+		 * the current allocation or previous freeing process.
+		 */
+		if (folio_nid(virt_to_folio(object)) != node) {
+			local_unlock(&s->cpu_sheaves->lock);
+			return NULL;
+		}
+	}
+
+	pcs->main->size--;
 
 	local_unlock(&s->cpu_sheaves->lock);
 
@@ -5011,8 +5050,8 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
 	if (unlikely(object))
 		goto out;
 
-	if (s->cpu_sheaves && node == NUMA_NO_NODE)
-		object = alloc_from_pcs(s, gfpflags);
+	if (s->cpu_sheaves)
+		object = alloc_from_pcs(s, gfpflags, node);
 
 	if (!object)
 		object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);

-- 
2.51.0



^ permalink raw reply	[flat|nested] 95+ messages in thread

* [PATCH v8 09/23] maple_tree: remove redundant __GFP_NOWARN
  2025-09-10  8:01 [PATCH v8 00/23] SLUB percpu sheaves Vlastimil Babka
                   ` (7 preceding siblings ...)
  2025-09-10  8:01 ` [PATCH v8 08/23] slab: allow NUMA restricted allocations to use percpu sheaves Vlastimil Babka
@ 2025-09-10  8:01 ` Vlastimil Babka
  2025-09-10  8:01 ` [PATCH v8 10/23] tools/testing/vma: clean up stubs in vma_internal.h Vlastimil Babka
                   ` (14 subsequent siblings)
  23 siblings, 0 replies; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-10  8:01 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, vbabka, Qianfeng Rong,
	Wei Yang, Matthew Wilcox (Oracle),
	Andrew Morton

From: Qianfeng Rong <rongqianfeng@vivo.com>

Commit 16f5dfbc851b ("gfp: include __GFP_NOWARN in GFP_NOWAIT") made
GFP_NOWAIT implicitly include __GFP_NOWARN.

Therefore, explicit __GFP_NOWARN combined with GFP_NOWAIT (e.g.,
`GFP_NOWAIT | __GFP_NOWARN`) is now redundant.  Let's clean up these
redundant flags across subsystems.

No functional changes.

Link: https://lkml.kernel.org/r/20250804125657.482109-1-rongqianfeng@vivo.com
Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 lib/maple_tree.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index b4ee2d29d7a962ca374467d0533185f2db3d35ff..38fb68c082915211c80f473d313159599fe97e2c 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -1344,11 +1344,11 @@ static void mas_node_count_gfp(struct ma_state *mas, int count, gfp_t gfp)
  * @mas: The maple state
  * @count: The number of nodes needed
  *
- * Note: Uses GFP_NOWAIT | __GFP_NOWARN for gfp flags.
+ * Note: Uses GFP_NOWAIT for gfp flags.
  */
 static void mas_node_count(struct ma_state *mas, int count)
 {
-	return mas_node_count_gfp(mas, count, GFP_NOWAIT | __GFP_NOWARN);
+	return mas_node_count_gfp(mas, count, GFP_NOWAIT);
 }
 
 /*

-- 
2.51.0



^ permalink raw reply	[flat|nested] 95+ messages in thread

* [PATCH v8 10/23] tools/testing/vma: clean up stubs in vma_internal.h
  2025-09-10  8:01 [PATCH v8 00/23] SLUB percpu sheaves Vlastimil Babka
                   ` (8 preceding siblings ...)
  2025-09-10  8:01 ` [PATCH v8 09/23] maple_tree: remove redundant __GFP_NOWARN Vlastimil Babka
@ 2025-09-10  8:01 ` Vlastimil Babka
  2025-09-10  8:01 ` [PATCH v8 11/23] maple_tree: Drop bulk insert support Vlastimil Babka
                   ` (13 subsequent siblings)
  23 siblings, 0 replies; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-10  8:01 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, vbabka, Lorenzo Stoakes,
	WangYuli, Jann Horn, Andrew Morton

From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

We do not need to reference arguments just to avoid compiler warnings;
the warning in question does not arise here, so remove all of the
instances of '(void)xxx' introduced purely to avoid this warning.

As reported by WangYuli in the referenced mail, GCC 8.3 and before will
have issues compiling this file if parameter names are not provided, so
ensure these are always provided.

Finally, perform a trivial fix up of kmem_cache_alloc() which technically
has parameters in the incorrect order (as reported by Vlastimil Babka
off-list).

Link: https://lkml.kernel.org/r/20250826102824.22730-1-lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reported-by: WangYuli <wangyuli@uniontech.com>
Closes: https://lore.kernel.org/linux-mm/EFCEBE7E301589DE+20250729084700.208767-1-wangyuli@uniontech.com/
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Jann Horn <jannh@google.com>
Cc: WangYuli <wangyuli@uniontech.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 tools/testing/vma/vma_internal.h | 167 +++++++++++++--------------------------
 1 file changed, 57 insertions(+), 110 deletions(-)

diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 3639aa8dd2b06ebe5b9cfcfe6669994fd38c482d..f8cf5b184d5b51dd627ff440943a7af3c549f482 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -676,9 +676,7 @@ static inline struct kmem_cache *__kmem_cache_create(const char *name,
 
 static inline void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
 {
-	(void)gfpflags;
-
-	return calloc(s->object_size, 1);
+	return calloc(1, s->object_size);
 }
 
 static inline void kmem_cache_free(struct kmem_cache *s, void *x)
@@ -842,11 +840,11 @@ static inline unsigned long vma_pages(struct vm_area_struct *vma)
 	return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
 }
 
-static inline void fput(struct file *)
+static inline void fput(struct file *file)
 {
 }
 
-static inline void mpol_put(struct mempolicy *)
+static inline void mpol_put(struct mempolicy *pol)
 {
 }
 
@@ -854,15 +852,15 @@ static inline void lru_add_drain(void)
 {
 }
 
-static inline void tlb_gather_mmu(struct mmu_gather *, struct mm_struct *)
+static inline void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm)
 {
 }
 
-static inline void update_hiwater_rss(struct mm_struct *)
+static inline void update_hiwater_rss(struct mm_struct *mm)
 {
 }
 
-static inline void update_hiwater_vm(struct mm_struct *)
+static inline void update_hiwater_vm(struct mm_struct *mm)
 {
 }
 
@@ -871,36 +869,23 @@ static inline void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
 		      unsigned long end_addr, unsigned long tree_end,
 		      bool mm_wr_locked)
 {
-	(void)tlb;
-	(void)mas;
-	(void)vma;
-	(void)start_addr;
-	(void)end_addr;
-	(void)tree_end;
-	(void)mm_wr_locked;
 }
 
 static inline void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 		   struct vm_area_struct *vma, unsigned long floor,
 		   unsigned long ceiling, bool mm_wr_locked)
 {
-	(void)tlb;
-	(void)mas;
-	(void)vma;
-	(void)floor;
-	(void)ceiling;
-	(void)mm_wr_locked;
 }
 
-static inline void mapping_unmap_writable(struct address_space *)
+static inline void mapping_unmap_writable(struct address_space *mapping)
 {
 }
 
-static inline void flush_dcache_mmap_lock(struct address_space *)
+static inline void flush_dcache_mmap_lock(struct address_space *mapping)
 {
 }
 
-static inline void tlb_finish_mmu(struct mmu_gather *)
+static inline void tlb_finish_mmu(struct mmu_gather *tlb)
 {
 }
 
@@ -909,7 +894,7 @@ static inline struct file *get_file(struct file *f)
 	return f;
 }
 
-static inline int vma_dup_policy(struct vm_area_struct *, struct vm_area_struct *)
+static inline int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
 {
 	return 0;
 }
@@ -936,10 +921,6 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 unsigned long end,
 					 struct vm_area_struct *next)
 {
-	(void)vma;
-	(void)start;
-	(void)end;
-	(void)next;
 }
 
 static inline void hugetlb_split(struct vm_area_struct *, unsigned long) {}
@@ -959,51 +940,48 @@ static inline void vm_acct_memory(long pages)
 {
 }
 
-static inline void vma_interval_tree_insert(struct vm_area_struct *,
-					    struct rb_root_cached *)
+static inline void vma_interval_tree_insert(struct vm_area_struct *vma,
+					    struct rb_root_cached *rb)
 {
 }
 
-static inline void vma_interval_tree_remove(struct vm_area_struct *,
-					    struct rb_root_cached *)
+static inline void vma_interval_tree_remove(struct vm_area_struct *vma,
+					    struct rb_root_cached *rb)
 {
 }
 
-static inline void flush_dcache_mmap_unlock(struct address_space *)
+static inline void flush_dcache_mmap_unlock(struct address_space *mapping)
 {
 }
 
-static inline void anon_vma_interval_tree_insert(struct anon_vma_chain*,
-						 struct rb_root_cached *)
+static inline void anon_vma_interval_tree_insert(struct anon_vma_chain *avc,
+						 struct rb_root_cached *rb)
 {
 }
 
-static inline void anon_vma_interval_tree_remove(struct anon_vma_chain*,
-						 struct rb_root_cached *)
+static inline void anon_vma_interval_tree_remove(struct anon_vma_chain *avc,
+						 struct rb_root_cached *rb)
 {
 }
 
-static inline void uprobe_mmap(struct vm_area_struct *)
+static inline void uprobe_mmap(struct vm_area_struct *vma)
 {
 }
 
 static inline void uprobe_munmap(struct vm_area_struct *vma,
 				 unsigned long start, unsigned long end)
 {
-	(void)vma;
-	(void)start;
-	(void)end;
 }
 
-static inline void i_mmap_lock_write(struct address_space *)
+static inline void i_mmap_lock_write(struct address_space *mapping)
 {
 }
 
-static inline void anon_vma_lock_write(struct anon_vma *)
+static inline void anon_vma_lock_write(struct anon_vma *anon_vma)
 {
 }
 
-static inline void vma_assert_write_locked(struct vm_area_struct *)
+static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 {
 }
 
@@ -1013,16 +991,16 @@ static inline void unlink_anon_vmas(struct vm_area_struct *vma)
 	vma->anon_vma->was_unlinked = true;
 }
 
-static inline void anon_vma_unlock_write(struct anon_vma *)
+static inline void anon_vma_unlock_write(struct anon_vma *anon_vma)
 {
 }
 
-static inline void i_mmap_unlock_write(struct address_space *)
+static inline void i_mmap_unlock_write(struct address_space *mapping)
 {
 }
 
-static inline void anon_vma_merge(struct vm_area_struct *,
-				  struct vm_area_struct *)
+static inline void anon_vma_merge(struct vm_area_struct *vma,
+				  struct vm_area_struct *next)
 {
 }
 
@@ -1031,27 +1009,22 @@ static inline int userfaultfd_unmap_prep(struct vm_area_struct *vma,
 					 unsigned long end,
 					 struct list_head *unmaps)
 {
-	(void)vma;
-	(void)start;
-	(void)end;
-	(void)unmaps;
-
 	return 0;
 }
 
-static inline void mmap_write_downgrade(struct mm_struct *)
+static inline void mmap_write_downgrade(struct mm_struct *mm)
 {
 }
 
-static inline void mmap_read_unlock(struct mm_struct *)
+static inline void mmap_read_unlock(struct mm_struct *mm)
 {
 }
 
-static inline void mmap_write_unlock(struct mm_struct *)
+static inline void mmap_write_unlock(struct mm_struct *mm)
 {
 }
 
-static inline int mmap_write_lock_killable(struct mm_struct *)
+static inline int mmap_write_lock_killable(struct mm_struct *mm)
 {
 	return 0;
 }
@@ -1060,10 +1033,6 @@ static inline bool can_modify_mm(struct mm_struct *mm,
 				 unsigned long start,
 				 unsigned long end)
 {
-	(void)mm;
-	(void)start;
-	(void)end;
-
 	return true;
 }
 
@@ -1071,16 +1040,13 @@ static inline void arch_unmap(struct mm_struct *mm,
 				 unsigned long start,
 				 unsigned long end)
 {
-	(void)mm;
-	(void)start;
-	(void)end;
 }
 
-static inline void mmap_assert_locked(struct mm_struct *)
+static inline void mmap_assert_locked(struct mm_struct *mm)
 {
 }
 
-static inline bool mpol_equal(struct mempolicy *, struct mempolicy *)
+static inline bool mpol_equal(struct mempolicy *a, struct mempolicy *b)
 {
 	return true;
 }
@@ -1088,63 +1054,62 @@ static inline bool mpol_equal(struct mempolicy *, struct mempolicy *)
 static inline void khugepaged_enter_vma(struct vm_area_struct *vma,
 			  vm_flags_t vm_flags)
 {
-	(void)vma;
-	(void)vm_flags;
 }
 
-static inline bool mapping_can_writeback(struct address_space *)
+static inline bool mapping_can_writeback(struct address_space *mapping)
 {
 	return true;
 }
 
-static inline bool is_vm_hugetlb_page(struct vm_area_struct *)
+static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
 {
 	return false;
 }
 
-static inline bool vma_soft_dirty_enabled(struct vm_area_struct *)
+static inline bool vma_soft_dirty_enabled(struct vm_area_struct *vma)
 {
 	return false;
 }
 
-static inline bool userfaultfd_wp(struct vm_area_struct *)
+static inline bool userfaultfd_wp(struct vm_area_struct *vma)
 {
 	return false;
 }
 
-static inline void mmap_assert_write_locked(struct mm_struct *)
+static inline void mmap_assert_write_locked(struct mm_struct *mm)
 {
 }
 
-static inline void mutex_lock(struct mutex *)
+static inline void mutex_lock(struct mutex *lock)
 {
 }
 
-static inline void mutex_unlock(struct mutex *)
+static inline void mutex_unlock(struct mutex *lock)
 {
 }
 
-static inline bool mutex_is_locked(struct mutex *)
+static inline bool mutex_is_locked(struct mutex *lock)
 {
 	return true;
 }
 
-static inline bool signal_pending(void *)
+static inline bool signal_pending(void *p)
 {
 	return false;
 }
 
-static inline bool is_file_hugepages(struct file *)
+static inline bool is_file_hugepages(struct file *file)
 {
 	return false;
 }
 
-static inline int security_vm_enough_memory_mm(struct mm_struct *, long)
+static inline int security_vm_enough_memory_mm(struct mm_struct *mm, long pages)
 {
 	return 0;
 }
 
-static inline bool may_expand_vm(struct mm_struct *, vm_flags_t, unsigned long)
+static inline bool may_expand_vm(struct mm_struct *mm, vm_flags_t flags,
+				 unsigned long npages)
 {
 	return true;
 }
@@ -1169,7 +1134,7 @@ static inline void vm_flags_clear(struct vm_area_struct *vma,
 	vma->__vm_flags &= ~flags;
 }
 
-static inline int shmem_zero_setup(struct vm_area_struct *)
+static inline int shmem_zero_setup(struct vm_area_struct *vma)
 {
 	return 0;
 }
@@ -1179,20 +1144,20 @@ static inline void vma_set_anonymous(struct vm_area_struct *vma)
 	vma->vm_ops = NULL;
 }
 
-static inline void ksm_add_vma(struct vm_area_struct *)
+static inline void ksm_add_vma(struct vm_area_struct *vma)
 {
 }
 
-static inline void perf_event_mmap(struct vm_area_struct *)
+static inline void perf_event_mmap(struct vm_area_struct *vma)
 {
 }
 
-static inline bool vma_is_dax(struct vm_area_struct *)
+static inline bool vma_is_dax(struct vm_area_struct *vma)
 {
 	return false;
 }
 
-static inline struct vm_area_struct *get_gate_vma(struct mm_struct *)
+static inline struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
 {
 	return NULL;
 }
@@ -1217,16 +1182,16 @@ static inline void vma_set_page_prot(struct vm_area_struct *vma)
 	WRITE_ONCE(vma->vm_page_prot, vm_page_prot);
 }
 
-static inline bool arch_validate_flags(vm_flags_t)
+static inline bool arch_validate_flags(vm_flags_t flags)
 {
 	return true;
 }
 
-static inline void vma_close(struct vm_area_struct *)
+static inline void vma_close(struct vm_area_struct *vma)
 {
 }
 
-static inline int mmap_file(struct file *, struct vm_area_struct *)
+static inline int mmap_file(struct file *file, struct vm_area_struct *vma)
 {
 	return 0;
 }
@@ -1388,8 +1353,6 @@ static inline int mapping_map_writable(struct address_space *mapping)
 
 static inline unsigned long move_page_tables(struct pagetable_move_control *pmc)
 {
-	(void)pmc;
-
 	return 0;
 }
 
@@ -1397,51 +1360,36 @@ static inline void free_pgd_range(struct mmu_gather *tlb,
 			unsigned long addr, unsigned long end,
 			unsigned long floor, unsigned long ceiling)
 {
-	(void)tlb;
-	(void)addr;
-	(void)end;
-	(void)floor;
-	(void)ceiling;
 }
 
 static inline int ksm_execve(struct mm_struct *mm)
 {
-	(void)mm;
-
 	return 0;
 }
 
 static inline void ksm_exit(struct mm_struct *mm)
 {
-	(void)mm;
 }
 
 static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt)
 {
-	(void)vma;
-	(void)reset_refcnt;
 }
 
 static inline void vma_numab_state_init(struct vm_area_struct *vma)
 {
-	(void)vma;
 }
 
 static inline void vma_numab_state_free(struct vm_area_struct *vma)
 {
-	(void)vma;
 }
 
 static inline void dup_anon_vma_name(struct vm_area_struct *orig_vma,
 				     struct vm_area_struct *new_vma)
 {
-	(void)orig_vma;
-	(void)new_vma;
 }
 
 static inline void free_anon_vma_name(struct vm_area_struct *vma)
 {
-	(void)vma;
 }
 
 /* Declared in vma.h. */
@@ -1495,7 +1443,6 @@ static inline int vfs_mmap_prepare(struct file *file, struct vm_area_desc *desc)
 
 static inline void fixup_hugetlb_reservations(struct vm_area_struct *vma)
 {
-	(void)vma;
 }
 
 static inline void vma_set_file(struct vm_area_struct *vma, struct file *file)
@@ -1506,13 +1453,13 @@ static inline void vma_set_file(struct vm_area_struct *vma, struct file *file)
 	fput(file);
 }
 
-static inline bool shmem_file(struct file *)
+static inline bool shmem_file(struct file *file)
 {
 	return false;
 }
 
-static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct file *,
-			 vm_flags_t vm_flags)
+static inline vm_flags_t ksm_vma_flags(const struct mm_struct *mm,
+		const struct file *file, vm_flags_t vm_flags)
 {
 	return vm_flags;
 }

-- 
2.51.0



^ permalink raw reply	[flat|nested] 95+ messages in thread

* [PATCH v8 11/23] maple_tree: Drop bulk insert support
  2025-09-10  8:01 [PATCH v8 00/23] SLUB percpu sheaves Vlastimil Babka
                   ` (9 preceding siblings ...)
  2025-09-10  8:01 ` [PATCH v8 10/23] tools/testing/vma: clean up stubs in vma_internal.h Vlastimil Babka
@ 2025-09-10  8:01 ` Vlastimil Babka
  2025-09-25 16:38   ` Suren Baghdasaryan
  2025-09-10  8:01 ` [PATCH v8 12/23] tools/testing/vma: Implement vm_refcnt reset Vlastimil Babka
                   ` (12 subsequent siblings)
  23 siblings, 1 reply; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-10  8:01 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, vbabka

From: "Liam R. Howlett" <Liam.Howlett@oracle.com>

Bulk insert mode was added to facilitate forking faster, but forking now
uses __mt_dup() to duplicate the tree.

The addition of sheaves has made the bulk allocations difficult to
maintain, since the expected entries would be preallocated into the
maple state.  A big part of the maple state node allocation was the
ability to push nodes back onto the state for later use, which was
essential to the bulk insert algorithm.

Remove mas_expected_entries() and mas_destroy_rebalance() functions as
well as the MA_STATE_BULK and MA_STATE_REBALANCE maple state flags since
there are no users anymore.  Drop the associated testing as well.
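
For context, a sketch of how the removed interface was used, based on
the deleted tests (tree and nr_entries are placeholders; this pattern
is no longer supported after this patch):

  MA_STATE(mas, &tree, 0, 0);
  unsigned long i;

  /* Old bulk API: preallocate nodes for an in-order load. */
  if (mas_expected_entries(&mas, nr_entries))
          return -ENOMEM;

  for (i = 0; i < nr_entries; i++) {
          mas.index = mas.last = i;
          mas_store(&mas, xa_mk_value(i));
  }

  /* Free unused preallocations and fix up the final node. */
  mas_destroy(&mas);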

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 lib/maple_tree.c                 | 270 +--------------------------------------
 lib/test_maple_tree.c            | 137 --------------------
 tools/testing/radix-tree/maple.c |  36 ------
 3 files changed, 4 insertions(+), 439 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 38fb68c082915211c80f473d313159599fe97e2c..4f0e30b57b0cef9e5cf791f3f64f5898752db402 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -83,13 +83,9 @@
 
 /*
  * Maple state flags
- * * MA_STATE_BULK		- Bulk insert mode
- * * MA_STATE_REBALANCE		- Indicate a rebalance during bulk insert
  * * MA_STATE_PREALLOC		- Preallocated nodes, WARN_ON allocation
  */
-#define MA_STATE_BULK		1
-#define MA_STATE_REBALANCE	2
-#define MA_STATE_PREALLOC	4
+#define MA_STATE_PREALLOC	1
 
 #define ma_parent_ptr(x) ((struct maple_pnode *)(x))
 #define mas_tree_parent(x) ((unsigned long)(x->tree) | MA_ROOT_PARENT)
@@ -1031,24 +1027,6 @@ static inline void mas_descend(struct ma_state *mas)
 	mas->node = mas_slot(mas, slots, mas->offset);
 }
 
-/*
- * mte_set_gap() - Set a maple node gap.
- * @mn: The encoded maple node
- * @gap: The offset of the gap to set
- * @val: The gap value
- */
-static inline void mte_set_gap(const struct maple_enode *mn,
-				 unsigned char gap, unsigned long val)
-{
-	switch (mte_node_type(mn)) {
-	default:
-		break;
-	case maple_arange_64:
-		mte_to_node(mn)->ma64.gap[gap] = val;
-		break;
-	}
-}
-
 /*
  * mas_ascend() - Walk up a level of the tree.
  * @mas: The maple state
@@ -1878,21 +1856,7 @@ static inline int mab_calc_split(struct ma_state *mas,
 	 * end on a NULL entry, with the exception of the left-most leaf.  The
 	 * limitation means that the split of a node must be checked for this condition
 	 * and be able to put more data in one direction or the other.
-	 */
-	if (unlikely((mas->mas_flags & MA_STATE_BULK))) {
-		*mid_split = 0;
-		split = b_end - mt_min_slots[bn->type];
-
-		if (!ma_is_leaf(bn->type))
-			return split;
-
-		mas->mas_flags |= MA_STATE_REBALANCE;
-		if (!bn->slot[split])
-			split--;
-		return split;
-	}
-
-	/*
+	 *
 	 * Although extremely rare, it is possible to enter what is known as the 3-way
 	 * split scenario.  The 3-way split comes about by means of a store of a range
 	 * that overwrites the end and beginning of two full nodes.  The result is a set
@@ -2039,27 +2003,6 @@ static inline void mab_mas_cp(struct maple_big_node *b_node,
 	}
 }
 
-/*
- * mas_bulk_rebalance() - Rebalance the end of a tree after a bulk insert.
- * @mas: The maple state
- * @end: The maple node end
- * @mt: The maple node type
- */
-static inline void mas_bulk_rebalance(struct ma_state *mas, unsigned char end,
-				      enum maple_type mt)
-{
-	if (!(mas->mas_flags & MA_STATE_BULK))
-		return;
-
-	if (mte_is_root(mas->node))
-		return;
-
-	if (end > mt_min_slots[mt]) {
-		mas->mas_flags &= ~MA_STATE_REBALANCE;
-		return;
-	}
-}
-
 /*
  * mas_store_b_node() - Store an @entry into the b_node while also copying the
  * data from a maple encoded node.
@@ -2109,9 +2052,6 @@ static noinline_for_kasan void mas_store_b_node(struct ma_wr_state *wr_mas,
 	/* Handle new range ending before old range ends */
 	piv = mas_safe_pivot(mas, wr_mas->pivots, offset_end, wr_mas->type);
 	if (piv > mas->last) {
-		if (piv == ULONG_MAX)
-			mas_bulk_rebalance(mas, b_node->b_end, wr_mas->type);
-
 		if (offset_end != slot)
 			wr_mas->content = mas_slot_locked(mas, wr_mas->slots,
 							  offset_end);
@@ -3011,126 +2951,6 @@ static inline void mas_rebalance(struct ma_state *mas,
 	return mas_spanning_rebalance(mas, &mast, empty_count);
 }
 
-/*
- * mas_destroy_rebalance() - Rebalance left-most node while destroying the maple
- * state.
- * @mas: The maple state
- * @end: The end of the left-most node.
- *
- * During a mass-insert event (such as forking), it may be necessary to
- * rebalance the left-most node when it is not sufficient.
- */
-static inline void mas_destroy_rebalance(struct ma_state *mas, unsigned char end)
-{
-	enum maple_type mt = mte_node_type(mas->node);
-	struct maple_node reuse, *newnode, *parent, *new_left, *left, *node;
-	struct maple_enode *eparent, *old_eparent;
-	unsigned char offset, tmp, split = mt_slots[mt] / 2;
-	void __rcu **l_slots, **slots;
-	unsigned long *l_pivs, *pivs, gap;
-	bool in_rcu = mt_in_rcu(mas->tree);
-	unsigned char new_height = mas_mt_height(mas);
-
-	MA_STATE(l_mas, mas->tree, mas->index, mas->last);
-
-	l_mas = *mas;
-	mas_prev_sibling(&l_mas);
-
-	/* set up node. */
-	if (in_rcu) {
-		newnode = mas_pop_node(mas);
-	} else {
-		newnode = &reuse;
-	}
-
-	node = mas_mn(mas);
-	newnode->parent = node->parent;
-	slots = ma_slots(newnode, mt);
-	pivs = ma_pivots(newnode, mt);
-	left = mas_mn(&l_mas);
-	l_slots = ma_slots(left, mt);
-	l_pivs = ma_pivots(left, mt);
-	if (!l_slots[split])
-		split++;
-	tmp = mas_data_end(&l_mas) - split;
-
-	memcpy(slots, l_slots + split + 1, sizeof(void *) * tmp);
-	memcpy(pivs, l_pivs + split + 1, sizeof(unsigned long) * tmp);
-	pivs[tmp] = l_mas.max;
-	memcpy(slots + tmp, ma_slots(node, mt), sizeof(void *) * end);
-	memcpy(pivs + tmp, ma_pivots(node, mt), sizeof(unsigned long) * end);
-
-	l_mas.max = l_pivs[split];
-	mas->min = l_mas.max + 1;
-	old_eparent = mt_mk_node(mte_parent(l_mas.node),
-			     mas_parent_type(&l_mas, l_mas.node));
-	tmp += end;
-	if (!in_rcu) {
-		unsigned char max_p = mt_pivots[mt];
-		unsigned char max_s = mt_slots[mt];
-
-		if (tmp < max_p)
-			memset(pivs + tmp, 0,
-			       sizeof(unsigned long) * (max_p - tmp));
-
-		if (tmp < mt_slots[mt])
-			memset(slots + tmp, 0, sizeof(void *) * (max_s - tmp));
-
-		memcpy(node, newnode, sizeof(struct maple_node));
-		ma_set_meta(node, mt, 0, tmp - 1);
-		mte_set_pivot(old_eparent, mte_parent_slot(l_mas.node),
-			      l_pivs[split]);
-
-		/* Remove data from l_pivs. */
-		tmp = split + 1;
-		memset(l_pivs + tmp, 0, sizeof(unsigned long) * (max_p - tmp));
-		memset(l_slots + tmp, 0, sizeof(void *) * (max_s - tmp));
-		ma_set_meta(left, mt, 0, split);
-		eparent = old_eparent;
-
-		goto done;
-	}
-
-	/* RCU requires replacing both l_mas, mas, and parent. */
-	mas->node = mt_mk_node(newnode, mt);
-	ma_set_meta(newnode, mt, 0, tmp);
-
-	new_left = mas_pop_node(mas);
-	new_left->parent = left->parent;
-	mt = mte_node_type(l_mas.node);
-	slots = ma_slots(new_left, mt);
-	pivs = ma_pivots(new_left, mt);
-	memcpy(slots, l_slots, sizeof(void *) * split);
-	memcpy(pivs, l_pivs, sizeof(unsigned long) * split);
-	ma_set_meta(new_left, mt, 0, split);
-	l_mas.node = mt_mk_node(new_left, mt);
-
-	/* replace parent. */
-	offset = mte_parent_slot(mas->node);
-	mt = mas_parent_type(&l_mas, l_mas.node);
-	parent = mas_pop_node(mas);
-	slots = ma_slots(parent, mt);
-	pivs = ma_pivots(parent, mt);
-	memcpy(parent, mte_to_node(old_eparent), sizeof(struct maple_node));
-	rcu_assign_pointer(slots[offset], mas->node);
-	rcu_assign_pointer(slots[offset - 1], l_mas.node);
-	pivs[offset - 1] = l_mas.max;
-	eparent = mt_mk_node(parent, mt);
-done:
-	gap = mas_leaf_max_gap(mas);
-	mte_set_gap(eparent, mte_parent_slot(mas->node), gap);
-	gap = mas_leaf_max_gap(&l_mas);
-	mte_set_gap(eparent, mte_parent_slot(l_mas.node), gap);
-	mas_ascend(mas);
-
-	if (in_rcu) {
-		mas_replace_node(mas, old_eparent, new_height);
-		mas_adopt_children(mas, mas->node);
-	}
-
-	mas_update_gap(mas);
-}
-
 /*
  * mas_split_final_node() - Split the final node in a subtree operation.
  * @mast: the maple subtree state
@@ -3837,8 +3657,6 @@ static inline void mas_wr_node_store(struct ma_wr_state *wr_mas,
 
 	if (mas->last == wr_mas->end_piv)
 		offset_end++; /* don't copy this offset */
-	else if (unlikely(wr_mas->r_max == ULONG_MAX))
-		mas_bulk_rebalance(mas, mas->end, wr_mas->type);
 
 	/* set up node. */
 	if (in_rcu) {
@@ -4255,7 +4073,7 @@ static inline enum store_type mas_wr_store_type(struct ma_wr_state *wr_mas)
 	new_end = mas_wr_new_end(wr_mas);
 	/* Potential spanning rebalance collapsing a node */
 	if (new_end < mt_min_slots[wr_mas->type]) {
-		if (!mte_is_root(mas->node) && !(mas->mas_flags & MA_STATE_BULK))
+		if (!mte_is_root(mas->node))
 			return  wr_rebalance;
 		return wr_node_store;
 	}
@@ -5562,25 +5380,7 @@ void mas_destroy(struct ma_state *mas)
 	struct maple_alloc *node;
 	unsigned long total;
 
-	/*
-	 * When using mas_for_each() to insert an expected number of elements,
-	 * it is possible that the number inserted is less than the expected
-	 * number.  To fix an invalid final node, a check is performed here to
-	 * rebalance the previous node with the final node.
-	 */
-	if (mas->mas_flags & MA_STATE_REBALANCE) {
-		unsigned char end;
-		if (mas_is_err(mas))
-			mas_reset(mas);
-		mas_start(mas);
-		mtree_range_walk(mas);
-		end = mas->end + 1;
-		if (end < mt_min_slot_count(mas->node) - 1)
-			mas_destroy_rebalance(mas, end);
-
-		mas->mas_flags &= ~MA_STATE_REBALANCE;
-	}
-	mas->mas_flags &= ~(MA_STATE_BULK|MA_STATE_PREALLOC);
+	mas->mas_flags &= ~MA_STATE_PREALLOC;
 
 	total = mas_allocated(mas);
 	while (total) {
@@ -5600,68 +5400,6 @@ void mas_destroy(struct ma_state *mas)
 }
 EXPORT_SYMBOL_GPL(mas_destroy);
 
-/*
- * mas_expected_entries() - Set the expected number of entries that will be inserted.
- * @mas: The maple state
- * @nr_entries: The number of expected entries.
- *
- * This will attempt to pre-allocate enough nodes to store the expected number
- * of entries.  The allocations will occur using the bulk allocator interface
- * for speed.  Please call mas_destroy() on the @mas after inserting the entries
- * to ensure any unused nodes are freed.
- *
- * Return: 0 on success, -ENOMEM if memory could not be allocated.
- */
-int mas_expected_entries(struct ma_state *mas, unsigned long nr_entries)
-{
-	int nonleaf_cap = MAPLE_ARANGE64_SLOTS - 2;
-	struct maple_enode *enode = mas->node;
-	int nr_nodes;
-	int ret;
-
-	/*
-	 * Sometimes it is necessary to duplicate a tree to a new tree, such as
-	 * forking a process and duplicating the VMAs from one tree to a new
-	 * tree.  When such a situation arises, it is known that the new tree is
-	 * not going to be used until the entire tree is populated.  For
-	 * performance reasons, it is best to use a bulk load with RCU disabled.
-	 * This allows for optimistic splitting that favours the left and reuse
-	 * of nodes during the operation.
-	 */
-
-	/* Optimize splitting for bulk insert in-order */
-	mas->mas_flags |= MA_STATE_BULK;
-
-	/*
-	 * Avoid overflow, assume a gap between each entry and a trailing null.
-	 * If this is wrong, it just means allocation can happen during
-	 * insertion of entries.
-	 */
-	nr_nodes = max(nr_entries, nr_entries * 2 + 1);
-	if (!mt_is_alloc(mas->tree))
-		nonleaf_cap = MAPLE_RANGE64_SLOTS - 2;
-
-	/* Leaves; reduce slots to keep space for expansion */
-	nr_nodes = DIV_ROUND_UP(nr_nodes, MAPLE_RANGE64_SLOTS - 2);
-	/* Internal nodes */
-	nr_nodes += DIV_ROUND_UP(nr_nodes, nonleaf_cap);
-	/* Add working room for split (2 nodes) + new parents */
-	mas_node_count_gfp(mas, nr_nodes + 3, GFP_KERNEL);
-
-	/* Detect if allocations run out */
-	mas->mas_flags |= MA_STATE_PREALLOC;
-
-	if (!mas_is_err(mas))
-		return 0;
-
-	ret = xa_err(mas->node);
-	mas->node = enode;
-	mas_destroy(mas);
-	return ret;
-
-}
-EXPORT_SYMBOL_GPL(mas_expected_entries);
-
 static void mas_may_activate(struct ma_state *mas)
 {
 	if (!mas->node) {
diff --git a/lib/test_maple_tree.c b/lib/test_maple_tree.c
index cb3936595b0d56a9682ff100eba54693a1427829..14fbbee32046a13d54d60dcac2b45be2bd190ac4 100644
--- a/lib/test_maple_tree.c
+++ b/lib/test_maple_tree.c
@@ -2746,139 +2746,6 @@ static noinline void __init check_fuzzer(struct maple_tree *mt)
 	mtree_test_erase(mt, ULONG_MAX - 10);
 }
 
-/* duplicate the tree with a specific gap */
-static noinline void __init check_dup_gaps(struct maple_tree *mt,
-				    unsigned long nr_entries, bool zero_start,
-				    unsigned long gap)
-{
-	unsigned long i = 0;
-	struct maple_tree newmt;
-	int ret;
-	void *tmp;
-	MA_STATE(mas, mt, 0, 0);
-	MA_STATE(newmas, &newmt, 0, 0);
-	struct rw_semaphore newmt_lock;
-
-	init_rwsem(&newmt_lock);
-	mt_set_external_lock(&newmt, &newmt_lock);
-
-	if (!zero_start)
-		i = 1;
-
-	mt_zero_nr_tallocated();
-	for (; i <= nr_entries; i++)
-		mtree_store_range(mt, i*10, (i+1)*10 - gap,
-				  xa_mk_value(i), GFP_KERNEL);
-
-	mt_init_flags(&newmt, MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN);
-	mt_set_non_kernel(99999);
-	down_write(&newmt_lock);
-	ret = mas_expected_entries(&newmas, nr_entries);
-	mt_set_non_kernel(0);
-	MT_BUG_ON(mt, ret != 0);
-
-	rcu_read_lock();
-	mas_for_each(&mas, tmp, ULONG_MAX) {
-		newmas.index = mas.index;
-		newmas.last = mas.last;
-		mas_store(&newmas, tmp);
-	}
-	rcu_read_unlock();
-	mas_destroy(&newmas);
-
-	__mt_destroy(&newmt);
-	up_write(&newmt_lock);
-}
-
-/* Duplicate many sizes of trees.  Mainly to test expected entry values */
-static noinline void __init check_dup(struct maple_tree *mt)
-{
-	int i;
-	int big_start = 100010;
-
-	/* Check with a value at zero */
-	for (i = 10; i < 1000; i++) {
-		mt_init_flags(mt, MT_FLAGS_ALLOC_RANGE);
-		check_dup_gaps(mt, i, true, 5);
-		mtree_destroy(mt);
-		rcu_barrier();
-	}
-
-	cond_resched();
-	mt_cache_shrink();
-	/* Check with a value at zero, no gap */
-	for (i = 1000; i < 2000; i++) {
-		mt_init_flags(mt, MT_FLAGS_ALLOC_RANGE);
-		check_dup_gaps(mt, i, true, 0);
-		mtree_destroy(mt);
-		rcu_barrier();
-	}
-
-	cond_resched();
-	mt_cache_shrink();
-	/* Check with a value at zero and unreasonably large */
-	for (i = big_start; i < big_start + 10; i++) {
-		mt_init_flags(mt, MT_FLAGS_ALLOC_RANGE);
-		check_dup_gaps(mt, i, true, 5);
-		mtree_destroy(mt);
-		rcu_barrier();
-	}
-
-	cond_resched();
-	mt_cache_shrink();
-	/* Small to medium size not starting at zero*/
-	for (i = 200; i < 1000; i++) {
-		mt_init_flags(mt, MT_FLAGS_ALLOC_RANGE);
-		check_dup_gaps(mt, i, false, 5);
-		mtree_destroy(mt);
-		rcu_barrier();
-	}
-
-	cond_resched();
-	mt_cache_shrink();
-	/* Unreasonably large not starting at zero*/
-	for (i = big_start; i < big_start + 10; i++) {
-		mt_init_flags(mt, MT_FLAGS_ALLOC_RANGE);
-		check_dup_gaps(mt, i, false, 5);
-		mtree_destroy(mt);
-		rcu_barrier();
-		cond_resched();
-		mt_cache_shrink();
-	}
-
-	/* Check non-allocation tree not starting at zero */
-	for (i = 1500; i < 3000; i++) {
-		mt_init_flags(mt, 0);
-		check_dup_gaps(mt, i, false, 5);
-		mtree_destroy(mt);
-		rcu_barrier();
-		cond_resched();
-		if (i % 2 == 0)
-			mt_cache_shrink();
-	}
-
-	mt_cache_shrink();
-	/* Check non-allocation tree starting at zero */
-	for (i = 200; i < 1000; i++) {
-		mt_init_flags(mt, 0);
-		check_dup_gaps(mt, i, true, 5);
-		mtree_destroy(mt);
-		rcu_barrier();
-		cond_resched();
-	}
-
-	mt_cache_shrink();
-	/* Unreasonably large */
-	for (i = big_start + 5; i < big_start + 10; i++) {
-		mt_init_flags(mt, 0);
-		check_dup_gaps(mt, i, true, 5);
-		mtree_destroy(mt);
-		rcu_barrier();
-		mt_cache_shrink();
-		cond_resched();
-	}
-}
-
 static noinline void __init check_bnode_min_spanning(struct maple_tree *mt)
 {
 	int i = 50;
@@ -4077,10 +3944,6 @@ static int __init maple_tree_seed(void)
 	check_fuzzer(&tree);
 	mtree_destroy(&tree);
 
-	mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
-	check_dup(&tree);
-	mtree_destroy(&tree);
-
 	mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
 	check_bnode_min_spanning(&tree);
 	mtree_destroy(&tree);
diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
index 172700fb7784d29f9403003b4484a5ebd7aa316b..c0543060dae2510477963331fb0ccdffd78ea965 100644
--- a/tools/testing/radix-tree/maple.c
+++ b/tools/testing/radix-tree/maple.c
@@ -35455,17 +35455,6 @@ static void check_dfs_preorder(struct maple_tree *mt)
 	MT_BUG_ON(mt, count != e);
 	mtree_destroy(mt);
 
-	mt_init_flags(mt, MT_FLAGS_ALLOC_RANGE);
-	mas_reset(&mas);
-	mt_zero_nr_tallocated();
-	mt_set_non_kernel(200);
-	mas_expected_entries(&mas, max);
-	for (count = 0; count <= max; count++) {
-		mas.index = mas.last = count;
-		mas_store(&mas, xa_mk_value(count));
-		MT_BUG_ON(mt, mas_is_err(&mas));
-	}
-	mas_destroy(&mas);
 	rcu_barrier();
 	/*
 	 * pr_info(" ->seq test of 0-%lu %luK in %d active (%d total)\n",
@@ -36454,27 +36443,6 @@ static inline int check_vma_modification(struct maple_tree *mt)
 	return 0;
 }
 
-/*
- * test to check that bulk stores do not use wr_rebalance as the store
- * type.
- */
-static inline void check_bulk_rebalance(struct maple_tree *mt)
-{
-	MA_STATE(mas, mt, ULONG_MAX, ULONG_MAX);
-	int max = 10;
-
-	build_full_tree(mt, 0, 2);
-
-	/* erase every entry in the tree */
-	do {
-		/* set up bulk store mode */
-		mas_expected_entries(&mas, max);
-		mas_erase(&mas);
-		MT_BUG_ON(mt, mas.store_type == wr_rebalance);
-	} while (mas_prev(&mas, 0) != NULL);
-
-	mas_destroy(&mas);
-}
 
 void farmer_tests(void)
 {
@@ -36487,10 +36455,6 @@ void farmer_tests(void)
 	check_vma_modification(&tree);
 	mtree_destroy(&tree);
 
-	mt_init(&tree);
-	check_bulk_rebalance(&tree);
-	mtree_destroy(&tree);
-
 	tree.ma_root = xa_mk_value(0);
 	mt_dump(&tree, mt_dump_dec);
 

-- 
2.51.0



^ permalink raw reply	[flat|nested] 95+ messages in thread

* [PATCH v8 12/23] tools/testing/vma: Implement vm_refcnt reset
  2025-09-10  8:01 [PATCH v8 00/23] SLUB percpu sheaves Vlastimil Babka
                   ` (10 preceding siblings ...)
  2025-09-10  8:01 ` [PATCH v8 11/23] maple_tree: Drop bulk insert support Vlastimil Babka
@ 2025-09-10  8:01 ` Vlastimil Babka
  2025-09-25 16:38   ` Suren Baghdasaryan
  2025-09-10  8:01 ` [PATCH v8 13/23] tools/testing: Add support for changes to slab for sheaves Vlastimil Babka
                   ` (11 subsequent siblings)
  23 siblings, 1 reply; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-10  8:01 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, vbabka

From: "Liam R. Howlett" <Liam.Howlett@oracle.com>

Add the reset of the ref count in vma_lock_init().  This is needed if
the vma memory is not zeroed on allocation.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 tools/testing/vma/vma_internal.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index f8cf5b184d5b51dd627ff440943a7af3c549f482..6b6e2b05918c9f95b537f26e20a943b34082825a 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -1373,6 +1373,8 @@ static inline void ksm_exit(struct mm_struct *mm)
 
 static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt)
 {
+	if (reset_refcnt)
+		refcount_set(&vma->vm_refcnt, 0);
 }
 
 static inline void vma_numab_state_init(struct vm_area_struct *vma)

-- 
2.51.0



^ permalink raw reply	[flat|nested] 95+ messages in thread

* [PATCH v8 13/23] tools/testing: Add support for changes to slab for sheaves
  2025-09-10  8:01 [PATCH v8 00/23] SLUB percpu sheaves Vlastimil Babka
                   ` (11 preceding siblings ...)
  2025-09-10  8:01 ` [PATCH v8 12/23] tools/testing/vma: Implement vm_refcnt reset Vlastimil Babka
@ 2025-09-10  8:01 ` Vlastimil Babka
  2025-09-26 23:28   ` Suren Baghdasaryan
  2025-09-10  8:01 ` [PATCH v8 14/23] mm, vma: use percpu sheaves for vm_area_struct cache Vlastimil Babka
                   ` (10 subsequent siblings)
  23 siblings, 1 reply; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-10  8:01 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, vbabka, Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

The slab changes for sheaves require more effort in the testing code.
Unite all the kmem_cache work into the tools/include slab header for
both the vma and maple tree testing.

The vma test code also requires importing more #defines to allow for
seamless use of the shared kmem_cache code.

This adds the pthread header to the slab header in the tools directory
to allow for the pthread_mutex in linux.c.
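
As a usage sketch of the resulting kmem_cache_create() dispatch (cache
names, struct foo and the numbers are placeholders, not part of this
patch):

  struct kmem_cache_args args = {
          .align          = 64,
          .sheaf_capacity = 32,
  };

  /* struct kmem_cache_args * selects __kmem_cache_create_args() */
  struct kmem_cache *a = kmem_cache_create("test_a", sizeof(struct foo),
                                           &args, SLAB_PANIC);

  /* the legacy five-argument form selects __kmem_cache_create() */
  struct kmem_cache *b = kmem_cache_create("test_b", sizeof(struct foo),
                                           64, SLAB_PANIC, NULL);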

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 tools/include/linux/slab.h        | 137 ++++++++++++++++++++++++++++++++++++--
 tools/testing/shared/linux.c      |  26 ++------
 tools/testing/shared/maple-shim.c |   1 +
 tools/testing/vma/vma_internal.h  |  92 +------------------------
 4 files changed, 142 insertions(+), 114 deletions(-)

diff --git a/tools/include/linux/slab.h b/tools/include/linux/slab.h
index c87051e2b26f5a7fee0362697fae067076b8e84d..c5c5cc6db5668be2cc94c29065ccfa7ca7b4bb08 100644
--- a/tools/include/linux/slab.h
+++ b/tools/include/linux/slab.h
@@ -4,11 +4,31 @@
 
 #include <linux/types.h>
 #include <linux/gfp.h>
+#include <pthread.h>
 
-#define SLAB_PANIC 2
 #define SLAB_RECLAIM_ACCOUNT    0x00020000UL            /* Objects are reclaimable */
 
 #define kzalloc_node(size, flags, node) kmalloc(size, flags)
+enum _slab_flag_bits {
+	_SLAB_KMALLOC,
+	_SLAB_HWCACHE_ALIGN,
+	_SLAB_PANIC,
+	_SLAB_TYPESAFE_BY_RCU,
+	_SLAB_ACCOUNT,
+	_SLAB_FLAGS_LAST_BIT
+};
+
+#define __SLAB_FLAG_BIT(nr)	((unsigned int __force)(1U << (nr)))
+#define __SLAB_FLAG_UNUSED	((unsigned int __force)(0U))
+
+#define SLAB_HWCACHE_ALIGN	__SLAB_FLAG_BIT(_SLAB_HWCACHE_ALIGN)
+#define SLAB_PANIC		__SLAB_FLAG_BIT(_SLAB_PANIC)
+#define SLAB_TYPESAFE_BY_RCU	__SLAB_FLAG_BIT(_SLAB_TYPESAFE_BY_RCU)
+#ifdef CONFIG_MEMCG
+# define SLAB_ACCOUNT		__SLAB_FLAG_BIT(_SLAB_ACCOUNT)
+#else
+# define SLAB_ACCOUNT		__SLAB_FLAG_UNUSED
+#endif
 
 void *kmalloc(size_t size, gfp_t gfp);
 void kfree(void *p);
@@ -23,6 +43,86 @@ enum slab_state {
 	FULL
 };
 
+struct kmem_cache {
+	pthread_mutex_t lock;
+	unsigned int size;
+	unsigned int align;
+	unsigned int sheaf_capacity;
+	int nr_objs;
+	void *objs;
+	void (*ctor)(void *);
+	bool non_kernel_enabled;
+	unsigned int non_kernel;
+	unsigned long nr_allocated;
+	unsigned long nr_tallocated;
+	bool exec_callback;
+	void (*callback)(void *);
+	void *private;
+};
+
+struct kmem_cache_args {
+	/**
+	 * @align: The required alignment for the objects.
+	 *
+	 * %0 means no specific alignment is requested.
+	 */
+	unsigned int align;
+	/**
+	 * @sheaf_capacity: The maximum size of the sheaf.
+	 */
+	unsigned int sheaf_capacity;
+	/**
+	 * @useroffset: Usercopy region offset.
+	 *
+	 * %0 is a valid offset, when @usersize is non-%0
+	 */
+	unsigned int useroffset;
+	/**
+	 * @usersize: Usercopy region size.
+	 *
+	 * %0 means no usercopy region is specified.
+	 */
+	unsigned int usersize;
+	/**
+	 * @freeptr_offset: Custom offset for the free pointer
+	 * in &SLAB_TYPESAFE_BY_RCU caches
+	 *
+	 * By default &SLAB_TYPESAFE_BY_RCU caches place the free pointer
+	 * outside of the object. This might cause the object to grow in size.
+	 * Cache creators that have a reason to avoid this can specify a custom
+	 * free pointer offset in their struct where the free pointer will be
+	 * placed.
+	 *
+	 * Note that placing the free pointer inside the object requires the
+	 * caller to ensure that no fields are invalidated that are required to
+	 * guard against object recycling (See &SLAB_TYPESAFE_BY_RCU for
+	 * details).
+	 *
+	 * Using %0 as a value for @freeptr_offset is valid. If @freeptr_offset
+	 * is specified, %use_freeptr_offset must be set %true.
+	 *
+	 * Note that @ctor currently isn't supported with custom free pointers
+	 * as a @ctor requires an external free pointer.
+	 */
+	unsigned int freeptr_offset;
+	/**
+	 * @use_freeptr_offset: Whether a @freeptr_offset is used.
+	 */
+	bool use_freeptr_offset;
+	/**
+	 * @ctor: A constructor for the objects.
+	 *
+	 * The constructor is invoked for each object in a newly allocated slab
+	 * page. It is the cache user's responsibility to free object in the
+	 * same state as after calling the constructor, or deal appropriately
+	 * with any differences between a freshly constructed and a reallocated
+	 * object.
+	 *
+	 * %NULL means no constructor.
+	 */
+	void (*ctor)(void *);
+};
+
 static inline void *kzalloc(size_t size, gfp_t gfp)
 {
 	return kmalloc(size, gfp | __GFP_ZERO);
@@ -37,9 +137,38 @@ static inline void *kmem_cache_alloc(struct kmem_cache *cachep, int flags)
 }
 void kmem_cache_free(struct kmem_cache *cachep, void *objp);
 
-struct kmem_cache *kmem_cache_create(const char *name, unsigned int size,
-			unsigned int align, unsigned int flags,
-			void (*ctor)(void *));
+
+struct kmem_cache *
+__kmem_cache_create_args(const char *name, unsigned int size,
+		struct kmem_cache_args *args, unsigned int flags);
+
+/* If NULL is passed for @args, use this variant with default arguments. */
+static inline struct kmem_cache *
+__kmem_cache_default_args(const char *name, unsigned int size,
+		struct kmem_cache_args *args, unsigned int flags)
+{
+	struct kmem_cache_args kmem_default_args = {};
+
+	return __kmem_cache_create_args(name, size, &kmem_default_args, flags);
+}
+
+static inline struct kmem_cache *
+__kmem_cache_create(const char *name, unsigned int size, unsigned int align,
+		unsigned int flags, void (*ctor)(void *))
+{
+	struct kmem_cache_args kmem_args = {
+		.align	= align,
+		.ctor	= ctor,
+	};
+
+	return __kmem_cache_create_args(name, size, &kmem_args, flags);
+}
+
+#define kmem_cache_create(__name, __object_size, __args, ...)           \
+	_Generic((__args),                                              \
+		struct kmem_cache_args *: __kmem_cache_create_args,	\
+		void *: __kmem_cache_default_args,			\
+		default: __kmem_cache_create)(__name, __object_size, __args, __VA_ARGS__)
 
 void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t size, void **list);
 int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
diff --git a/tools/testing/shared/linux.c b/tools/testing/shared/linux.c
index 0f97fb0d19e19c327aa4843a35b45cc086f4f366..97b8412ccbb6d222604c7b397c53c65618d8d51b 100644
--- a/tools/testing/shared/linux.c
+++ b/tools/testing/shared/linux.c
@@ -16,21 +16,6 @@ int nr_allocated;
 int preempt_count;
 int test_verbose;
 
-struct kmem_cache {
-	pthread_mutex_t lock;
-	unsigned int size;
-	unsigned int align;
-	int nr_objs;
-	void *objs;
-	void (*ctor)(void *);
-	unsigned int non_kernel;
-	unsigned long nr_allocated;
-	unsigned long nr_tallocated;
-	bool exec_callback;
-	void (*callback)(void *);
-	void *private;
-};
-
 void kmem_cache_set_callback(struct kmem_cache *cachep, void (*callback)(void *))
 {
 	cachep->callback = callback;
@@ -234,23 +219,26 @@ int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
 }
 
 struct kmem_cache *
-kmem_cache_create(const char *name, unsigned int size, unsigned int align,
-		unsigned int flags, void (*ctor)(void *))
+__kmem_cache_create_args(const char *name, unsigned int size,
+			  struct kmem_cache_args *args,
+			  unsigned int flags)
 {
 	struct kmem_cache *ret = malloc(sizeof(*ret));
 
 	pthread_mutex_init(&ret->lock, NULL);
 	ret->size = size;
-	ret->align = align;
+	ret->align = args->align;
+	ret->sheaf_capacity = args->sheaf_capacity;
 	ret->nr_objs = 0;
 	ret->nr_allocated = 0;
 	ret->nr_tallocated = 0;
 	ret->objs = NULL;
-	ret->ctor = ctor;
+	ret->ctor = args->ctor;
 	ret->non_kernel = 0;
 	ret->exec_callback = false;
 	ret->callback = NULL;
 	ret->private = NULL;
+
 	return ret;
 }
 
diff --git a/tools/testing/shared/maple-shim.c b/tools/testing/shared/maple-shim.c
index 640df76f483e09f3b6f85612786060dd273e2362..9d7b743415660305416e972fa75b56824211b0eb 100644
--- a/tools/testing/shared/maple-shim.c
+++ b/tools/testing/shared/maple-shim.c
@@ -3,5 +3,6 @@
 /* Very simple shim around the maple tree. */
 
 #include "maple-shared.h"
+#include <linux/slab.h>
 
 #include "../../../lib/maple_tree.c"
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 6b6e2b05918c9f95b537f26e20a943b34082825a..d5b87fa6a133f6d676488de2538c509e0f0e1d54 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -26,6 +26,7 @@
 #include <linux/mm.h>
 #include <linux/rbtree.h>
 #include <linux/refcount.h>
+#include <linux/slab.h>
 
 extern unsigned long stack_guard_gap;
 #ifdef CONFIG_MMU
@@ -509,65 +510,6 @@ struct pagetable_move_control {
 		.len_in = len_,						\
 	}
 
-struct kmem_cache_args {
-	/**
-	 * @align: The required alignment for the objects.
-	 *
-	 * %0 means no specific alignment is requested.
-	 */
-	unsigned int align;
-	/**
-	 * @useroffset: Usercopy region offset.
-	 *
-	 * %0 is a valid offset, when @usersize is non-%0
-	 */
-	unsigned int useroffset;
-	/**
-	 * @usersize: Usercopy region size.
-	 *
-	 * %0 means no usercopy region is specified.
-	 */
-	unsigned int usersize;
-	/**
-	 * @freeptr_offset: Custom offset for the free pointer
-	 * in &SLAB_TYPESAFE_BY_RCU caches
-	 *
-	 * By default &SLAB_TYPESAFE_BY_RCU caches place the free pointer
-	 * outside of the object. This might cause the object to grow in size.
-	 * Cache creators that have a reason to avoid this can specify a custom
-	 * free pointer offset in their struct where the free pointer will be
-	 * placed.
-	 *
-	 * Note that placing the free pointer inside the object requires the
-	 * caller to ensure that no fields are invalidated that are required to
-	 * guard against object recycling (See &SLAB_TYPESAFE_BY_RCU for
-	 * details).
-	 *
-	 * Using %0 as a value for @freeptr_offset is valid. If @freeptr_offset
-	 * is specified, %use_freeptr_offset must be set %true.
-	 *
-	 * Note that @ctor currently isn't supported with custom free pointers
-	 * as a @ctor requires an external free pointer.
-	 */
-	unsigned int freeptr_offset;
-	/**
-	 * @use_freeptr_offset: Whether a @freeptr_offset is used.
-	 */
-	bool use_freeptr_offset;
-	/**
-	 * @ctor: A constructor for the objects.
-	 *
-	 * The constructor is invoked for each object in a newly allocated slab
-	 * page. It is the cache user's responsibility to free object in the
-	 * same state as after calling the constructor, or deal appropriately
-	 * with any differences between a freshly constructed and a reallocated
-	 * object.
-	 *
-	 * %NULL means no constructor.
-	 */
-	void (*ctor)(void *);
-};
-
 static inline void vma_iter_invalidate(struct vma_iterator *vmi)
 {
 	mas_pause(&vmi->mas);
@@ -652,38 +594,6 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_lock_seq = UINT_MAX;
 }
 
-struct kmem_cache {
-	const char *name;
-	size_t object_size;
-	struct kmem_cache_args *args;
-};
-
-static inline struct kmem_cache *__kmem_cache_create(const char *name,
-						     size_t object_size,
-						     struct kmem_cache_args *args)
-{
-	struct kmem_cache *ret = malloc(sizeof(struct kmem_cache));
-
-	ret->name = name;
-	ret->object_size = object_size;
-	ret->args = args;
-
-	return ret;
-}
-
-#define kmem_cache_create(__name, __object_size, __args, ...)           \
-	__kmem_cache_create((__name), (__object_size), (__args))
-
-static inline void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
-{
-	return calloc(1, s->object_size);
-}
-
-static inline void kmem_cache_free(struct kmem_cache *s, void *x)
-{
-	free(x);
-}
-
 /*
  * These are defined in vma.h, but sadly vm_stat_account() is referenced by
  * kernel/fork.c, so we have to these broadly available there, and temporarily

-- 
2.51.0



^ permalink raw reply	[flat|nested] 95+ messages in thread

* [PATCH v8 14/23] mm, vma: use percpu sheaves for vm_area_struct cache
  2025-09-10  8:01 [PATCH v8 00/23] SLUB percpu sheaves Vlastimil Babka
                   ` (12 preceding siblings ...)
  2025-09-10  8:01 ` [PATCH v8 13/23] tools/testing: Add support for changes to slab for sheaves Vlastimil Babka
@ 2025-09-10  8:01 ` Vlastimil Babka
  2025-09-10  8:01 ` [PATCH v8 15/23] maple_tree: use percpu sheaves for maple_node_cache Vlastimil Babka
                   ` (9 subsequent siblings)
  23 siblings, 0 replies; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-10  8:01 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, vbabka

Create the vm_area_struct cache with percpu sheaves of size 32 to
improve its performance.

Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/vma_init.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/vma_init.c b/mm/vma_init.c
index 8e53c7943561e7324e7992946b4065dec1149b82..52c6b55fac4519e0da39ca75ad018e14449d1d95 100644
--- a/mm/vma_init.c
+++ b/mm/vma_init.c
@@ -16,6 +16,7 @@ void __init vma_state_init(void)
 	struct kmem_cache_args args = {
 		.use_freeptr_offset = true,
 		.freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr),
+		.sheaf_capacity = 32,
 	};
 
 	vm_area_cachep = kmem_cache_create("vm_area_struct",

-- 
2.51.0



^ permalink raw reply	[flat|nested] 95+ messages in thread

* [PATCH v8 15/23] maple_tree: use percpu sheaves for maple_node_cache
  2025-09-10  8:01 [PATCH v8 00/23] SLUB percpu sheaves Vlastimil Babka
                   ` (13 preceding siblings ...)
  2025-09-10  8:01 ` [PATCH v8 14/23] mm, vma: use percpu sheaves for vm_area_struct cache Vlastimil Babka
@ 2025-09-10  8:01 ` Vlastimil Babka
  2025-09-12  2:20   ` Liam R. Howlett
  2025-10-16 15:16   ` D, Suneeth
  2025-09-10  8:01 ` [PATCH v8 16/23] tools/testing: include maple-shim.c in maple.c Vlastimil Babka
                   ` (8 subsequent siblings)
  23 siblings, 2 replies; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-10  8:01 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, vbabka

Set up the maple_node_cache with percpu sheaves of size 32 to hopefully
improve its performance. Note this will not immediately take advantage
of sheaf batching of kfree_rcu() operations due to the maple tree using
call_rcu with custom callbacks. The followup changes to maple tree will
change that and also make use of the prefilled sheaves functionality.

Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 lib/maple_tree.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 4f0e30b57b0cef9e5cf791f3f64f5898752db402..d034f170ac897341b40cfd050b6aee86b6d2cf60 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -6040,9 +6040,14 @@ bool mas_nomem(struct ma_state *mas, gfp_t gfp)
 
 void __init maple_tree_init(void)
 {
+	struct kmem_cache_args args = {
+		.align  = sizeof(struct maple_node),
+		.sheaf_capacity = 32,
+	};
+
 	maple_node_cache = kmem_cache_create("maple_node",
-			sizeof(struct maple_node), sizeof(struct maple_node),
-			SLAB_PANIC, NULL);
+			sizeof(struct maple_node), &args,
+			SLAB_PANIC);
 }
 
 /**

-- 
2.51.0



^ permalink raw reply	[flat|nested] 95+ messages in thread

* [PATCH v8 16/23] tools/testing: include maple-shim.c in maple.c
  2025-09-10  8:01 [PATCH v8 00/23] SLUB percpu sheaves Vlastimil Babka
                   ` (14 preceding siblings ...)
  2025-09-10  8:01 ` [PATCH v8 15/23] maple_tree: use percpu sheaves for maple_node_cache Vlastimil Babka
@ 2025-09-10  8:01 ` Vlastimil Babka
  2025-09-26 23:45   ` Suren Baghdasaryan
  2025-09-10  8:01 ` [PATCH v8 17/23] testing/radix-tree/maple: Hack around kfree_rcu not existing Vlastimil Babka
                   ` (7 subsequent siblings)
  23 siblings, 1 reply; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-10  8:01 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, vbabka

There's some duplicated code, and we are about to add more
functionality to maple-shared.h that needs to be available in the
userspace maple test, so include it via maple-shim.c.

Co-developed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 tools/testing/radix-tree/maple.c | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
index c0543060dae2510477963331fb0ccdffd78ea965..4a35e1e7c64b7ce347cbd1693beeaacb0c4c330e 100644
--- a/tools/testing/radix-tree/maple.c
+++ b/tools/testing/radix-tree/maple.c
@@ -8,14 +8,6 @@
  * difficult to handle in kernel tests.
  */
 
-#define CONFIG_DEBUG_MAPLE_TREE
-#define CONFIG_MAPLE_SEARCH
-#define MAPLE_32BIT (MAPLE_NODE_SLOTS > 31)
-#include "test.h"
-#include <stdlib.h>
-#include <time.h>
-#include <linux/init.h>
-
 #define module_init(x)
 #define module_exit(x)
 #define MODULE_AUTHOR(x)
@@ -23,7 +15,9 @@
 #define MODULE_LICENSE(x)
 #define dump_stack()	assert(0)
 
-#include "../../../lib/maple_tree.c"
+#include "test.h"
+
+#include "../shared/maple-shim.c"
 #include "../../../lib/test_maple_tree.c"
 
 #define RCU_RANGE_COUNT 1000

-- 
2.51.0



^ permalink raw reply	[flat|nested] 95+ messages in thread

* [PATCH v8 17/23] testing/radix-tree/maple: Hack around kfree_rcu not existing
  2025-09-10  8:01 [PATCH v8 00/23] SLUB percpu sheaves Vlastimil Babka
                   ` (15 preceding siblings ...)
  2025-09-10  8:01 ` [PATCH v8 16/23] tools/testing: include maple-shim.c in maple.c Vlastimil Babka
@ 2025-09-10  8:01 ` Vlastimil Babka
  2025-09-26 23:53   ` Suren Baghdasaryan
  2025-09-10  8:01 ` [PATCH v8 18/23] maple_tree: Use kfree_rcu in ma_free_rcu Vlastimil Babka
                   ` (6 subsequent siblings)
  23 siblings, 1 reply; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-10  8:01 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, vbabka, Pedro Falcato

From: "Liam R. Howlett" <Liam.Howlett@oracle.com>

liburcu doesn't have kfree_rcu (or anything similar). Despite that, we
can hack around it in a trivial fashion, by adding a wrapper.

The wrapper only works for maple_nodes because we cannot get the
kmem_cache pointer any other way in the test code.
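
In other words, with this wrapper a call such as kfree_rcu(node, rcu)
on a struct maple_node in the test build expands to roughly (sketch):

  /* node is a struct maple_node *, rcu is its struct rcu_head member */
  call_rcu(&node->rcu, maple_rcu_cb);

  /* maple_rcu_cb() then hands the node back to maple_node_cache once
   * the grace period has elapsed. */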

Link: https://lore.kernel.org/all/20250812162124.59417-1-pfalcato@suse.de/
Suggested-by: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 tools/testing/shared/maple-shared.h | 11 +++++++++++
 tools/testing/shared/maple-shim.c   |  6 ++++++
 2 files changed, 17 insertions(+)

diff --git a/tools/testing/shared/maple-shared.h b/tools/testing/shared/maple-shared.h
index dc4d30f3860b9bd23b4177c7d7926ac686887815..2a1e9a8594a2834326cd9374738b2a2c7c3f9f7c 100644
--- a/tools/testing/shared/maple-shared.h
+++ b/tools/testing/shared/maple-shared.h
@@ -10,4 +10,15 @@
 #include <time.h>
 #include "linux/init.h"
 
+void maple_rcu_cb(struct rcu_head *head);
+#define rcu_cb		maple_rcu_cb
+
+#define kfree_rcu(_struct, _memb)		\
+do {                                            \
+    typeof(_struct) _p_struct = (_struct);      \
+                                                \
+    call_rcu(&((_p_struct)->_memb), rcu_cb);    \
+} while(0);
+
+
 #endif /* __MAPLE_SHARED_H__ */
diff --git a/tools/testing/shared/maple-shim.c b/tools/testing/shared/maple-shim.c
index 9d7b743415660305416e972fa75b56824211b0eb..16252ee616c0489c80490ff25b8d255427bf9fdc 100644
--- a/tools/testing/shared/maple-shim.c
+++ b/tools/testing/shared/maple-shim.c
@@ -6,3 +6,9 @@
 #include <linux/slab.h>
 
 #include "../../../lib/maple_tree.c"
+
+void maple_rcu_cb(struct rcu_head *head) {
+	struct maple_node *node = container_of(head, struct maple_node, rcu);
+
+	kmem_cache_free(maple_node_cache, node);
+}

-- 
2.51.0



^ permalink raw reply	[flat|nested] 95+ messages in thread

* [PATCH v8 18/23] maple_tree: Use kfree_rcu in ma_free_rcu
  2025-09-10  8:01 [PATCH v8 00/23] SLUB percpu sheaves Vlastimil Babka
                   ` (16 preceding siblings ...)
  2025-09-10  8:01 ` [PATCH v8 17/23] testing/radix-tree/maple: Hack around kfree_rcu not existing Vlastimil Babka
@ 2025-09-10  8:01 ` Vlastimil Babka
  2025-09-17 11:46   ` Harry Yoo
  2025-09-10  8:01 ` [PATCH v8 19/23] maple_tree: Replace mt_free_one() with kfree() Vlastimil Babka
                   ` (5 subsequent siblings)
  23 siblings, 1 reply; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-10  8:01 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, vbabka, Pedro Falcato

From: Pedro Falcato <pfalcato@suse.de>

kfree_rcu() is an optimized version of call_rcu() + kfree(). It used to
be impossible to call it on non-kmalloc objects, but this restriction
has been lifted since SLOB was dropped from the kernel and since commit
6c6c47b063b5 ("mm, slab: call kvfree_rcu_barrier() from kmem_cache_destroy()").

Thus, replace call_rcu + mt_free_rcu with kfree_rcu.
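
A minimal before/after sketch of the change in ma_free_rcu(); the node
ends up back in maple_node_cache either way, since kfree() also handles
kmem_cache-allocated objects:

  /* before: explicit RCU callback freeing back to the cache */
  call_rcu(&node->rcu, mt_free_rcu);

  /* after: same effect, but batched by the k(v)free_rcu machinery */
  kfree_rcu(node, rcu);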

Signed-off-by: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 lib/maple_tree.c | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index d034f170ac897341b40cfd050b6aee86b6d2cf60..c706e2e48f884fd156e25be2b17eb5e154774db7 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -187,13 +187,6 @@ static inline void mt_free_bulk(size_t size, void __rcu **nodes)
 	kmem_cache_free_bulk(maple_node_cache, size, (void **)nodes);
 }
 
-static void mt_free_rcu(struct rcu_head *head)
-{
-	struct maple_node *node = container_of(head, struct maple_node, rcu);
-
-	kmem_cache_free(maple_node_cache, node);
-}
-
 /*
  * ma_free_rcu() - Use rcu callback to free a maple node
  * @node: The node to free
@@ -204,7 +197,7 @@ static void mt_free_rcu(struct rcu_head *head)
 static void ma_free_rcu(struct maple_node *node)
 {
 	WARN_ON(node->parent != ma_parent_ptr(node));
-	call_rcu(&node->rcu, mt_free_rcu);
+	kfree_rcu(node, rcu);
 }
 
 static void mt_set_height(struct maple_tree *mt, unsigned char height)
@@ -5099,7 +5092,7 @@ static void mt_free_walk(struct rcu_head *head)
 	mt_free_bulk(node->slot_len, slots);
 
 free_leaf:
-	mt_free_rcu(&node->rcu);
+	mt_free_one(node);
 }
 
 static inline void __rcu **mte_destroy_descend(struct maple_enode **enode,
@@ -5183,7 +5176,7 @@ static void mt_destroy_walk(struct maple_enode *enode, struct maple_tree *mt,
 
 free_leaf:
 	if (free)
-		mt_free_rcu(&node->rcu);
+		mt_free_one(node);
 	else
 		mt_clear_meta(mt, node, node->type);
 }

-- 
2.51.0



^ permalink raw reply	[flat|nested] 95+ messages in thread

* [PATCH v8 19/23] maple_tree: Replace mt_free_one() with kfree()
  2025-09-10  8:01 [PATCH v8 00/23] SLUB percpu sheaves Vlastimil Babka
                   ` (17 preceding siblings ...)
  2025-09-10  8:01 ` [PATCH v8 18/23] maple_tree: Use kfree_rcu in ma_free_rcu Vlastimil Babka
@ 2025-09-10  8:01 ` Vlastimil Babka
  2025-09-27  0:06   ` Suren Baghdasaryan
  2025-09-10  8:01 ` [PATCH v8 20/23] tools/testing: Add support for prefilled slab sheafs Vlastimil Babka
                   ` (4 subsequent siblings)
  23 siblings, 1 reply; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-10  8:01 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, vbabka, Pedro Falcato

From: Pedro Falcato <pfalcato@suse.de>

kfree() is a little shorter and works with kmem_cache_alloc'd pointers
too. It also lets us remove one more helper.
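
As a small sketch of why the wrapper is unnecessary (struct and cache names
are hypothetical, not from this patch): kfree() derives the owning cache
from the object's slab, so it frees the object back to its own kmem_cache:

#include <linux/slab.h>

struct bar { unsigned long x; };	/* hypothetical */
static struct kmem_cache *bar_cache;	/* hypothetical cache */

static void bar_example(void)
{
	struct bar *b = kmem_cache_alloc(bar_cache, GFP_KERNEL);

	if (!b)
		return;

	/* equivalent to kmem_cache_free(bar_cache, b) */
	kfree(b);
}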

Signed-off-by: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 lib/maple_tree.c | 13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index c706e2e48f884fd156e25be2b17eb5e154774db7..0439aaacf6cb1f39d0d23af2e2a5af1d27ab32be 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -177,11 +177,6 @@ static inline int mt_alloc_bulk(gfp_t gfp, size_t size, void **nodes)
 	return kmem_cache_alloc_bulk(maple_node_cache, gfp, size, nodes);
 }
 
-static inline void mt_free_one(struct maple_node *node)
-{
-	kmem_cache_free(maple_node_cache, node);
-}
-
 static inline void mt_free_bulk(size_t size, void __rcu **nodes)
 {
 	kmem_cache_free_bulk(maple_node_cache, size, (void **)nodes);
@@ -5092,7 +5087,7 @@ static void mt_free_walk(struct rcu_head *head)
 	mt_free_bulk(node->slot_len, slots);
 
 free_leaf:
-	mt_free_one(node);
+	kfree(node);
 }
 
 static inline void __rcu **mte_destroy_descend(struct maple_enode **enode,
@@ -5176,7 +5171,7 @@ static void mt_destroy_walk(struct maple_enode *enode, struct maple_tree *mt,
 
 free_leaf:
 	if (free)
-		mt_free_one(node);
+		kfree(node);
 	else
 		mt_clear_meta(mt, node, node->type);
 }
@@ -5385,7 +5380,7 @@ void mas_destroy(struct ma_state *mas)
 			mt_free_bulk(count, (void __rcu **)&node->slot[1]);
 			total -= count;
 		}
-		mt_free_one(ma_mnode_ptr(node));
+		kfree(ma_mnode_ptr(node));
 		total--;
 	}
 
@@ -6373,7 +6368,7 @@ static void mas_dup_free(struct ma_state *mas)
 	}
 
 	node = mte_to_node(mas->node);
-	mt_free_one(node);
+	kfree(node);
 }
 
 /*

-- 
2.51.0



^ permalink raw reply	[flat|nested] 95+ messages in thread

* [PATCH v8 20/23] tools/testing: Add support for prefilled slab sheafs
  2025-09-10  8:01 [PATCH v8 00/23] SLUB percpu sheaves Vlastimil Babka
                   ` (18 preceding siblings ...)
  2025-09-10  8:01 ` [PATCH v8 19/23] maple_tree: Replace mt_free_one() with kfree() Vlastimil Babka
@ 2025-09-10  8:01 ` Vlastimil Babka
  2025-09-27  0:28   ` Suren Baghdasaryan
  2025-09-10  8:01 ` [PATCH v8 21/23] maple_tree: Prefilled sheaf conversion and testing Vlastimil Babka
                   ` (3 subsequent siblings)
  23 siblings, 1 reply; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-10  8:01 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, vbabka

From: "Liam R. Howlett" <Liam.Howlett@oracle.com>

Add the prefilled sheaf structs to the slab header and the associated
functions to the testing/shared/linux.c file.
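
For context, a minimal usage sketch of the prefilled sheaf API that the
shim mirrors; the signatures match the declarations added to
tools/include/linux/slab.h below, and "node_cache" is a hypothetical
sheaf-enabled cache:

/* Sketch only; error handling kept minimal. */
static void sheaf_usage_sketch(struct kmem_cache *node_cache)
{
	struct slab_sheaf *sheaf;
	void *obj;

	/* obtain a sheaf prefilled with at least 8 objects */
	sheaf = kmem_cache_prefill_sheaf(node_cache, GFP_KERNEL, 8);
	if (!sheaf)
		return;

	/* allocations succeed as long as the sheaf is not empty */
	obj = kmem_cache_alloc_from_sheaf(node_cache, GFP_NOWAIT, sheaf);

	/* top the sheaf back up if more objects might be needed */
	if (kmem_cache_refill_sheaf(node_cache, GFP_KERNEL, &sheaf, 8))
		pr_debug("refill failed; %u objects remain\n",
			 kmem_cache_sheaf_size(sheaf));

	if (obj)
		kmem_cache_free(node_cache, obj);

	/* hand back any unused objects together with the sheaf */
	kmem_cache_return_sheaf(node_cache, GFP_KERNEL, sheaf);
}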

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 tools/include/linux/slab.h   | 28 ++++++++++++++
 tools/testing/shared/linux.c | 89 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 117 insertions(+)

diff --git a/tools/include/linux/slab.h b/tools/include/linux/slab.h
index c5c5cc6db5668be2cc94c29065ccfa7ca7b4bb08..94937a699402bd1f31887dfb52b6fd0a3c986f43 100644
--- a/tools/include/linux/slab.h
+++ b/tools/include/linux/slab.h
@@ -123,6 +123,18 @@ struct kmem_cache_args {
 	void (*ctor)(void *);
 };
 
+struct slab_sheaf {
+	union {
+		struct list_head barn_list;
+		/* only used for prefilled sheafs */
+		unsigned int capacity;
+	};
+	struct kmem_cache *cache;
+	unsigned int size;
+	int node; /* only used for rcu_sheaf */
+	void *objects[];
+};
+
 static inline void *kzalloc(size_t size, gfp_t gfp)
 {
 	return kmalloc(size, gfp | __GFP_ZERO);
@@ -173,5 +185,21 @@ __kmem_cache_create(const char *name, unsigned int size, unsigned int align,
 void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t size, void **list);
 int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
 			  void **list);
+struct slab_sheaf *
+kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size);
+
+void *
+kmem_cache_alloc_from_sheaf(struct kmem_cache *s, gfp_t gfp,
+		struct slab_sheaf *sheaf);
+
+void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
+		struct slab_sheaf *sheaf);
+int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
+		struct slab_sheaf **sheafp, unsigned int size);
+
+static inline unsigned int kmem_cache_sheaf_size(struct slab_sheaf *sheaf)
+{
+	return sheaf->size;
+}
 
 #endif		/* _TOOLS_SLAB_H */
diff --git a/tools/testing/shared/linux.c b/tools/testing/shared/linux.c
index 97b8412ccbb6d222604c7b397c53c65618d8d51b..4ceff7969b78cf8e33cd1e021c68bc9f8a02a7a1 100644
--- a/tools/testing/shared/linux.c
+++ b/tools/testing/shared/linux.c
@@ -137,6 +137,12 @@ void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t size, void **list)
 	if (kmalloc_verbose)
 		pr_debug("Bulk free %p[0-%zu]\n", list, size - 1);
 
+	if (cachep->exec_callback) {
+		if (cachep->callback)
+			cachep->callback(cachep->private);
+		cachep->exec_callback = false;
+	}
+
 	pthread_mutex_lock(&cachep->lock);
 	for (int i = 0; i < size; i++)
 		kmem_cache_free_locked(cachep, list[i]);
@@ -242,6 +248,89 @@ __kmem_cache_create_args(const char *name, unsigned int size,
 	return ret;
 }
 
+struct slab_sheaf *
+kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
+{
+	struct slab_sheaf *sheaf;
+	unsigned int capacity;
+
+	if (s->exec_callback) {
+		if (s->callback)
+			s->callback(s->private);
+		s->exec_callback = false;
+	}
+
+	capacity = max(size, s->sheaf_capacity);
+
+	sheaf = calloc(1, sizeof(*sheaf) + sizeof(void *) * capacity);
+	if (!sheaf)
+		return NULL;
+
+	sheaf->cache = s;
+	sheaf->capacity = capacity;
+	sheaf->size = kmem_cache_alloc_bulk(s, gfp, size, sheaf->objects);
+	if (!sheaf->size) {
+		free(sheaf);
+		return NULL;
+	}
+
+	return sheaf;
+}
+
+int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
+		 struct slab_sheaf **sheafp, unsigned int size)
+{
+	struct slab_sheaf *sheaf = *sheafp;
+	int refill;
+
+	if (sheaf->size >= size)
+		return 0;
+
+	if (size > sheaf->capacity) {
+		sheaf = kmem_cache_prefill_sheaf(s, gfp, size);
+		if (!sheaf)
+			return -ENOMEM;
+
+		kmem_cache_return_sheaf(s, gfp, *sheafp);
+		*sheafp = sheaf;
+		return 0;
+	}
+
+	refill = kmem_cache_alloc_bulk(s, gfp, size - sheaf->size,
+				       &sheaf->objects[sheaf->size]);
+	if (!refill)
+		return -ENOMEM;
+
+	sheaf->size += refill;
+	return 0;
+}
+
+void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
+		 struct slab_sheaf *sheaf)
+{
+	if (sheaf->size)
+		kmem_cache_free_bulk(s, sheaf->size, &sheaf->objects[0]);
+
+	free(sheaf);
+}
+
+void *
+kmem_cache_alloc_from_sheaf(struct kmem_cache *s, gfp_t gfp,
+		struct slab_sheaf *sheaf)
+{
+	void *obj;
+
+	if (sheaf->size == 0) {
+		printf("Nothing left in sheaf!\n");
+		return NULL;
+	}
+
+	obj = sheaf->objects[--sheaf->size];
+	sheaf->objects[sheaf->size] = NULL;
+
+	return obj;
+}
+
 /*
  * Test the test infrastructure for kem_cache_alloc/free and bulk counterparts.
  */

-- 
2.51.0



^ permalink raw reply	[flat|nested] 95+ messages in thread

* [PATCH v8 21/23] maple_tree: Prefilled sheaf conversion and testing
  2025-09-10  8:01 [PATCH v8 00/23] SLUB percpu sheaves Vlastimil Babka
                   ` (19 preceding siblings ...)
  2025-09-10  8:01 ` [PATCH v8 20/23] tools/testing: Add support for prefilled slab sheafs Vlastimil Babka
@ 2025-09-10  8:01 ` Vlastimil Babka
  2025-09-27  1:08   ` Suren Baghdasaryan
  2025-09-10  8:01 ` [PATCH v8 22/23] maple_tree: Add single node allocation support to maple state Vlastimil Babka
                   ` (2 subsequent siblings)
  23 siblings, 1 reply; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-10  8:01 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, vbabka

From: "Liam R. Howlett" <Liam.Howlett@oracle.com>

Use prefilled sheaves instead of bulk allocations. This should speed up
the allocations and the return path of unused allocations.

Remove the push and pop of nodes from the maple state as this is now
handled by the slab layer with sheaves.

Tests have been removed where necessary, since the tree features they
exercised have been removed.
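
As a simplified sketch of the resulting flow (condensed from the
mas_alloc_nodes() and mas_pop_node() hunks below; error handling elided):

/* Sketch only: prefill or refill the sheaf to cover mas->node_request. */
static void sketch_fill_nodes(struct ma_state *mas, gfp_t gfp)
{
	if (!mas->sheaf)
		mas->sheaf = kmem_cache_prefill_sheaf(maple_node_cache, gfp,
						      mas->node_request);
	else if (kmem_cache_sheaf_size(mas->sheaf) < mas->node_request)
		kmem_cache_refill_sheaf(maple_node_cache, gfp, &mas->sheaf,
					mas->node_request);
	mas->node_request = 0;
}

/* Individual nodes are then drawn from the sheaf instead of a node stack. */
static struct maple_node *sketch_pop_node(struct ma_state *mas)
{
	return kmem_cache_alloc_from_sheaf(maple_node_cache, GFP_NOWAIT,
					   mas->sheaf);
}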

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/maple_tree.h       |   6 +-
 lib/maple_tree.c                 | 326 ++++++---------------------
 tools/testing/radix-tree/maple.c | 461 ++-------------------------------------
 tools/testing/shared/linux.c     |   5 +-
 4 files changed, 88 insertions(+), 710 deletions(-)

diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
index bafe143b1f783202e27b32567fffee4149e8e266..166fd67e00d882b1e6de1f80c1b590bba7497cd3 100644
--- a/include/linux/maple_tree.h
+++ b/include/linux/maple_tree.h
@@ -442,7 +442,8 @@ struct ma_state {
 	struct maple_enode *node;	/* The node containing this entry */
 	unsigned long min;		/* The minimum index of this node - implied pivot min */
 	unsigned long max;		/* The maximum index of this node - implied pivot max */
-	struct maple_alloc *alloc;	/* Allocated nodes for this operation */
+	struct slab_sheaf *sheaf;	/* Allocated nodes for this operation */
+	unsigned long node_request;
 	enum maple_status status;	/* The status of the state (active, start, none, etc) */
 	unsigned char depth;		/* depth of tree descent during write */
 	unsigned char offset;
@@ -490,7 +491,8 @@ struct ma_wr_state {
 		.status = ma_start,					\
 		.min = 0,						\
 		.max = ULONG_MAX,					\
-		.alloc = NULL,						\
+		.node_request= 0,					\
+		.sheaf = NULL,						\
 		.mas_flags = 0,						\
 		.store_type = wr_invalid,				\
 	}
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 0439aaacf6cb1f39d0d23af2e2a5af1d27ab32be..a3fcb20227e506ed209554cc8c041a53f7ef4903 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -182,6 +182,22 @@ static inline void mt_free_bulk(size_t size, void __rcu **nodes)
 	kmem_cache_free_bulk(maple_node_cache, size, (void **)nodes);
 }
 
+static void mt_return_sheaf(struct slab_sheaf *sheaf)
+{
+	kmem_cache_return_sheaf(maple_node_cache, GFP_NOWAIT, sheaf);
+}
+
+static struct slab_sheaf *mt_get_sheaf(gfp_t gfp, int count)
+{
+	return kmem_cache_prefill_sheaf(maple_node_cache, gfp, count);
+}
+
+static int mt_refill_sheaf(gfp_t gfp, struct slab_sheaf **sheaf,
+		unsigned int size)
+{
+	return kmem_cache_refill_sheaf(maple_node_cache, gfp, sheaf, size);
+}
+
 /*
  * ma_free_rcu() - Use rcu callback to free a maple node
  * @node: The node to free
@@ -574,67 +590,6 @@ static __always_inline bool mte_dead_node(const struct maple_enode *enode)
 	return ma_dead_node(node);
 }
 
-/*
- * mas_allocated() - Get the number of nodes allocated in a maple state.
- * @mas: The maple state
- *
- * The ma_state alloc member is overloaded to hold a pointer to the first
- * allocated node or to the number of requested nodes to allocate.  If bit 0 is
- * set, then the alloc contains the number of requested nodes.  If there is an
- * allocated node, then the total allocated nodes is in that node.
- *
- * Return: The total number of nodes allocated
- */
-static inline unsigned long mas_allocated(const struct ma_state *mas)
-{
-	if (!mas->alloc || ((unsigned long)mas->alloc & 0x1))
-		return 0;
-
-	return mas->alloc->total;
-}
-
-/*
- * mas_set_alloc_req() - Set the requested number of allocations.
- * @mas: the maple state
- * @count: the number of allocations.
- *
- * The requested number of allocations is either in the first allocated node,
- * located in @mas->alloc->request_count, or directly in @mas->alloc if there is
- * no allocated node.  Set the request either in the node or do the necessary
- * encoding to store in @mas->alloc directly.
- */
-static inline void mas_set_alloc_req(struct ma_state *mas, unsigned long count)
-{
-	if (!mas->alloc || ((unsigned long)mas->alloc & 0x1)) {
-		if (!count)
-			mas->alloc = NULL;
-		else
-			mas->alloc = (struct maple_alloc *)(((count) << 1U) | 1U);
-		return;
-	}
-
-	mas->alloc->request_count = count;
-}
-
-/*
- * mas_alloc_req() - get the requested number of allocations.
- * @mas: The maple state
- *
- * The alloc count is either stored directly in @mas, or in
- * @mas->alloc->request_count if there is at least one node allocated.  Decode
- * the request count if it's stored directly in @mas->alloc.
- *
- * Return: The allocation request count.
- */
-static inline unsigned int mas_alloc_req(const struct ma_state *mas)
-{
-	if ((unsigned long)mas->alloc & 0x1)
-		return (unsigned long)(mas->alloc) >> 1;
-	else if (mas->alloc)
-		return mas->alloc->request_count;
-	return 0;
-}
-
 /*
  * ma_pivots() - Get a pointer to the maple node pivots.
  * @node: the maple node
@@ -1120,77 +1075,15 @@ static int mas_ascend(struct ma_state *mas)
  */
 static inline struct maple_node *mas_pop_node(struct ma_state *mas)
 {
-	struct maple_alloc *ret, *node = mas->alloc;
-	unsigned long total = mas_allocated(mas);
-	unsigned int req = mas_alloc_req(mas);
+	struct maple_node *ret;
 
-	/* nothing or a request pending. */
-	if (WARN_ON(!total))
+	if (WARN_ON_ONCE(!mas->sheaf))
 		return NULL;
 
-	if (total == 1) {
-		/* single allocation in this ma_state */
-		mas->alloc = NULL;
-		ret = node;
-		goto single_node;
-	}
-
-	if (node->node_count == 1) {
-		/* Single allocation in this node. */
-		mas->alloc = node->slot[0];
-		mas->alloc->total = node->total - 1;
-		ret = node;
-		goto new_head;
-	}
-	node->total--;
-	ret = node->slot[--node->node_count];
-	node->slot[node->node_count] = NULL;
-
-single_node:
-new_head:
-	if (req) {
-		req++;
-		mas_set_alloc_req(mas, req);
-	}
-
+	ret = kmem_cache_alloc_from_sheaf(maple_node_cache, GFP_NOWAIT, mas->sheaf);
 	memset(ret, 0, sizeof(*ret));
-	return (struct maple_node *)ret;
-}
-
-/*
- * mas_push_node() - Push a node back on the maple state allocation.
- * @mas: The maple state
- * @used: The used maple node
- *
- * Stores the maple node back into @mas->alloc for reuse.  Updates allocated and
- * requested node count as necessary.
- */
-static inline void mas_push_node(struct ma_state *mas, struct maple_node *used)
-{
-	struct maple_alloc *reuse = (struct maple_alloc *)used;
-	struct maple_alloc *head = mas->alloc;
-	unsigned long count;
-	unsigned int requested = mas_alloc_req(mas);
 
-	count = mas_allocated(mas);
-
-	reuse->request_count = 0;
-	reuse->node_count = 0;
-	if (count) {
-		if (head->node_count < MAPLE_ALLOC_SLOTS) {
-			head->slot[head->node_count++] = reuse;
-			head->total++;
-			goto done;
-		}
-		reuse->slot[0] = head;
-		reuse->node_count = 1;
-	}
-
-	reuse->total = count + 1;
-	mas->alloc = reuse;
-done:
-	if (requested > 1)
-		mas_set_alloc_req(mas, requested - 1);
+	return ret;
 }
 
 /*
@@ -1200,75 +1093,32 @@ static inline void mas_push_node(struct ma_state *mas, struct maple_node *used)
  */
 static inline void mas_alloc_nodes(struct ma_state *mas, gfp_t gfp)
 {
-	struct maple_alloc *node;
-	unsigned long allocated = mas_allocated(mas);
-	unsigned int requested = mas_alloc_req(mas);
-	unsigned int count;
-	void **slots = NULL;
-	unsigned int max_req = 0;
-
-	if (!requested)
-		return;
+	if (unlikely(mas->sheaf)) {
+		unsigned long refill = mas->node_request;
 
-	mas_set_alloc_req(mas, 0);
-	if (mas->mas_flags & MA_STATE_PREALLOC) {
-		if (allocated)
+		if(kmem_cache_sheaf_size(mas->sheaf) >= refill) {
+			mas->node_request = 0;
 			return;
-		WARN_ON(!allocated);
-	}
-
-	if (!allocated || mas->alloc->node_count == MAPLE_ALLOC_SLOTS) {
-		node = (struct maple_alloc *)mt_alloc_one(gfp);
-		if (!node)
-			goto nomem_one;
-
-		if (allocated) {
-			node->slot[0] = mas->alloc;
-			node->node_count = 1;
-		} else {
-			node->node_count = 0;
 		}
 
-		mas->alloc = node;
-		node->total = ++allocated;
-		node->request_count = 0;
-		requested--;
-	}
+		if (mt_refill_sheaf(gfp, &mas->sheaf, refill))
+			goto error;
 
-	node = mas->alloc;
-	while (requested) {
-		max_req = MAPLE_ALLOC_SLOTS - node->node_count;
-		slots = (void **)&node->slot[node->node_count];
-		max_req = min(requested, max_req);
-		count = mt_alloc_bulk(gfp, max_req, slots);
-		if (!count)
-			goto nomem_bulk;
-
-		if (node->node_count == 0) {
-			node->slot[0]->node_count = 0;
-			node->slot[0]->request_count = 0;
-		}
+		mas->node_request = 0;
+		return;
+	}
 
-		node->node_count += count;
-		allocated += count;
-		/* find a non-full node*/
-		do {
-			node = node->slot[0];
-		} while (unlikely(node->node_count == MAPLE_ALLOC_SLOTS));
-		requested -= count;
+	mas->sheaf = mt_get_sheaf(gfp, mas->node_request);
+	if (likely(mas->sheaf)) {
+		mas->node_request = 0;
+		return;
 	}
-	mas->alloc->total = allocated;
-	return;
 
-nomem_bulk:
-	/* Clean up potential freed allocations on bulk failure */
-	memset(slots, 0, max_req * sizeof(unsigned long));
-	mas->alloc->total = allocated;
-nomem_one:
-	mas_set_alloc_req(mas, requested);
+error:
 	mas_set_err(mas, -ENOMEM);
 }
 
+
 /*
  * mas_free() - Free an encoded maple node
  * @mas: The maple state
@@ -1279,42 +1129,7 @@ static inline void mas_alloc_nodes(struct ma_state *mas, gfp_t gfp)
  */
 static inline void mas_free(struct ma_state *mas, struct maple_enode *used)
 {
-	struct maple_node *tmp = mte_to_node(used);
-
-	if (mt_in_rcu(mas->tree))
-		ma_free_rcu(tmp);
-	else
-		mas_push_node(mas, tmp);
-}
-
-/*
- * mas_node_count_gfp() - Check if enough nodes are allocated and request more
- * if there is not enough nodes.
- * @mas: The maple state
- * @count: The number of nodes needed
- * @gfp: the gfp flags
- */
-static void mas_node_count_gfp(struct ma_state *mas, int count, gfp_t gfp)
-{
-	unsigned long allocated = mas_allocated(mas);
-
-	if (allocated < count) {
-		mas_set_alloc_req(mas, count - allocated);
-		mas_alloc_nodes(mas, gfp);
-	}
-}
-
-/*
- * mas_node_count() - Check if enough nodes are allocated and request more if
- * there is not enough nodes.
- * @mas: The maple state
- * @count: The number of nodes needed
- *
- * Note: Uses GFP_NOWAIT for gfp flags.
- */
-static void mas_node_count(struct ma_state *mas, int count)
-{
-	return mas_node_count_gfp(mas, count, GFP_NOWAIT);
+	ma_free_rcu(mte_to_node(used));
 }
 
 /*
@@ -2451,10 +2266,7 @@ static inline void mas_topiary_node(struct ma_state *mas,
 	enode = tmp_mas->node;
 	tmp = mte_to_node(enode);
 	mte_set_node_dead(enode);
-	if (in_rcu)
-		ma_free_rcu(tmp);
-	else
-		mas_push_node(mas, tmp);
+	ma_free_rcu(tmp);
 }
 
 /*
@@ -3980,7 +3792,7 @@ static inline void mas_wr_prealloc_setup(struct ma_wr_state *wr_mas)
  *
  * Return: Number of nodes required for preallocation.
  */
-static inline int mas_prealloc_calc(struct ma_wr_state *wr_mas, void *entry)
+static inline void mas_prealloc_calc(struct ma_wr_state *wr_mas, void *entry)
 {
 	struct ma_state *mas = wr_mas->mas;
 	unsigned char height = mas_mt_height(mas);
@@ -4026,7 +3838,7 @@ static inline int mas_prealloc_calc(struct ma_wr_state *wr_mas, void *entry)
 		WARN_ON_ONCE(1);
 	}
 
-	return ret;
+	mas->node_request = ret;
 }
 
 /*
@@ -4087,15 +3899,15 @@ static inline enum store_type mas_wr_store_type(struct ma_wr_state *wr_mas)
  */
 static inline void mas_wr_preallocate(struct ma_wr_state *wr_mas, void *entry)
 {
-	int request;
+	struct ma_state *mas = wr_mas->mas;
 
 	mas_wr_prealloc_setup(wr_mas);
-	wr_mas->mas->store_type = mas_wr_store_type(wr_mas);
-	request = mas_prealloc_calc(wr_mas, entry);
-	if (!request)
+	mas->store_type = mas_wr_store_type(wr_mas);
+	mas_prealloc_calc(wr_mas, entry);
+	if (!mas->node_request)
 		return;
 
-	mas_node_count(wr_mas->mas, request);
+	mas_alloc_nodes(mas, GFP_NOWAIT);
 }
 
 /**
@@ -5208,7 +5020,6 @@ static inline void mte_destroy_walk(struct maple_enode *enode,
  */
 void *mas_store(struct ma_state *mas, void *entry)
 {
-	int request;
 	MA_WR_STATE(wr_mas, mas, entry);
 
 	trace_ma_write(__func__, mas, 0, entry);
@@ -5238,11 +5049,11 @@ void *mas_store(struct ma_state *mas, void *entry)
 		return wr_mas.content;
 	}
 
-	request = mas_prealloc_calc(&wr_mas, entry);
-	if (!request)
+	mas_prealloc_calc(&wr_mas, entry);
+	if (!mas->node_request)
 		goto store;
 
-	mas_node_count(mas, request);
+	mas_alloc_nodes(mas, GFP_NOWAIT);
 	if (mas_is_err(mas))
 		return NULL;
 
@@ -5330,20 +5141,19 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc);
 int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
 {
 	MA_WR_STATE(wr_mas, mas, entry);
-	int ret = 0;
-	int request;
 
 	mas_wr_prealloc_setup(&wr_mas);
 	mas->store_type = mas_wr_store_type(&wr_mas);
-	request = mas_prealloc_calc(&wr_mas, entry);
-	if (!request)
+	mas_prealloc_calc(&wr_mas, entry);
+	if (!mas->node_request)
 		goto set_flag;
 
 	mas->mas_flags &= ~MA_STATE_PREALLOC;
-	mas_node_count_gfp(mas, request, gfp);
+	mas_alloc_nodes(mas, gfp);
 	if (mas_is_err(mas)) {
-		mas_set_alloc_req(mas, 0);
-		ret = xa_err(mas->node);
+		int ret = xa_err(mas->node);
+
+		mas->node_request = 0;
 		mas_destroy(mas);
 		mas_reset(mas);
 		return ret;
@@ -5351,7 +5161,7 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
 
 set_flag:
 	mas->mas_flags |= MA_STATE_PREALLOC;
-	return ret;
+	return 0;
 }
 EXPORT_SYMBOL_GPL(mas_preallocate);
 
@@ -5365,26 +5175,13 @@ EXPORT_SYMBOL_GPL(mas_preallocate);
  */
 void mas_destroy(struct ma_state *mas)
 {
-	struct maple_alloc *node;
-	unsigned long total;
-
 	mas->mas_flags &= ~MA_STATE_PREALLOC;
 
-	total = mas_allocated(mas);
-	while (total) {
-		node = mas->alloc;
-		mas->alloc = node->slot[0];
-		if (node->node_count > 1) {
-			size_t count = node->node_count - 1;
-
-			mt_free_bulk(count, (void __rcu **)&node->slot[1]);
-			total -= count;
-		}
-		kfree(ma_mnode_ptr(node));
-		total--;
-	}
+	mas->node_request = 0;
+	if (mas->sheaf)
+		mt_return_sheaf(mas->sheaf);
 
-	mas->alloc = NULL;
+	mas->sheaf = NULL;
 }
 EXPORT_SYMBOL_GPL(mas_destroy);
 
@@ -6019,7 +5816,7 @@ bool mas_nomem(struct ma_state *mas, gfp_t gfp)
 		mas_alloc_nodes(mas, gfp);
 	}
 
-	if (!mas_allocated(mas))
+	if (!mas->sheaf)
 		return false;
 
 	mas->status = ma_start;
@@ -7414,8 +7211,9 @@ void mas_dump(const struct ma_state *mas)
 
 	pr_err("[%u/%u] index=%lx last=%lx\n", mas->offset, mas->end,
 	       mas->index, mas->last);
-	pr_err("     min=%lx max=%lx alloc=" PTR_FMT ", depth=%u, flags=%x\n",
-	       mas->min, mas->max, mas->alloc, mas->depth, mas->mas_flags);
+	pr_err("     min=%lx max=%lx sheaf=" PTR_FMT ", request %lu depth=%u, flags=%x\n",
+	       mas->min, mas->max, mas->sheaf, mas->node_request, mas->depth,
+	       mas->mas_flags);
 	if (mas->index > mas->last)
 		pr_err("Check index & last\n");
 }
diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
index 4a35e1e7c64b7ce347cbd1693beeaacb0c4c330e..72a8fe8e832a4150c6567b711768eba6a3fa6768 100644
--- a/tools/testing/radix-tree/maple.c
+++ b/tools/testing/radix-tree/maple.c
@@ -57,430 +57,6 @@ struct rcu_reader_struct {
 	struct rcu_test_struct2 *test;
 };
 
-static int get_alloc_node_count(struct ma_state *mas)
-{
-	int count = 1;
-	struct maple_alloc *node = mas->alloc;
-
-	if (!node || ((unsigned long)node & 0x1))
-		return 0;
-	while (node->node_count) {
-		count += node->node_count;
-		node = node->slot[0];
-	}
-	return count;
-}
-
-static void check_mas_alloc_node_count(struct ma_state *mas)
-{
-	mas_node_count_gfp(mas, MAPLE_ALLOC_SLOTS + 1, GFP_KERNEL);
-	mas_node_count_gfp(mas, MAPLE_ALLOC_SLOTS + 3, GFP_KERNEL);
-	MT_BUG_ON(mas->tree, get_alloc_node_count(mas) != mas->alloc->total);
-	mas_destroy(mas);
-}
-
-/*
- * check_new_node() - Check the creation of new nodes and error path
- * verification.
- */
-static noinline void __init check_new_node(struct maple_tree *mt)
-{
-
-	struct maple_node *mn, *mn2, *mn3;
-	struct maple_alloc *smn;
-	struct maple_node *nodes[100];
-	int i, j, total;
-
-	MA_STATE(mas, mt, 0, 0);
-
-	check_mas_alloc_node_count(&mas);
-
-	/* Try allocating 3 nodes */
-	mtree_lock(mt);
-	mt_set_non_kernel(0);
-	/* request 3 nodes to be allocated. */
-	mas_node_count(&mas, 3);
-	/* Allocation request of 3. */
-	MT_BUG_ON(mt, mas_alloc_req(&mas) != 3);
-	/* Allocate failed. */
-	MT_BUG_ON(mt, mas.node != MA_ERROR(-ENOMEM));
-	MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
-
-	MT_BUG_ON(mt, mas_allocated(&mas) != 3);
-	mn = mas_pop_node(&mas);
-	MT_BUG_ON(mt, not_empty(mn));
-	MT_BUG_ON(mt, mn == NULL);
-	MT_BUG_ON(mt, mas.alloc == NULL);
-	MT_BUG_ON(mt, mas.alloc->slot[0] == NULL);
-	mas_push_node(&mas, mn);
-	mas_reset(&mas);
-	mas_destroy(&mas);
-	mtree_unlock(mt);
-
-
-	/* Try allocating 1 node, then 2 more */
-	mtree_lock(mt);
-	/* Set allocation request to 1. */
-	mas_set_alloc_req(&mas, 1);
-	/* Check Allocation request of 1. */
-	MT_BUG_ON(mt, mas_alloc_req(&mas) != 1);
-	mas_set_err(&mas, -ENOMEM);
-	/* Validate allocation request. */
-	MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
-	/* Eat the requested node. */
-	mn = mas_pop_node(&mas);
-	MT_BUG_ON(mt, not_empty(mn));
-	MT_BUG_ON(mt, mn == NULL);
-	MT_BUG_ON(mt, mn->slot[0] != NULL);
-	MT_BUG_ON(mt, mn->slot[1] != NULL);
-	MT_BUG_ON(mt, mas_allocated(&mas) != 0);
-
-	mn->parent = ma_parent_ptr(mn);
-	ma_free_rcu(mn);
-	mas.status = ma_start;
-	mas_destroy(&mas);
-	/* Allocate 3 nodes, will fail. */
-	mas_node_count(&mas, 3);
-	/* Drop the lock and allocate 3 nodes. */
-	mas_nomem(&mas, GFP_KERNEL);
-	/* Ensure 3 are allocated. */
-	MT_BUG_ON(mt, mas_allocated(&mas) != 3);
-	/* Allocation request of 0. */
-	MT_BUG_ON(mt, mas_alloc_req(&mas) != 0);
-
-	MT_BUG_ON(mt, mas.alloc == NULL);
-	MT_BUG_ON(mt, mas.alloc->slot[0] == NULL);
-	MT_BUG_ON(mt, mas.alloc->slot[1] == NULL);
-	/* Ensure we counted 3. */
-	MT_BUG_ON(mt, mas_allocated(&mas) != 3);
-	/* Free. */
-	mas_reset(&mas);
-	mas_destroy(&mas);
-
-	/* Set allocation request to 1. */
-	mas_set_alloc_req(&mas, 1);
-	MT_BUG_ON(mt, mas_alloc_req(&mas) != 1);
-	mas_set_err(&mas, -ENOMEM);
-	/* Validate allocation request. */
-	MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
-	MT_BUG_ON(mt, mas_allocated(&mas) != 1);
-	/* Check the node is only one node. */
-	mn = mas_pop_node(&mas);
-	MT_BUG_ON(mt, not_empty(mn));
-	MT_BUG_ON(mt, mas_allocated(&mas) != 0);
-	MT_BUG_ON(mt, mn == NULL);
-	MT_BUG_ON(mt, mn->slot[0] != NULL);
-	MT_BUG_ON(mt, mn->slot[1] != NULL);
-	MT_BUG_ON(mt, mas_allocated(&mas) != 0);
-	mas_push_node(&mas, mn);
-	MT_BUG_ON(mt, mas_allocated(&mas) != 1);
-	MT_BUG_ON(mt, mas.alloc->node_count);
-
-	mas_set_alloc_req(&mas, 2); /* request 2 more. */
-	MT_BUG_ON(mt, mas_alloc_req(&mas) != 2);
-	mas_set_err(&mas, -ENOMEM);
-	MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
-	MT_BUG_ON(mt, mas_allocated(&mas) != 3);
-	MT_BUG_ON(mt, mas.alloc == NULL);
-	MT_BUG_ON(mt, mas.alloc->slot[0] == NULL);
-	MT_BUG_ON(mt, mas.alloc->slot[1] == NULL);
-	for (i = 2; i >= 0; i--) {
-		mn = mas_pop_node(&mas);
-		MT_BUG_ON(mt, mas_allocated(&mas) != i);
-		MT_BUG_ON(mt, !mn);
-		MT_BUG_ON(mt, not_empty(mn));
-		mn->parent = ma_parent_ptr(mn);
-		ma_free_rcu(mn);
-	}
-
-	total = 64;
-	mas_set_alloc_req(&mas, total); /* request 2 more. */
-	MT_BUG_ON(mt, mas_alloc_req(&mas) != total);
-	mas_set_err(&mas, -ENOMEM);
-	MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
-	for (i = total; i > 0; i--) {
-		unsigned int e = 0; /* expected node_count */
-
-		if (!MAPLE_32BIT) {
-			if (i >= 35)
-				e = i - 34;
-			else if (i >= 5)
-				e = i - 4;
-			else if (i >= 2)
-				e = i - 1;
-		} else {
-			if (i >= 4)
-				e = i - 3;
-			else if (i >= 1)
-				e = i - 1;
-			else
-				e = 0;
-		}
-
-		MT_BUG_ON(mt, mas.alloc->node_count != e);
-		mn = mas_pop_node(&mas);
-		MT_BUG_ON(mt, not_empty(mn));
-		MT_BUG_ON(mt, mas_allocated(&mas) != i - 1);
-		MT_BUG_ON(mt, !mn);
-		mn->parent = ma_parent_ptr(mn);
-		ma_free_rcu(mn);
-	}
-
-	total = 100;
-	for (i = 1; i < total; i++) {
-		mas_set_alloc_req(&mas, i);
-		mas_set_err(&mas, -ENOMEM);
-		MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
-		for (j = i; j > 0; j--) {
-			mn = mas_pop_node(&mas);
-			MT_BUG_ON(mt, mas_allocated(&mas) != j - 1);
-			MT_BUG_ON(mt, !mn);
-			MT_BUG_ON(mt, not_empty(mn));
-			mas_push_node(&mas, mn);
-			MT_BUG_ON(mt, mas_allocated(&mas) != j);
-			mn = mas_pop_node(&mas);
-			MT_BUG_ON(mt, not_empty(mn));
-			MT_BUG_ON(mt, mas_allocated(&mas) != j - 1);
-			mn->parent = ma_parent_ptr(mn);
-			ma_free_rcu(mn);
-		}
-		MT_BUG_ON(mt, mas_allocated(&mas) != 0);
-
-		mas_set_alloc_req(&mas, i);
-		mas_set_err(&mas, -ENOMEM);
-		MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
-		for (j = 0; j <= i/2; j++) {
-			MT_BUG_ON(mt, mas_allocated(&mas) != i - j);
-			nodes[j] = mas_pop_node(&mas);
-			MT_BUG_ON(mt, mas_allocated(&mas) != i - j - 1);
-		}
-
-		while (j) {
-			j--;
-			mas_push_node(&mas, nodes[j]);
-			MT_BUG_ON(mt, mas_allocated(&mas) != i - j);
-		}
-		MT_BUG_ON(mt, mas_allocated(&mas) != i);
-		for (j = 0; j <= i/2; j++) {
-			MT_BUG_ON(mt, mas_allocated(&mas) != i - j);
-			mn = mas_pop_node(&mas);
-			MT_BUG_ON(mt, not_empty(mn));
-			mn->parent = ma_parent_ptr(mn);
-			ma_free_rcu(mn);
-			MT_BUG_ON(mt, mas_allocated(&mas) != i - j - 1);
-		}
-		mas_reset(&mas);
-		MT_BUG_ON(mt, mas_nomem(&mas, GFP_KERNEL));
-		mas_destroy(&mas);
-
-	}
-
-	/* Set allocation request. */
-	total = 500;
-	mas_node_count(&mas, total);
-	/* Drop the lock and allocate the nodes. */
-	mas_nomem(&mas, GFP_KERNEL);
-	MT_BUG_ON(mt, !mas.alloc);
-	i = 1;
-	smn = mas.alloc;
-	while (i < total) {
-		for (j = 0; j < MAPLE_ALLOC_SLOTS; j++) {
-			i++;
-			MT_BUG_ON(mt, !smn->slot[j]);
-			if (i == total)
-				break;
-		}
-		smn = smn->slot[0]; /* next. */
-	}
-	MT_BUG_ON(mt, mas_allocated(&mas) != total);
-	mas_reset(&mas);
-	mas_destroy(&mas); /* Free. */
-
-	MT_BUG_ON(mt, mas_allocated(&mas) != 0);
-	for (i = 1; i < 128; i++) {
-		mas_node_count(&mas, i); /* Request */
-		mas_nomem(&mas, GFP_KERNEL); /* Fill request */
-		MT_BUG_ON(mt, mas_allocated(&mas) != i); /* check request filled */
-		for (j = i; j > 0; j--) { /*Free the requests */
-			mn = mas_pop_node(&mas); /* get the next node. */
-			MT_BUG_ON(mt, mn == NULL);
-			MT_BUG_ON(mt, not_empty(mn));
-			mn->parent = ma_parent_ptr(mn);
-			ma_free_rcu(mn);
-		}
-		MT_BUG_ON(mt, mas_allocated(&mas) != 0);
-	}
-
-	for (i = 1; i < MAPLE_NODE_MASK + 1; i++) {
-		MA_STATE(mas2, mt, 0, 0);
-		mas_node_count(&mas, i); /* Request */
-		mas_nomem(&mas, GFP_KERNEL); /* Fill request */
-		MT_BUG_ON(mt, mas_allocated(&mas) != i); /* check request filled */
-		for (j = 1; j <= i; j++) { /* Move the allocations to mas2 */
-			mn = mas_pop_node(&mas); /* get the next node. */
-			MT_BUG_ON(mt, mn == NULL);
-			MT_BUG_ON(mt, not_empty(mn));
-			mas_push_node(&mas2, mn);
-			MT_BUG_ON(mt, mas_allocated(&mas2) != j);
-		}
-		MT_BUG_ON(mt, mas_allocated(&mas) != 0);
-		MT_BUG_ON(mt, mas_allocated(&mas2) != i);
-
-		for (j = i; j > 0; j--) { /*Free the requests */
-			MT_BUG_ON(mt, mas_allocated(&mas2) != j);
-			mn = mas_pop_node(&mas2); /* get the next node. */
-			MT_BUG_ON(mt, mn == NULL);
-			MT_BUG_ON(mt, not_empty(mn));
-			mn->parent = ma_parent_ptr(mn);
-			ma_free_rcu(mn);
-		}
-		MT_BUG_ON(mt, mas_allocated(&mas2) != 0);
-	}
-
-
-	MT_BUG_ON(mt, mas_allocated(&mas) != 0);
-	mas_node_count(&mas, MAPLE_ALLOC_SLOTS + 1); /* Request */
-	MT_BUG_ON(mt, mas.node != MA_ERROR(-ENOMEM));
-	MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
-	MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS + 1);
-	MT_BUG_ON(mt, mas.alloc->node_count != MAPLE_ALLOC_SLOTS);
-
-	mn = mas_pop_node(&mas); /* get the next node. */
-	MT_BUG_ON(mt, mn == NULL);
-	MT_BUG_ON(mt, not_empty(mn));
-	MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS);
-	MT_BUG_ON(mt, mas.alloc->node_count != MAPLE_ALLOC_SLOTS - 1);
-
-	mas_push_node(&mas, mn);
-	MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS + 1);
-	MT_BUG_ON(mt, mas.alloc->node_count != MAPLE_ALLOC_SLOTS);
-
-	/* Check the limit of pop/push/pop */
-	mas_node_count(&mas, MAPLE_ALLOC_SLOTS + 2); /* Request */
-	MT_BUG_ON(mt, mas_alloc_req(&mas) != 1);
-	MT_BUG_ON(mt, mas.node != MA_ERROR(-ENOMEM));
-	MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
-	MT_BUG_ON(mt, mas_alloc_req(&mas));
-	MT_BUG_ON(mt, mas.alloc->node_count != 1);
-	MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS + 2);
-	mn = mas_pop_node(&mas);
-	MT_BUG_ON(mt, not_empty(mn));
-	MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS + 1);
-	MT_BUG_ON(mt, mas.alloc->node_count  != MAPLE_ALLOC_SLOTS);
-	mas_push_node(&mas, mn);
-	MT_BUG_ON(mt, mas.alloc->node_count != 1);
-	MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS + 2);
-	mn = mas_pop_node(&mas);
-	MT_BUG_ON(mt, not_empty(mn));
-	mn->parent = ma_parent_ptr(mn);
-	ma_free_rcu(mn);
-	for (i = 1; i <= MAPLE_ALLOC_SLOTS + 1; i++) {
-		mn = mas_pop_node(&mas);
-		MT_BUG_ON(mt, not_empty(mn));
-		mn->parent = ma_parent_ptr(mn);
-		ma_free_rcu(mn);
-	}
-	MT_BUG_ON(mt, mas_allocated(&mas) != 0);
-
-
-	for (i = 3; i < MAPLE_NODE_MASK * 3; i++) {
-		mas.node = MA_ERROR(-ENOMEM);
-		mas_node_count(&mas, i); /* Request */
-		mas_nomem(&mas, GFP_KERNEL); /* Fill request */
-		mn = mas_pop_node(&mas); /* get the next node. */
-		mas_push_node(&mas, mn); /* put it back */
-		mas_destroy(&mas);
-
-		mas.node = MA_ERROR(-ENOMEM);
-		mas_node_count(&mas, i); /* Request */
-		mas_nomem(&mas, GFP_KERNEL); /* Fill request */
-		mn = mas_pop_node(&mas); /* get the next node. */
-		mn2 = mas_pop_node(&mas); /* get the next node. */
-		mas_push_node(&mas, mn); /* put them back */
-		mas_push_node(&mas, mn2);
-		mas_destroy(&mas);
-
-		mas.node = MA_ERROR(-ENOMEM);
-		mas_node_count(&mas, i); /* Request */
-		mas_nomem(&mas, GFP_KERNEL); /* Fill request */
-		mn = mas_pop_node(&mas); /* get the next node. */
-		mn2 = mas_pop_node(&mas); /* get the next node. */
-		mn3 = mas_pop_node(&mas); /* get the next node. */
-		mas_push_node(&mas, mn); /* put them back */
-		mas_push_node(&mas, mn2);
-		mas_push_node(&mas, mn3);
-		mas_destroy(&mas);
-
-		mas.node = MA_ERROR(-ENOMEM);
-		mas_node_count(&mas, i); /* Request */
-		mas_nomem(&mas, GFP_KERNEL); /* Fill request */
-		mn = mas_pop_node(&mas); /* get the next node. */
-		mn->parent = ma_parent_ptr(mn);
-		ma_free_rcu(mn);
-		mas_destroy(&mas);
-
-		mas.node = MA_ERROR(-ENOMEM);
-		mas_node_count(&mas, i); /* Request */
-		mas_nomem(&mas, GFP_KERNEL); /* Fill request */
-		mn = mas_pop_node(&mas); /* get the next node. */
-		mn->parent = ma_parent_ptr(mn);
-		ma_free_rcu(mn);
-		mn = mas_pop_node(&mas); /* get the next node. */
-		mn->parent = ma_parent_ptr(mn);
-		ma_free_rcu(mn);
-		mn = mas_pop_node(&mas); /* get the next node. */
-		mn->parent = ma_parent_ptr(mn);
-		ma_free_rcu(mn);
-		mas_destroy(&mas);
-	}
-
-	mas.node = MA_ERROR(-ENOMEM);
-	mas_node_count(&mas, 5); /* Request */
-	mas_nomem(&mas, GFP_KERNEL); /* Fill request */
-	MT_BUG_ON(mt, mas_allocated(&mas) != 5);
-	mas.node = MA_ERROR(-ENOMEM);
-	mas_node_count(&mas, 10); /* Request */
-	mas_nomem(&mas, GFP_KERNEL); /* Fill request */
-	mas.status = ma_start;
-	MT_BUG_ON(mt, mas_allocated(&mas) != 10);
-	mas_destroy(&mas);
-
-	mas.node = MA_ERROR(-ENOMEM);
-	mas_node_count(&mas, MAPLE_ALLOC_SLOTS - 1); /* Request */
-	mas_nomem(&mas, GFP_KERNEL); /* Fill request */
-	MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS - 1);
-	mas.node = MA_ERROR(-ENOMEM);
-	mas_node_count(&mas, 10 + MAPLE_ALLOC_SLOTS - 1); /* Request */
-	mas_nomem(&mas, GFP_KERNEL); /* Fill request */
-	mas.status = ma_start;
-	MT_BUG_ON(mt, mas_allocated(&mas) != 10 + MAPLE_ALLOC_SLOTS - 1);
-	mas_destroy(&mas);
-
-	mas.node = MA_ERROR(-ENOMEM);
-	mas_node_count(&mas, MAPLE_ALLOC_SLOTS + 1); /* Request */
-	mas_nomem(&mas, GFP_KERNEL); /* Fill request */
-	MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS + 1);
-	mas.node = MA_ERROR(-ENOMEM);
-	mas_node_count(&mas, MAPLE_ALLOC_SLOTS * 2 + 2); /* Request */
-	mas_nomem(&mas, GFP_KERNEL); /* Fill request */
-	mas.status = ma_start;
-	MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS * 2 + 2);
-	mas_destroy(&mas);
-
-	mas.node = MA_ERROR(-ENOMEM);
-	mas_node_count(&mas, MAPLE_ALLOC_SLOTS * 2 + 1); /* Request */
-	mas_nomem(&mas, GFP_KERNEL); /* Fill request */
-	MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS * 2 + 1);
-	mas.node = MA_ERROR(-ENOMEM);
-	mas_node_count(&mas, MAPLE_ALLOC_SLOTS * 3 + 2); /* Request */
-	mas_nomem(&mas, GFP_KERNEL); /* Fill request */
-	mas.status = ma_start;
-	MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS * 3 + 2);
-	mas_destroy(&mas);
-
-	mtree_unlock(mt);
-}
-
 /*
  * Check erasing including RCU.
  */
@@ -35507,6 +35083,13 @@ static unsigned char get_vacant_height(struct ma_wr_state *wr_mas, void *entry)
 	return vacant_height;
 }
 
+static int mas_allocated(struct ma_state *mas)
+{
+	if (mas->sheaf)
+		return kmem_cache_sheaf_size(mas->sheaf);
+
+	return 0;
+}
 /* Preallocation testing */
 static noinline void __init check_prealloc(struct maple_tree *mt)
 {
@@ -35525,7 +35108,10 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
 
 	/* Spanning store */
 	mas_set_range(&mas, 470, 500);
-	MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
+
+	mas_wr_preallocate(&wr_mas, ptr);
+	MT_BUG_ON(mt, mas.store_type != wr_spanning_store);
+	MT_BUG_ON(mt, mas_is_err(&mas));
 	allocated = mas_allocated(&mas);
 	height = mas_mt_height(&mas);
 	vacant_height = get_vacant_height(&wr_mas, ptr);
@@ -35535,6 +35121,7 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
 	allocated = mas_allocated(&mas);
 	MT_BUG_ON(mt, allocated != 0);
 
+	mas_wr_preallocate(&wr_mas, ptr);
 	MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
 	allocated = mas_allocated(&mas);
 	height = mas_mt_height(&mas);
@@ -35575,20 +35162,6 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
 	mn->parent = ma_parent_ptr(mn);
 	ma_free_rcu(mn);
 
-	MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
-	allocated = mas_allocated(&mas);
-	height = mas_mt_height(&mas);
-	vacant_height = get_vacant_height(&wr_mas, ptr);
-	MT_BUG_ON(mt, allocated != 1 + (height - vacant_height) * 3);
-	mn = mas_pop_node(&mas);
-	MT_BUG_ON(mt, mas_allocated(&mas) != allocated - 1);
-	mas_push_node(&mas, mn);
-	MT_BUG_ON(mt, mas_allocated(&mas) != allocated);
-	MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
-	mas_destroy(&mas);
-	allocated = mas_allocated(&mas);
-	MT_BUG_ON(mt, allocated != 0);
-
 	MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
 	allocated = mas_allocated(&mas);
 	height = mas_mt_height(&mas);
@@ -36389,11 +35962,17 @@ static void check_nomem_writer_race(struct maple_tree *mt)
 	check_load(mt, 6, xa_mk_value(0xC));
 	mtree_unlock(mt);
 
+	mt_set_non_kernel(0);
 	/* test for the same race but with mas_store_gfp() */
 	mtree_store_range(mt, 0, 5, xa_mk_value(0xA), GFP_KERNEL);
 	mtree_store_range(mt, 6, 10, NULL, GFP_KERNEL);
 
 	mas_set_range(&mas, 0, 5);
+
+	/* setup writer 2 that will trigger the race condition */
+	mt_set_private(mt);
+	mt_set_callback(writer2);
+
 	mtree_lock(mt);
 	mas_store_gfp(&mas, NULL, GFP_KERNEL);
 
@@ -36508,10 +36087,6 @@ void farmer_tests(void)
 	check_erase_testset(&tree);
 	mtree_destroy(&tree);
 
-	mt_init_flags(&tree, 0);
-	check_new_node(&tree);
-	mtree_destroy(&tree);
-
 	if (!MAPLE_32BIT) {
 		mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
 		check_rcu_simulated(&tree);
diff --git a/tools/testing/shared/linux.c b/tools/testing/shared/linux.c
index 4ceff7969b78cf8e33cd1e021c68bc9f8a02a7a1..8c72571559583759456c2b469a2abc2611117c13 100644
--- a/tools/testing/shared/linux.c
+++ b/tools/testing/shared/linux.c
@@ -64,7 +64,8 @@ void *kmem_cache_alloc_lru(struct kmem_cache *cachep, struct list_lru *lru,
 
 	if (!(gfp & __GFP_DIRECT_RECLAIM)) {
 		if (!cachep->non_kernel) {
-			cachep->exec_callback = true;
+			if (cachep->callback)
+				cachep->exec_callback = true;
 			return NULL;
 		}
 
@@ -210,6 +211,8 @@ int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
 		for (i = 0; i < size; i++)
 			__kmem_cache_free_locked(cachep, p[i]);
 		pthread_mutex_unlock(&cachep->lock);
+		if (cachep->callback)
+			cachep->exec_callback = true;
 		return 0;
 	}
 

-- 
2.51.0



^ permalink raw reply	[flat|nested] 95+ messages in thread

* [PATCH v8 22/23] maple_tree: Add single node allocation support to maple state
  2025-09-10  8:01 [PATCH v8 00/23] SLUB percpu sheaves Vlastimil Babka
                   ` (20 preceding siblings ...)
  2025-09-10  8:01 ` [PATCH v8 21/23] maple_tree: Prefilled sheaf conversion and testing Vlastimil Babka
@ 2025-09-10  8:01 ` Vlastimil Babka
  2025-09-27  1:17   ` Suren Baghdasaryan
  2025-09-10  8:01 ` [PATCH v8 23/23] maple_tree: Convert forking to use the sheaf interface Vlastimil Babka
  2025-10-07  6:34 ` [PATCH v8 00/23] SLUB percpu sheaves Christoph Hellwig
  23 siblings, 1 reply; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-10  8:01 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, vbabka, Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

The fast path through a write requires replacing a single node in the
tree.  Using a sheaf (32 nodes) is too heavy for the fast path, so
special-case the node store operation by allocating just one node in the
maple state.
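
A condensed sketch of the resulting pop path (mirroring the mas_pop_node()
hunk below): the single preallocated node is consumed first, and the sheaf
remains the bulk path:

/* Sketch only; WARN/error handling elided. */
static struct maple_node *sketch_pop_node(struct ma_state *mas)
{
	struct maple_node *ret;

	if (mas->alloc) {		/* single-node fast path */
		ret = mas->alloc;
		mas->alloc = NULL;
	} else {			/* bulk path via the prefilled sheaf */
		ret = kmem_cache_alloc_from_sheaf(maple_node_cache,
						  GFP_NOWAIT, mas->sheaf);
	}

	memset(ret, 0, sizeof(*ret));
	return ret;
}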

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/maple_tree.h       |  4 +++-
 lib/maple_tree.c                 | 47 +++++++++++++++++++++++++++++++++++-----
 tools/testing/radix-tree/maple.c |  9 ++++++--
 3 files changed, 51 insertions(+), 9 deletions(-)

diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
index 166fd67e00d882b1e6de1f80c1b590bba7497cd3..562a1e9e5132b5b1fa8f8402a7cadd8abb65e323 100644
--- a/include/linux/maple_tree.h
+++ b/include/linux/maple_tree.h
@@ -443,6 +443,7 @@ struct ma_state {
 	unsigned long min;		/* The minimum index of this node - implied pivot min */
 	unsigned long max;		/* The maximum index of this node - implied pivot max */
 	struct slab_sheaf *sheaf;	/* Allocated nodes for this operation */
+	struct maple_node *alloc;	/* allocated nodes */
 	unsigned long node_request;
 	enum maple_status status;	/* The status of the state (active, start, none, etc) */
 	unsigned char depth;		/* depth of tree descent during write */
@@ -491,8 +492,9 @@ struct ma_wr_state {
 		.status = ma_start,					\
 		.min = 0,						\
 		.max = ULONG_MAX,					\
-		.node_request= 0,					\
 		.sheaf = NULL,						\
+		.alloc = NULL,						\
+		.node_request= 0,					\
 		.mas_flags = 0,						\
 		.store_type = wr_invalid,				\
 	}
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index a3fcb20227e506ed209554cc8c041a53f7ef4903..a912e6a1d4378e72b967027b60f8f564476ad14e 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -1073,16 +1073,23 @@ static int mas_ascend(struct ma_state *mas)
  *
  * Return: A pointer to a maple node.
  */
-static inline struct maple_node *mas_pop_node(struct ma_state *mas)
+static __always_inline struct maple_node *mas_pop_node(struct ma_state *mas)
 {
 	struct maple_node *ret;
 
+	if (mas->alloc) {
+		ret = mas->alloc;
+		mas->alloc = NULL;
+		goto out;
+	}
+
 	if (WARN_ON_ONCE(!mas->sheaf))
 		return NULL;
 
 	ret = kmem_cache_alloc_from_sheaf(maple_node_cache, GFP_NOWAIT, mas->sheaf);
-	memset(ret, 0, sizeof(*ret));
 
+out:
+	memset(ret, 0, sizeof(*ret));
 	return ret;
 }
 
@@ -1093,9 +1100,34 @@ static inline struct maple_node *mas_pop_node(struct ma_state *mas)
  */
 static inline void mas_alloc_nodes(struct ma_state *mas, gfp_t gfp)
 {
-	if (unlikely(mas->sheaf)) {
-		unsigned long refill = mas->node_request;
+	if (!mas->node_request)
+		return;
+
+	if (mas->node_request == 1) {
+		if (mas->sheaf)
+			goto use_sheaf;
+
+		if (mas->alloc)
+			return;
 
+		mas->alloc = mt_alloc_one(gfp);
+		if (!mas->alloc)
+			goto error;
+
+		mas->node_request = 0;
+		return;
+	}
+
+use_sheaf:
+	if (unlikely(mas->alloc)) {
+		kfree(mas->alloc);
+		mas->alloc = NULL;
+	}
+
+	if (mas->sheaf) {
+		unsigned long refill;
+
+		refill = mas->node_request;
 		if(kmem_cache_sheaf_size(mas->sheaf) >= refill) {
 			mas->node_request = 0;
 			return;
@@ -5180,8 +5212,11 @@ void mas_destroy(struct ma_state *mas)
 	mas->node_request = 0;
 	if (mas->sheaf)
 		mt_return_sheaf(mas->sheaf);
-
 	mas->sheaf = NULL;
+
+	if (mas->alloc)
+		kfree(mas->alloc);
+	mas->alloc = NULL;
 }
 EXPORT_SYMBOL_GPL(mas_destroy);
 
@@ -5816,7 +5851,7 @@ bool mas_nomem(struct ma_state *mas, gfp_t gfp)
 		mas_alloc_nodes(mas, gfp);
 	}
 
-	if (!mas->sheaf)
+	if (!mas->sheaf && !mas->alloc)
 		return false;
 
 	mas->status = ma_start;
diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
index 72a8fe8e832a4150c6567b711768eba6a3fa6768..83260f2efb1990b71093e456950069c24d75560e 100644
--- a/tools/testing/radix-tree/maple.c
+++ b/tools/testing/radix-tree/maple.c
@@ -35085,10 +35085,15 @@ static unsigned char get_vacant_height(struct ma_wr_state *wr_mas, void *entry)
 
 static int mas_allocated(struct ma_state *mas)
 {
+	int total = 0;
+
+	if (mas->alloc)
+		total++;
+
 	if (mas->sheaf)
-		return kmem_cache_sheaf_size(mas->sheaf);
+		total += kmem_cache_sheaf_size(mas->sheaf);
 
-	return 0;
+	return total;
 }
 /* Preallocation testing */
 static noinline void __init check_prealloc(struct maple_tree *mt)

-- 
2.51.0



^ permalink raw reply	[flat|nested] 95+ messages in thread

* [PATCH v8 23/23] maple_tree: Convert forking to use the sheaf interface
  2025-09-10  8:01 [PATCH v8 00/23] SLUB percpu sheaves Vlastimil Babka
                   ` (21 preceding siblings ...)
  2025-09-10  8:01 ` [PATCH v8 22/23] maple_tree: Add single node allocation support to maple state Vlastimil Babka
@ 2025-09-10  8:01 ` Vlastimil Babka
  2025-10-07  6:34 ` [PATCH v8 00/23] SLUB percpu sheaves Christoph Hellwig
  23 siblings, 0 replies; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-10  8:01 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, vbabka, Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Use the generic interface, which should result in fewer bulk allocations
during forking.

Part of this is abstracting the freeing of the sheaf or maple state
allocations into its own function, so that mas_destroy() and the tree
duplication code can share the same functionality to return any unused
resources.
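
A minimal sketch of that shared cleanup (matching the mas_empty_nodes()
helper introduced below), returning both the sheaf and any single
preallocated node:

/* Sketch of the common teardown used by mas_destroy() and the dup path. */
static void sketch_empty_nodes(struct ma_state *mas)
{
	mas->node_request = 0;

	if (mas->sheaf) {
		kmem_cache_return_sheaf(maple_node_cache, GFP_NOWAIT,
					mas->sheaf);
		mas->sheaf = NULL;
	}

	if (mas->alloc) {
		kfree(mas->alloc);
		mas->alloc = NULL;
	}
}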

[andriy.shevchenko@linux.intel.com: remove unused mt_alloc_bulk()]
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 lib/maple_tree.c | 47 +++++++++++++++++++++++------------------------
 1 file changed, 23 insertions(+), 24 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index a912e6a1d4378e72b967027b60f8f564476ad14e..bb51424053a5c4ceece7604877dfa3cd3780944a 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -172,11 +172,6 @@ static inline struct maple_node *mt_alloc_one(gfp_t gfp)
 	return kmem_cache_alloc(maple_node_cache, gfp);
 }
 
-static inline int mt_alloc_bulk(gfp_t gfp, size_t size, void **nodes)
-{
-	return kmem_cache_alloc_bulk(maple_node_cache, gfp, size, nodes);
-}
-
 static inline void mt_free_bulk(size_t size, void __rcu **nodes)
 {
 	kmem_cache_free_bulk(maple_node_cache, size, (void **)nodes);
@@ -1150,6 +1145,19 @@ static inline void mas_alloc_nodes(struct ma_state *mas, gfp_t gfp)
 	mas_set_err(mas, -ENOMEM);
 }
 
+static inline void mas_empty_nodes(struct ma_state *mas)
+{
+	mas->node_request = 0;
+	if (mas->sheaf) {
+		mt_return_sheaf(mas->sheaf);
+		mas->sheaf = NULL;
+	}
+
+	if (mas->alloc) {
+		kfree(mas->alloc);
+		mas->alloc = NULL;
+	}
+}
 
 /*
  * mas_free() - Free an encoded maple node
@@ -5208,15 +5216,7 @@ EXPORT_SYMBOL_GPL(mas_preallocate);
 void mas_destroy(struct ma_state *mas)
 {
 	mas->mas_flags &= ~MA_STATE_PREALLOC;
-
-	mas->node_request = 0;
-	if (mas->sheaf)
-		mt_return_sheaf(mas->sheaf);
-	mas->sheaf = NULL;
-
-	if (mas->alloc)
-		kfree(mas->alloc);
-	mas->alloc = NULL;
+	mas_empty_nodes(mas);
 }
 EXPORT_SYMBOL_GPL(mas_destroy);
 
@@ -6241,7 +6241,7 @@ static inline void mas_dup_alloc(struct ma_state *mas, struct ma_state *new_mas,
 	struct maple_node *node = mte_to_node(mas->node);
 	struct maple_node *new_node = mte_to_node(new_mas->node);
 	enum maple_type type;
-	unsigned char request, count, i;
+	unsigned char count, i;
 	void __rcu **slots;
 	void __rcu **new_slots;
 	unsigned long val;
@@ -6249,20 +6249,17 @@ static inline void mas_dup_alloc(struct ma_state *mas, struct ma_state *new_mas,
 	/* Allocate memory for child nodes. */
 	type = mte_node_type(mas->node);
 	new_slots = ma_slots(new_node, type);
-	request = mas_data_end(mas) + 1;
-	count = mt_alloc_bulk(gfp, request, (void **)new_slots);
-	if (unlikely(count < request)) {
-		memset(new_slots, 0, request * sizeof(void *));
-		mas_set_err(mas, -ENOMEM);
+	count = mas->node_request = mas_data_end(mas) + 1;
+	mas_alloc_nodes(mas, gfp);
+	if (unlikely(mas_is_err(mas)))
 		return;
-	}
 
-	/* Restore node type information in slots. */
 	slots = ma_slots(node, type);
 	for (i = 0; i < count; i++) {
 		val = (unsigned long)mt_slot_locked(mas->tree, slots, i);
 		val &= MAPLE_NODE_MASK;
-		((unsigned long *)new_slots)[i] |= val;
+		new_slots[i] = ma_mnode_ptr((unsigned long)mas_pop_node(mas) |
+					    val);
 	}
 }
 
@@ -6316,7 +6313,7 @@ static inline void mas_dup_build(struct ma_state *mas, struct ma_state *new_mas,
 			/* Only allocate child nodes for non-leaf nodes. */
 			mas_dup_alloc(mas, new_mas, gfp);
 			if (unlikely(mas_is_err(mas)))
-				return;
+				goto empty_mas;
 		} else {
 			/*
 			 * This is the last leaf node and duplication is
@@ -6349,6 +6346,8 @@ static inline void mas_dup_build(struct ma_state *mas, struct ma_state *new_mas,
 	/* Make them the same height */
 	new_mas->tree->ma_flags = mas->tree->ma_flags;
 	rcu_assign_pointer(new_mas->tree->ma_root, root);
+empty_mas:
+	mas_empty_nodes(mas);
 }
 
 /**

-- 
2.51.0



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-09-10  8:01 ` [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations Vlastimil Babka
@ 2025-09-12  0:38   ` Sergey Senozhatsky
  2025-09-12  7:03     ` Vlastimil Babka
  2025-09-17  8:30   ` Harry Yoo
  2025-10-31 21:32   ` Daniel Gomez
  2 siblings, 1 reply; 95+ messages in thread
From: Sergey Senozhatsky @ 2025-09-12  0:38 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki,
	Sidhartha Kumar, linux-mm, linux-kernel, rcu, maple-tree

Hi Vlastimil,

On (25/09/10 10:01), Vlastimil Babka wrote:
[..]
> +
> +	if (rcu_free)
> +		call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
> +}
> +
> +
> +/* needed for kvfree_rcu_barrier() */
> +void flush_all_rcu_sheaves()
> +{

mm/slub.c:3960:27: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
 3960 | void flush_all_rcu_sheaves()
      |                           ^
      |                            void

---

diff --git a/mm/slub.c b/mm/slub.c
index 11ad4173e2f2..a1eae71a0f8c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3955,9 +3955,8 @@ static void flush_rcu_sheaf(struct work_struct *w)
 		call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
 }
 
-
 /* needed for kvfree_rcu_barrier() */
-void flush_all_rcu_sheaves()
+void flush_all_rcu_sheaves(void)
 {
 	struct slub_percpu_sheaves *pcs;
 	struct slub_flush_work *sfw;


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 15/23] maple_tree: use percpu sheaves for maple_node_cache
  2025-09-10  8:01 ` [PATCH v8 15/23] maple_tree: use percpu sheaves for maple_node_cache Vlastimil Babka
@ 2025-09-12  2:20   ` Liam R. Howlett
  2025-10-16 15:16   ` D, Suneeth
  1 sibling, 0 replies; 95+ messages in thread
From: Liam R. Howlett @ 2025-09-12  2:20 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree

* Vlastimil Babka <vbabka@suse.cz> [250910 04:02]:
> Setup the maple_node_cache with percpu sheaves of size 32 to hopefully
> improve its performance. Note this will not immediately take advantage
> of sheaf batching of kfree_rcu() operations due to the maple tree using
> call_rcu with custom callbacks. The followup changes to maple tree will
> change that and also make use of the prefilled sheaves functionality.
> 
> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>

> ---
>  lib/maple_tree.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index 4f0e30b57b0cef9e5cf791f3f64f5898752db402..d034f170ac897341b40cfd050b6aee86b6d2cf60 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -6040,9 +6040,14 @@ bool mas_nomem(struct ma_state *mas, gfp_t gfp)
>  
>  void __init maple_tree_init(void)
>  {
> +	struct kmem_cache_args args = {
> +		.align  = sizeof(struct maple_node),
> +		.sheaf_capacity = 32,
> +	};
> +
>  	maple_node_cache = kmem_cache_create("maple_node",
> -			sizeof(struct maple_node), sizeof(struct maple_node),
> -			SLAB_PANIC, NULL);
> +			sizeof(struct maple_node), &args,
> +			SLAB_PANIC);
>  }
>  
>  /**
> 
> -- 
> 2.51.0
> 
> 
> -- 
> maple-tree mailing list
> maple-tree@lists.infradead.org
> https://lists.infradead.org/mailman/listinfo/maple-tree


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-09-12  0:38   ` Sergey Senozhatsky
@ 2025-09-12  7:03     ` Vlastimil Babka
  0 siblings, 0 replies; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-12  7:03 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki,
	Sidhartha Kumar, linux-mm, linux-kernel, rcu, maple-tree

On 9/12/25 02:38, Sergey Senozhatsky wrote:
> Hi Vlastimil,
> 
> On (25/09/10 10:01), Vlastimil Babka wrote:
> [..]
>> +
>> +	if (rcu_free)
>> +		call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
>> +}
>> +
>> +
>> +/* needed for kvfree_rcu_barrier() */
>> +void flush_all_rcu_sheaves()
>> +{
> 
> mm/slub.c:3960:27: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
>  3960 | void flush_all_rcu_sheaves()
>       |                           ^
>       |                            void
> 
> ---

Thanks, the bots told me too and it's fixed in -next

> diff --git a/mm/slub.c b/mm/slub.c
> index 11ad4173e2f2..a1eae71a0f8c 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3955,9 +3955,8 @@ static void flush_rcu_sheaf(struct work_struct *w)
>  		call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
>  }
>  
> -
>  /* needed for kvfree_rcu_barrier() */
> -void flush_all_rcu_sheaves()
> +void flush_all_rcu_sheaves(void)
>  {
>  	struct slub_percpu_sheaves *pcs;
>  	struct slub_flush_work *sfw;



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-09-10  8:01 ` [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations Vlastimil Babka
  2025-09-12  0:38   ` Sergey Senozhatsky
@ 2025-09-17  8:30   ` Harry Yoo
  2025-09-17  9:55     ` Vlastimil Babka
  2025-10-31 21:32   ` Daniel Gomez
  2 siblings, 1 reply; 95+ messages in thread
From: Harry Yoo @ 2025-09-17  8:30 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki,
	Sidhartha Kumar, linux-mm, linux-kernel, rcu, maple-tree

On Wed, Sep 10, 2025 at 10:01:06AM +0200, Vlastimil Babka wrote:
> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
> For caches with sheaves, on each cpu maintain a rcu_free sheaf in
> addition to main and spare sheaves.
> 
> kfree_rcu() operations will try to put objects on this sheaf. Once full,
> the sheaf is detached and submitted to call_rcu() with a handler that
> will try to put it in the barn, or flush to slab pages using bulk free,
> when the barn is full. Then a new empty sheaf must be obtained to put
> more objects there.
> 
> It's possible that no free sheaves are available to use for a new
> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
> kfree_rcu() implementation.
> 
> Expected advantages:
> - batching the kfree_rcu() operations, that could eventually replace the
>   existing batching
> - sheaves can be reused for allocations via barn instead of being
>   flushed to slabs, which is more efficient
>   - this includes cases where only some cpus are allowed to process rcu
>     callbacks (Android)
> 
> Possible disadvantage:
> - objects might be waiting for more than their grace period (it is
>   determined by the last object freed into the sheaf), increasing memory
>   usage - but the existing batching does that too.
> 
> Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
> implementation favors smaller memory footprint over performance.
> 
> Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
> contexts where kfree_rcu() is called might not be compatible with taking
> a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
> spinlock - the current kfree_rcu() implementation avoids doing that.
> 
> Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
> that have them. This is not a cheap operation, but the barrier usage is
> rare - currently kmem_cache_destroy() or on module unload.
> 
> Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
> count how many kfree_rcu() used the rcu_free sheaf successfully and how
> many had to fall back to the existing implementation.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/slab.h        |   3 +
>  mm/slab_common.c |  26 ++++++
>  mm/slub.c        | 266 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 293 insertions(+), 2 deletions(-)
> 
> @@ -3840,6 +3895,80 @@ static void flush_all(struct kmem_cache *s)
>  	cpus_read_unlock();
>  }
>  
> +/* needed for kvfree_rcu_barrier() */
> +void flush_all_rcu_sheaves()
> +{
> +	struct slub_percpu_sheaves *pcs;
> +	struct slub_flush_work *sfw;
> +	struct kmem_cache *s;
> +	bool flushed = false;
> +	unsigned int cpu;
> +
> +	cpus_read_lock();
> +	mutex_lock(&slab_mutex);
> +
> +	list_for_each_entry(s, &slab_caches, list) {
> +		if (!s->cpu_sheaves)
> +			continue;
> +
> +		mutex_lock(&flush_lock);
> +
> +		for_each_online_cpu(cpu) {
> +			sfw = &per_cpu(slub_flush, cpu);
> +			pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> +
> +			if (!pcs->rcu_free || !pcs->rcu_free->size) {

Is the compiler allowed to compile this to read pcs->rcu_free twice?
Something like:

flush_all_rcu_sheaves()			__kfree_rcu_sheaf()

pcs->rcu_free != NULL
					pcs->rcu_free = NULL
pcs->rcu_free == NULL
/* NULL-pointer-deref */
pcs->rcu_free->size

> +				sfw->skip = true;
> +				continue;
> +			}
>
> +			INIT_WORK(&sfw->work, flush_rcu_sheaf);
> +			sfw->skip = false;
> +			sfw->s = s;
> +			queue_work_on(cpu, flushwq, &sfw->work);
> +			flushed = true;
> +		}
> +
> +		for_each_online_cpu(cpu) {
> +			sfw = &per_cpu(slub_flush, cpu);
> +			if (sfw->skip)
> +				continue;
> +			flush_work(&sfw->work);
> +		}
> +
> +		mutex_unlock(&flush_lock);
> +	}
> +
> +	mutex_unlock(&slab_mutex);
> +	cpus_read_unlock();
> +
> +	if (flushed)
> +		rcu_barrier();

I think we need to call rcu_barrier() even if flushed == false?

Maybe a kvfree_rcu()'d object was already waiting for the rcu callback to
be processed before flush_all_rcu_sheaves() is called, and
in flush_all_rcu_sheaves() we skipped all (cache, cpu) pairs,
so flushed == false but the rcu callback isn't processed yet
by the end of the function?

That sounds very unlikely to happen in a realistic scenario,
but it's still possible...

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-09-17  8:30   ` Harry Yoo
@ 2025-09-17  9:55     ` Vlastimil Babka
  2025-09-17 11:32       ` Harry Yoo
  2025-09-17 11:36       ` Paul E. McKenney
  0 siblings, 2 replies; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-17  9:55 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki,
	Sidhartha Kumar, linux-mm, linux-kernel, rcu, maple-tree

On 9/17/25 10:30, Harry Yoo wrote:
> On Wed, Sep 10, 2025 at 10:01:06AM +0200, Vlastimil Babka wrote:
>> +/* needed for kvfree_rcu_barrier() */
>> +void flush_all_rcu_sheaves()
>> +{
>> +	struct slub_percpu_sheaves *pcs;
>> +	struct slub_flush_work *sfw;
>> +	struct kmem_cache *s;
>> +	bool flushed = false;
>> +	unsigned int cpu;
>> +
>> +	cpus_read_lock();
>> +	mutex_lock(&slab_mutex);
>> +
>> +	list_for_each_entry(s, &slab_caches, list) {
>> +		if (!s->cpu_sheaves)
>> +			continue;
>> +
>> +		mutex_lock(&flush_lock);
>> +
>> +		for_each_online_cpu(cpu) {
>> +			sfw = &per_cpu(slub_flush, cpu);
>> +			pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
>> +
>> +			if (!pcs->rcu_free || !pcs->rcu_free->size) {
> 
> Is the compiler allowed to compile this to read pcs->rcu_free twice?
> Something like:
> 
> flush_all_rcu_sheaves()			__kfree_rcu_sheaf()
> 
> pcs->rcu_free != NULL
> 					pcs->rcu_free = NULL
> pcs->rcu_free == NULL
> /* NULL-pointer-deref */
> pcs->rcu_free->size

Good point, I'll remove the size check so that simply a non-NULL pcs->rcu_free
means we flush.
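
Something like this in the per-cpu loop, i.e. not dereferencing the sheaf
at all for the skip decision (untested sketch; the updated patch later in
this thread goes further, drops the check completely and queues the flush
work unconditionally):

	/* read the pointer once; a racing __kfree_rcu_sheaf() may clear it */
	if (!READ_ONCE(pcs->rcu_free)) {
		sfw->skip = true;
		continue;
	}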

>> +				sfw->skip = true;
>> +				continue;
>> +			}
>>
>> +			INIT_WORK(&sfw->work, flush_rcu_sheaf);
>> +			sfw->skip = false;
>> +			sfw->s = s;
>> +			queue_work_on(cpu, flushwq, &sfw->work);
>> +			flushed = true;
>> +		}
>> +
>> +		for_each_online_cpu(cpu) {
>> +			sfw = &per_cpu(slub_flush, cpu);
>> +			if (sfw->skip)
>> +				continue;
>> +			flush_work(&sfw->work);
>> +		}
>> +
>> +		mutex_unlock(&flush_lock);
>> +	}
>> +
>> +	mutex_unlock(&slab_mutex);
>> +	cpus_read_unlock();
>> +
>> +	if (flushed)
>> +		rcu_barrier();
> 
> I think we need to call rcu_barrier() even if flushed == false?
> 
> Maybe a kvfree_rcu()'d object was already waiting for the rcu callback to
> be processed before flush_all_rcu_sheaves() is called, and
> in flush_all_rcu_sheaves() we skipped all (cache, cpu) pairs,
> so flushed == false but the rcu callback isn't processed yet
> by the end of the function?
> 
> That sounds like a very unlikely to happen in a realistic scenario,
> but still possible...

Yes also good point, will flush unconditionally.

Maybe in __kfree_rcu_sheaf() I should also move the call_rcu(...) before
local_unlock(). So we don't end up seeing a NULL pcs->rcu_free in
flush_all_rcu_sheaves() because __kfree_rcu_sheaf() already set it to NULL,
but didn't yet do the call_rcu() as it got preempted after local_unlock().

But then rcu_barrier() itself probably won't guarantee that such cpus have
finished the local_lock'd section if we didn't queue work on them. So maybe
we need synchronize_rcu()?



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-09-17  9:55     ` Vlastimil Babka
@ 2025-09-17 11:32       ` Harry Yoo
  2025-09-17 12:05         ` Vlastimil Babka
  2025-09-17 11:36       ` Paul E. McKenney
  1 sibling, 1 reply; 95+ messages in thread
From: Harry Yoo @ 2025-09-17 11:32 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki,
	Sidhartha Kumar, linux-mm, linux-kernel, rcu, maple-tree

On Wed, Sep 17, 2025 at 11:55:10AM +0200, Vlastimil Babka wrote:
> On 9/17/25 10:30, Harry Yoo wrote:
> > On Wed, Sep 10, 2025 at 10:01:06AM +0200, Vlastimil Babka wrote:
> >> +				sfw->skip = true;
> >> +				continue;
> >> +			}
> >>
> >> +			INIT_WORK(&sfw->work, flush_rcu_sheaf);
> >> +			sfw->skip = false;
> >> +			sfw->s = s;
> >> +			queue_work_on(cpu, flushwq, &sfw->work);
> >> +			flushed = true;
> >> +		}
> >> +
> >> +		for_each_online_cpu(cpu) {
> >> +			sfw = &per_cpu(slub_flush, cpu);
> >> +			if (sfw->skip)
> >> +				continue;
> >> +			flush_work(&sfw->work);
> >> +		}
> >> +
> >> +		mutex_unlock(&flush_lock);
> >> +	}
> >> +
> >> +	mutex_unlock(&slab_mutex);
> >> +	cpus_read_unlock();
> >> +
> >> +	if (flushed)
> >> +		rcu_barrier();
> > 
> > I think we need to call rcu_barrier() even if flushed == false?
> > 
> > Maybe a kvfree_rcu()'d object was already waiting for the rcu callback to
> > be processed before flush_all_rcu_sheaves() is called, and
> > in flush_all_rcu_sheaves() we skipped all (cache, cpu) pairs,
> > so flushed == false but the rcu callback isn't processed yet
> > by the end of the function?
> > 
> > That sounds like a very unlikely to happen in a realistic scenario,
> > but still possible...
> 
> Yes also good point, will flush unconditionally.
> 
> Maybe in __kfree_rcu_sheaf() I should also move the call_rcu(...) before
> local_unlock(). So we don't end up seeing a NULL pcs->rcu_free in
> flush_all_rcu_sheaves() because __kfree_rcu_sheaf() already set it to NULL,
> but didn't yet do the call_rcu() as it got preempted after local_unlock().

Makes sense to me.

> But then rcu_barrier() itself probably won't mean we make sure such cpus
> finished the local_locked section, if we didn't queue work on them. So maybe
> we need synchronize_rcu()?

Ah, it works because a preemption-disabled section acts as an RCU
read-side critical section?

But then are we allowed to release the local_lock to allocate empty
sheaves in __kfree_rcu_sheaf()?

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-09-17  9:55     ` Vlastimil Babka
  2025-09-17 11:32       ` Harry Yoo
@ 2025-09-17 11:36       ` Paul E. McKenney
  2025-09-17 12:13         ` Vlastimil Babka
  1 sibling, 1 reply; 95+ messages in thread
From: Paul E. McKenney @ 2025-09-17 11:36 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Harry Yoo, Suren Baghdasaryan, Liam R. Howlett,
	Christoph Lameter, David Rientjes, Roman Gushchin,
	Uladzislau Rezki, Sidhartha Kumar, linux-mm, linux-kernel, rcu,
	maple-tree

On Wed, Sep 17, 2025 at 11:55:10AM +0200, Vlastimil Babka wrote:
> On 9/17/25 10:30, Harry Yoo wrote:
> > On Wed, Sep 10, 2025 at 10:01:06AM +0200, Vlastimil Babka wrote:
> >> +/* needed for kvfree_rcu_barrier() */
> >> +void flush_all_rcu_sheaves()
> >> +{
> >> +	struct slub_percpu_sheaves *pcs;
> >> +	struct slub_flush_work *sfw;
> >> +	struct kmem_cache *s;
> >> +	bool flushed = false;
> >> +	unsigned int cpu;
> >> +
> >> +	cpus_read_lock();
> >> +	mutex_lock(&slab_mutex);
> >> +
> >> +	list_for_each_entry(s, &slab_caches, list) {
> >> +		if (!s->cpu_sheaves)
> >> +			continue;
> >> +
> >> +		mutex_lock(&flush_lock);
> >> +
> >> +		for_each_online_cpu(cpu) {
> >> +			sfw = &per_cpu(slub_flush, cpu);
> >> +			pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> >> +
> >> +			if (!pcs->rcu_free || !pcs->rcu_free->size) {
> > 
> > Is the compiler allowed to compile this to read pcs->rcu_free twice?
> > Something like:
> > 
> > flush_all_rcu_sheaves()			__kfree_rcu_sheaf()
> > 
> > pcs->rcu_free != NULL
> > 					pcs->rcu_free = NULL
> > pcs->rcu_free == NULL
> > /* NULL-pointer-deref */
> > pcs->rcu_free->size
> 
> Good point, I'll remove the size check and simply pcs->rcu_free non-null
> means we flush.
> 
> >> +				sfw->skip = true;
> >> +				continue;
> >> +			}
> >>
> >> +			INIT_WORK(&sfw->work, flush_rcu_sheaf);
> >> +			sfw->skip = false;
> >> +			sfw->s = s;
> >> +			queue_work_on(cpu, flushwq, &sfw->work);
> >> +			flushed = true;
> >> +		}
> >> +
> >> +		for_each_online_cpu(cpu) {
> >> +			sfw = &per_cpu(slub_flush, cpu);
> >> +			if (sfw->skip)
> >> +				continue;
> >> +			flush_work(&sfw->work);
> >> +		}
> >> +
> >> +		mutex_unlock(&flush_lock);
> >> +	}
> >> +
> >> +	mutex_unlock(&slab_mutex);
> >> +	cpus_read_unlock();
> >> +
> >> +	if (flushed)
> >> +		rcu_barrier();
> > 
> > I think we need to call rcu_barrier() even if flushed == false?
> > 
> > Maybe a kvfree_rcu()'d object was already waiting for the rcu callback to
> > be processed before flush_all_rcu_sheaves() is called, and
> > in flush_all_rcu_sheaves() we skipped all (cache, cpu) pairs,
> > so flushed == false but the rcu callback isn't processed yet
> > by the end of the function?
> > 
> > That sounds like a very unlikely to happen in a realistic scenario,
> > but still possible...
> 
> Yes also good point, will flush unconditionally.
> 
> Maybe in __kfree_rcu_sheaf() I should also move the call_rcu(...) before
> local_unlock(). So we don't end up seeing a NULL pcs->rcu_free in
> flush_all_rcu_sheaves() because __kfree_rcu_sheaf() already set it to NULL,
> but didn't yet do the call_rcu() as it got preempted after local_unlock().
> 
> But then rcu_barrier() itself probably won't mean we make sure such cpus
> finished the local_locked section, if we didn't queue work on them. So maybe
> we need synchronize_rcu()?

Do you need both rcu_barrier() and synchronize_rcu(), maybe along with
kvfree_rcu_barrier() as well?  It would not be hard to make such a thing,
using workqueues or some such.  Not sure what the API should look like,
especially should people want other RCU flavors to get into the act
as well.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 18/23] maple_tree: Use kfree_rcu in ma_free_rcu
  2025-09-10  8:01 ` [PATCH v8 18/23] maple_tree: Use kfree_rcu in ma_free_rcu Vlastimil Babka
@ 2025-09-17 11:46   ` Harry Yoo
  2025-09-27  0:05     ` Suren Baghdasaryan
  0 siblings, 1 reply; 95+ messages in thread
From: Harry Yoo @ 2025-09-17 11:46 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki,
	Sidhartha Kumar, linux-mm, linux-kernel, rcu, maple-tree,
	Pedro Falcato

On Wed, Sep 10, 2025 at 10:01:20AM +0200, Vlastimil Babka wrote:
> From: Pedro Falcato <pfalcato@suse.de>
> 
> kfree_rcu is an optimized version of call_rcu + kfree. It used to not be
> possible to call it on non-kmalloc objects, but this restriction was
> lifted ever since SLOB was dropped from the kernel, and since commit
> 6c6c47b063b5 ("mm, slab: call kvfree_rcu_barrier() from kmem_cache_destroy()").
> 
> Thus, replace call_rcu + mt_free_rcu with kfree_rcu.
> 
> Signed-off-by: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---

Looks good to me,
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-09-17 11:32       ` Harry Yoo
@ 2025-09-17 12:05         ` Vlastimil Babka
  2025-09-17 13:07           ` Harry Yoo
  0 siblings, 1 reply; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-17 12:05 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki,
	Sidhartha Kumar, linux-mm, linux-kernel, rcu, maple-tree,
	Paul E . McKenney

On 9/17/25 13:32, Harry Yoo wrote:
> On Wed, Sep 17, 2025 at 11:55:10AM +0200, Vlastimil Babka wrote:
>> On 9/17/25 10:30, Harry Yoo wrote:
>> > On Wed, Sep 10, 2025 at 10:01:06AM +0200, Vlastimil Babka wrote:
>> >> +				sfw->skip = true;
>> >> +				continue;
>> >> +			}
>> >>
>> >> +			INIT_WORK(&sfw->work, flush_rcu_sheaf);
>> >> +			sfw->skip = false;
>> >> +			sfw->s = s;
>> >> +			queue_work_on(cpu, flushwq, &sfw->work);
>> >> +			flushed = true;
>> >> +		}
>> >> +
>> >> +		for_each_online_cpu(cpu) {
>> >> +			sfw = &per_cpu(slub_flush, cpu);
>> >> +			if (sfw->skip)
>> >> +				continue;
>> >> +			flush_work(&sfw->work);
>> >> +		}
>> >> +
>> >> +		mutex_unlock(&flush_lock);
>> >> +	}
>> >> +
>> >> +	mutex_unlock(&slab_mutex);
>> >> +	cpus_read_unlock();
>> >> +
>> >> +	if (flushed)
>> >> +		rcu_barrier();
>> > 
>> > I think we need to call rcu_barrier() even if flushed == false?
>> > 
>> > Maybe a kvfree_rcu()'d object was already waiting for the rcu callback to
>> > be processed before flush_all_rcu_sheaves() is called, and
>> > in flush_all_rcu_sheaves() we skipped all (cache, cpu) pairs,
>> > so flushed == false but the rcu callback isn't processed yet
>> > by the end of the function?
>> > 
>> > That sounds like a very unlikely to happen in a realistic scenario,
>> > but still possible...
>> 
>> Yes also good point, will flush unconditionally.
>> 
>> Maybe in __kfree_rcu_sheaf() I should also move the call_rcu(...) before
>> local_unlock(). So we don't end up seeing a NULL pcs->rcu_free in
>> flush_all_rcu_sheaves() because __kfree_rcu_sheaf() already set it to NULL,
>> but didn't yet do the call_rcu() as it got preempted after local_unlock().
> 
> Makes sense to me.
> 
>> But then rcu_barrier() itself probably won't mean we make sure such cpus
>> finished the local_locked section, if we didn't queue work on them. So maybe
>> we need synchronize_rcu()?
> 
> Ah, it works because preemption disabled section works as a RCU
> read-side critical section?

AFAIK yes? Or maybe not on RT where local_lock is taking a mutex? So we
should denote the RCU critical section explicitly too?

> But then are we allowed to do release the local_lock to allocate empty
> sheaves in __kfree_rcu_sheaf()?

I think so, as we do that when we find no rcu_sheaf in the first place.


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-09-17 11:36       ` Paul E. McKenney
@ 2025-09-17 12:13         ` Vlastimil Babka
  0 siblings, 0 replies; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-17 12:13 UTC (permalink / raw)
  To: paulmck
  Cc: Harry Yoo, Suren Baghdasaryan, Liam R. Howlett,
	Christoph Lameter, David Rientjes, Roman Gushchin,
	Uladzislau Rezki, Sidhartha Kumar, linux-mm, linux-kernel, rcu,
	maple-tree

On 9/17/25 13:36, Paul E. McKenney wrote:
> On Wed, Sep 17, 2025 at 11:55:10AM +0200, Vlastimil Babka wrote:
>> On 9/17/25 10:30, Harry Yoo wrote:
>> > On Wed, Sep 10, 2025 at 10:01:06AM +0200, Vlastimil Babka wrote:
>> >> +/* needed for kvfree_rcu_barrier() */
>> >> +void flush_all_rcu_sheaves()
>> >> +{
>> >> +	struct slub_percpu_sheaves *pcs;
>> >> +	struct slub_flush_work *sfw;
>> >> +	struct kmem_cache *s;
>> >> +	bool flushed = false;
>> >> +	unsigned int cpu;
>> >> +
>> >> +	cpus_read_lock();
>> >> +	mutex_lock(&slab_mutex);
>> >> +
>> >> +	list_for_each_entry(s, &slab_caches, list) {
>> >> +		if (!s->cpu_sheaves)
>> >> +			continue;
>> >> +
>> >> +		mutex_lock(&flush_lock);
>> >> +
>> >> +		for_each_online_cpu(cpu) {
>> >> +			sfw = &per_cpu(slub_flush, cpu);
>> >> +			pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
>> >> +
>> >> +			if (!pcs->rcu_free || !pcs->rcu_free->size) {
>> > 
>> > Is the compiler allowed to compile this to read pcs->rcu_free twice?
>> > Something like:
>> > 
>> > flush_all_rcu_sheaves()			__kfree_rcu_sheaf()
>> > 
>> > pcs->rcu_free != NULL
>> > 					pcs->rcu_free = NULL
>> > pcs->rcu_free == NULL
>> > /* NULL-pointer-deref */
>> > pcs->rcu_free->size
>> 
>> Good point, I'll remove the size check and simply pcs->rcu_free non-null
>> means we flush.
>> 
>> >> +				sfw->skip = true;
>> >> +				continue;
>> >> +			}
>> >>
>> >> +			INIT_WORK(&sfw->work, flush_rcu_sheaf);
>> >> +			sfw->skip = false;
>> >> +			sfw->s = s;
>> >> +			queue_work_on(cpu, flushwq, &sfw->work);
>> >> +			flushed = true;
>> >> +		}
>> >> +
>> >> +		for_each_online_cpu(cpu) {
>> >> +			sfw = &per_cpu(slub_flush, cpu);
>> >> +			if (sfw->skip)
>> >> +				continue;
>> >> +			flush_work(&sfw->work);
>> >> +		}
>> >> +
>> >> +		mutex_unlock(&flush_lock);
>> >> +	}
>> >> +
>> >> +	mutex_unlock(&slab_mutex);
>> >> +	cpus_read_unlock();
>> >> +
>> >> +	if (flushed)
>> >> +		rcu_barrier();
>> > 
>> > I think we need to call rcu_barrier() even if flushed == false?
>> > 
>> > Maybe a kvfree_rcu()'d object was already waiting for the rcu callback to
>> > be processed before flush_all_rcu_sheaves() is called, and
>> > in flush_all_rcu_sheaves() we skipped all (cache, cpu) pairs,
>> > so flushed == false but the rcu callback isn't processed yet
>> > by the end of the function?
>> > 
>> > That sounds like a very unlikely to happen in a realistic scenario,
>> > but still possible...
>> 
>> Yes also good point, will flush unconditionally.
>> 
>> Maybe in __kfree_rcu_sheaf() I should also move the call_rcu(...) before
>> local_unlock(). So we don't end up seeing a NULL pcs->rcu_free in
>> flush_all_rcu_sheaves() because __kfree_rcu_sheaf() already set it to NULL,
>> but didn't yet do the call_rcu() as it got preempted after local_unlock().
>> 
>> But then rcu_barrier() itself probably won't mean we make sure such cpus
>> finished the local_locked section, if we didn't queue work on them. So maybe
>> we need synchronize_rcu()?
> 
> Do you need both rcu_barrier() and synchronize_rcu(), maybe along with

We need the local_lock protected sections of __kfree_rcu_sheaf() to be
finished (which might be doing call_rcu(rcu_sheaf)), and then the pending
call_rcu(rcu_sheaf) callbacks to be executed.
I think that means both synchronize_rcu() and rcu_barrier(). Possibly an
RCU critical section in __kfree_rcu_sheaf() too, unless local_lock implies
that even on RT.
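
Roughly (untested sketch of that combination; note the updated patch
later in this thread takes a different route and instead queues the
flush work on every cpu unconditionally, followed by a single
rcu_barrier()):

	/*
	 * Wait for cpus to leave the local_lock'd section of
	 * __kfree_rcu_sheaf(); it runs with preemption disabled, so it
	 * acts as an RCU read-side critical section (modulo the RT
	 * question above)...
	 */
	synchronize_rcu();

	/* ...then wait for the call_rcu(rcu_sheaf) callbacks it queued. */
	rcu_barrier();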

> kvfree_rcu_barrier() as well?

This (flush_all_rcu_sheaves()) is for the implementation of
kvfree_rcu_barrier() in a world with rcu_free sheaves. So it's called from
kvfree_rcu_barrier().

> It would not be hard to make such a thing,
> using workqueues or some such.  Not sure what the API should look like,

So there should be no one else calling such an API. There might be new users
of kvfree_rcu_barrier() doing this indirectly in the future.

> especially should people want other RCU flavors to get into the act
> as well.
> 
> 							Thanx, Paul



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-09-17 12:05         ` Vlastimil Babka
@ 2025-09-17 13:07           ` Harry Yoo
  2025-09-17 13:21             ` Vlastimil Babka
  0 siblings, 1 reply; 95+ messages in thread
From: Harry Yoo @ 2025-09-17 13:07 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki,
	Sidhartha Kumar, linux-mm, linux-kernel, rcu, maple-tree,
	Paul E . McKenney

On Wed, Sep 17, 2025 at 02:05:49PM +0200, Vlastimil Babka wrote:
> On 9/17/25 13:32, Harry Yoo wrote:
> > On Wed, Sep 17, 2025 at 11:55:10AM +0200, Vlastimil Babka wrote:
> >> On 9/17/25 10:30, Harry Yoo wrote:
> >> > On Wed, Sep 10, 2025 at 10:01:06AM +0200, Vlastimil Babka wrote:
> >> >> +				sfw->skip = true;
> >> >> +				continue;
> >> >> +			}
> >> >>
> >> >> +			INIT_WORK(&sfw->work, flush_rcu_sheaf);
> >> >> +			sfw->skip = false;
> >> >> +			sfw->s = s;
> >> >> +			queue_work_on(cpu, flushwq, &sfw->work);
> >> >> +			flushed = true;
> >> >> +		}
> >> >> +
> >> >> +		for_each_online_cpu(cpu) {
> >> >> +			sfw = &per_cpu(slub_flush, cpu);
> >> >> +			if (sfw->skip)
> >> >> +				continue;
> >> >> +			flush_work(&sfw->work);
> >> >> +		}
> >> >> +
> >> >> +		mutex_unlock(&flush_lock);
> >> >> +	}
> >> >> +
> >> >> +	mutex_unlock(&slab_mutex);
> >> >> +	cpus_read_unlock();
> >> >> +
> >> >> +	if (flushed)
> >> >> +		rcu_barrier();
> >> > 
> >> > I think we need to call rcu_barrier() even if flushed == false?
> >> > 
> >> > Maybe a kvfree_rcu()'d object was already waiting for the rcu callback to
> >> > be processed before flush_all_rcu_sheaves() is called, and
> >> > in flush_all_rcu_sheaves() we skipped all (cache, cpu) pairs,
> >> > so flushed == false but the rcu callback isn't processed yet
> >> > by the end of the function?
> >> > 
> >> > That sounds like a very unlikely to happen in a realistic scenario,
> >> > but still possible...
> >> 
> >> Yes also good point, will flush unconditionally.
> >> 
> >> Maybe in __kfree_rcu_sheaf() I should also move the call_rcu(...) before
> >> local_unlock().
> >>
> >> So we don't end up seeing a NULL pcs->rcu_free in
> >> flush_all_rcu_sheaves() because __kfree_rcu_sheaf() already set it to NULL,
> >> but didn't yet do the call_rcu() as it got preempted after local_unlock().
> > 
> > Makes sense to me.

Wait, I'm confused.

I think the caller of kvfree_rcu_barrier() should make sure that it's invoked
only after a kvfree_rcu(X, rhs) call has returned, if the caller expects
the object X to be freed before kvfree_rcu_barrier() returns?

IOW if flush_all_rcu_sheaves() is called while __kfree_rcu_sheaf(s, X) was
running on another CPU, we don't have to guarantee that
flush_all_rcu_sheaves() returns after the object X is freed?

> >> But then rcu_barrier() itself probably won't mean we make sure such cpus
> >> finished the local_locked section, if we didn't queue work on them. So maybe
> >> we need synchronize_rcu()?

So... we don't need a synchronize_rcu() then?

Or my brain started malfunctioning again :D

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-09-17 13:07           ` Harry Yoo
@ 2025-09-17 13:21             ` Vlastimil Babka
  2025-09-17 13:34               ` Harry Yoo
  0 siblings, 1 reply; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-17 13:21 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki,
	Sidhartha Kumar, linux-mm, linux-kernel, rcu, maple-tree,
	Paul E . McKenney

On 9/17/25 15:07, Harry Yoo wrote:
> On Wed, Sep 17, 2025 at 02:05:49PM +0200, Vlastimil Babka wrote:
>> On 9/17/25 13:32, Harry Yoo wrote:
>> > On Wed, Sep 17, 2025 at 11:55:10AM +0200, Vlastimil Babka wrote:
>> >> On 9/17/25 10:30, Harry Yoo wrote:
>> >> > On Wed, Sep 10, 2025 at 10:01:06AM +0200, Vlastimil Babka wrote:
>> >> >> +				sfw->skip = true;
>> >> >> +				continue;
>> >> >> +			}
>> >> >>
>> >> >> +			INIT_WORK(&sfw->work, flush_rcu_sheaf);
>> >> >> +			sfw->skip = false;
>> >> >> +			sfw->s = s;
>> >> >> +			queue_work_on(cpu, flushwq, &sfw->work);
>> >> >> +			flushed = true;
>> >> >> +		}
>> >> >> +
>> >> >> +		for_each_online_cpu(cpu) {
>> >> >> +			sfw = &per_cpu(slub_flush, cpu);
>> >> >> +			if (sfw->skip)
>> >> >> +				continue;
>> >> >> +			flush_work(&sfw->work);
>> >> >> +		}
>> >> >> +
>> >> >> +		mutex_unlock(&flush_lock);
>> >> >> +	}
>> >> >> +
>> >> >> +	mutex_unlock(&slab_mutex);
>> >> >> +	cpus_read_unlock();
>> >> >> +
>> >> >> +	if (flushed)
>> >> >> +		rcu_barrier();
>> >> > 
>> >> > I think we need to call rcu_barrier() even if flushed == false?
>> >> > 
>> >> > Maybe a kvfree_rcu()'d object was already waiting for the rcu callback to
>> >> > be processed before flush_all_rcu_sheaves() is called, and
>> >> > in flush_all_rcu_sheaves() we skipped all (cache, cpu) pairs,
>> >> > so flushed == false but the rcu callback isn't processed yet
>> >> > by the end of the function?
>> >> > 
>> >> > That sounds like a very unlikely to happen in a realistic scenario,
>> >> > but still possible...
>> >> 
>> >> Yes also good point, will flush unconditionally.
>> >> 
>> >> Maybe in __kfree_rcu_sheaf() I should also move the call_rcu(...) before
>> >> local_unlock().
>> >>
>> >> So we don't end up seeing a NULL pcs->rcu_free in
>> >> flush_all_rcu_sheaves() because __kfree_rcu_sheaf() already set it to NULL,
>> >> but didn't yet do the call_rcu() as it got preempted after local_unlock().
>> > 
>> > Makes sense to me.
> 
> Wait, I'm confused.
> 
> I think the caller of kvfree_rcu_barrier() should make sure that it's invoked
> only after a kvfree_rcu(X, rhs) call has returned, if the caller expects
> the object X to be freed before kvfree_rcu_barrier() returns?

Hmm, the caller of kvfree_rcu(X, rhs) might have returned without filling up
the rcu_sheaf fully and thus without submitting it to call_rcu(), then
migrated to another cpu. Then it calls kvfree_rcu_barrier() while another,
unrelated kvfree_rcu(Y, rhs) call on the previous cpu, for the same
kmem_cache (kvfree_rcu_barrier() is not only for cache destruction), fills
up the rcu_sheaf fully and is about to call_rcu() on it. And since that
sheaf also contains the object X, we should make sure it is flushed.

> IOW if flush_all_rcu_sheaves() is called while __kfree_rcu_sheaf(s, X) was
> running on another CPU, we don't have to guarantee that
> flush_all_rcu_sheaves() returns after the object X is freed?
> 
>> >> But then rcu_barrier() itself probably won't mean we make sure such cpus
>> >> finished the local_locked section, if we didn't queue work on them. So maybe
>> >> we need synchronize_rcu()?
> 
> So... we don't need a synchronize_rcu() then?
> 
> Or my brain started malfunctioning again :D
> 



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-09-17 13:21             ` Vlastimil Babka
@ 2025-09-17 13:34               ` Harry Yoo
  2025-09-17 14:14                 ` Vlastimil Babka
  0 siblings, 1 reply; 95+ messages in thread
From: Harry Yoo @ 2025-09-17 13:34 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki,
	Sidhartha Kumar, linux-mm, linux-kernel, rcu, maple-tree,
	Paul E . McKenney

On Wed, Sep 17, 2025 at 03:21:31PM +0200, Vlastimil Babka wrote:
> On 9/17/25 15:07, Harry Yoo wrote:
> > On Wed, Sep 17, 2025 at 02:05:49PM +0200, Vlastimil Babka wrote:
> >> On 9/17/25 13:32, Harry Yoo wrote:
> >> > On Wed, Sep 17, 2025 at 11:55:10AM +0200, Vlastimil Babka wrote:
> >> >> On 9/17/25 10:30, Harry Yoo wrote:
> >> >> > On Wed, Sep 10, 2025 at 10:01:06AM +0200, Vlastimil Babka wrote:
> >> >> >> +				sfw->skip = true;
> >> >> >> +				continue;
> >> >> >> +			}
> >> >> >>
> >> >> >> +			INIT_WORK(&sfw->work, flush_rcu_sheaf);
> >> >> >> +			sfw->skip = false;
> >> >> >> +			sfw->s = s;
> >> >> >> +			queue_work_on(cpu, flushwq, &sfw->work);
> >> >> >> +			flushed = true;
> >> >> >> +		}
> >> >> >> +
> >> >> >> +		for_each_online_cpu(cpu) {
> >> >> >> +			sfw = &per_cpu(slub_flush, cpu);
> >> >> >> +			if (sfw->skip)
> >> >> >> +				continue;
> >> >> >> +			flush_work(&sfw->work);
> >> >> >> +		}
> >> >> >> +
> >> >> >> +		mutex_unlock(&flush_lock);
> >> >> >> +	}
> >> >> >> +
> >> >> >> +	mutex_unlock(&slab_mutex);
> >> >> >> +	cpus_read_unlock();
> >> >> >> +
> >> >> >> +	if (flushed)
> >> >> >> +		rcu_barrier();
> >> >> > 
> >> >> > I think we need to call rcu_barrier() even if flushed == false?
> >> >> > 
> >> >> > Maybe a kvfree_rcu()'d object was already waiting for the rcu callback to
> >> >> > be processed before flush_all_rcu_sheaves() is called, and
> >> >> > in flush_all_rcu_sheaves() we skipped all (cache, cpu) pairs,
> >> >> > so flushed == false but the rcu callback isn't processed yet
> >> >> > by the end of the function?
> >> >> > 
> >> >> > That sounds like a very unlikely to happen in a realistic scenario,
> >> >> > but still possible...
> >> >> 
> >> >> Yes also good point, will flush unconditionally.
> >> >> 
> >> >> Maybe in __kfree_rcu_sheaf() I should also move the call_rcu(...) before
> >> >> local_unlock().
> >> >>
> >> >> So we don't end up seeing a NULL pcs->rcu_free in
> >> >> flush_all_rcu_sheaves() because __kfree_rcu_sheaf() already set it to NULL,
> >> >> but didn't yet do the call_rcu() as it got preempted after local_unlock().
> >> > 
> >> > Makes sense to me.
> > 
> > Wait, I'm confused.
> > 
> > I think the caller of kvfree_rcu_barrier() should make sure that it's invoked
> > only after a kvfree_rcu(X, rhs) call has returned, if the caller expects
> > the object X to be freed before kvfree_rcu_barrier() returns?
> 
> Hmm, the caller of kvfree_rcu(X, rhs) might have returned without filling up
> the rcu_sheaf fully and thus without submitting it to call_rcu(), then
> migrated to another cpu. Then it calls kvfree_rcu_barrier() while another
> unrelated kvfree_rcu(X, rhs) call on the previous cpu is for the same
> kmem_cache (kvfree_rcu_barrier() is not only for cache destruction), fills
> up the rcu_sheaf fully and is about to call_rcu() on it. And since that
> sheaf also contains the object X, we should make sure that is flushed.

I was going to say "but we queue and wait for the flushing work to
complete, so the sheaf containing object X should be flushed?"

But nah, that's true only if we see pcs->rcu_free != NULL in
flush_all_rcu_sheaves().

You are right...

Hmm, maybe it's simpler to fix this by never skipping queueing the work
even when pcs->rcu_free == NULL?

> > IOW if flush_all_rcu_sheaves() is called while __kfree_rcu_sheaf(s, X) was
> > running on another CPU, we don't have to guarantee that
> > flush_all_rcu_sheaves() returns after the object X is freed?
> > 
> >> >> But then rcu_barrier() itself probably won't mean we make sure such cpus
> >> >> finished the local_locked section, if we didn't queue work on them. So maybe
> >> >> we need synchronize_rcu()?
> > 
> > So... we don't need a synchronize_rcu() then?
> > 
> > Or my brain started malfunctioning again :D

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-09-17 13:34               ` Harry Yoo
@ 2025-09-17 14:14                 ` Vlastimil Babka
  2025-09-18  8:09                   ` Vlastimil Babka
  0 siblings, 1 reply; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-17 14:14 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki,
	Sidhartha Kumar, linux-mm, linux-kernel, rcu, maple-tree,
	Paul E . McKenney

On 9/17/25 15:34, Harry Yoo wrote:
> On Wed, Sep 17, 2025 at 03:21:31PM +0200, Vlastimil Babka wrote:
>> On 9/17/25 15:07, Harry Yoo wrote:
>> > On Wed, Sep 17, 2025 at 02:05:49PM +0200, Vlastimil Babka wrote:
>> >> On 9/17/25 13:32, Harry Yoo wrote:
>> >> > On Wed, Sep 17, 2025 at 11:55:10AM +0200, Vlastimil Babka wrote:
>> >> >> On 9/17/25 10:30, Harry Yoo wrote:
>> >> >> > On Wed, Sep 10, 2025 at 10:01:06AM +0200, Vlastimil Babka wrote:
>> >> >> >> +				sfw->skip = true;
>> >> >> >> +				continue;
>> >> >> >> +			}
>> >> >> >>
>> >> >> >> +			INIT_WORK(&sfw->work, flush_rcu_sheaf);
>> >> >> >> +			sfw->skip = false;
>> >> >> >> +			sfw->s = s;
>> >> >> >> +			queue_work_on(cpu, flushwq, &sfw->work);
>> >> >> >> +			flushed = true;
>> >> >> >> +		}
>> >> >> >> +
>> >> >> >> +		for_each_online_cpu(cpu) {
>> >> >> >> +			sfw = &per_cpu(slub_flush, cpu);
>> >> >> >> +			if (sfw->skip)
>> >> >> >> +				continue;
>> >> >> >> +			flush_work(&sfw->work);
>> >> >> >> +		}
>> >> >> >> +
>> >> >> >> +		mutex_unlock(&flush_lock);
>> >> >> >> +	}
>> >> >> >> +
>> >> >> >> +	mutex_unlock(&slab_mutex);
>> >> >> >> +	cpus_read_unlock();
>> >> >> >> +
>> >> >> >> +	if (flushed)
>> >> >> >> +		rcu_barrier();
>> >> >> > 
>> >> >> > I think we need to call rcu_barrier() even if flushed == false?
>> >> >> > 
>> >> >> > Maybe a kvfree_rcu()'d object was already waiting for the rcu callback to
>> >> >> > be processed before flush_all_rcu_sheaves() is called, and
>> >> >> > in flush_all_rcu_sheaves() we skipped all (cache, cpu) pairs,
>> >> >> > so flushed == false but the rcu callback isn't processed yet
>> >> >> > by the end of the function?
>> >> >> > 
>> >> >> > That sounds like a very unlikely to happen in a realistic scenario,
>> >> >> > but still possible...
>> >> >> 
>> >> >> Yes also good point, will flush unconditionally.
>> >> >> 
>> >> >> Maybe in __kfree_rcu_sheaf() I should also move the call_rcu(...) before
>> >> >> local_unlock().
>> >> >>
>> >> >> So we don't end up seeing a NULL pcs->rcu_free in
>> >> >> flush_all_rcu_sheaves() because __kfree_rcu_sheaf() already set it to NULL,
>> >> >> but didn't yet do the call_rcu() as it got preempted after local_unlock().
>> >> > 
>> >> > Makes sense to me.
>> > 
>> > Wait, I'm confused.
>> > 
>> > I think the caller of kvfree_rcu_barrier() should make sure that it's invoked
>> > only after a kvfree_rcu(X, rhs) call has returned, if the caller expects
>> > the object X to be freed before kvfree_rcu_barrier() returns?
>> 
>> Hmm, the caller of kvfree_rcu(X, rhs) might have returned without filling up
>> the rcu_sheaf fully and thus without submitting it to call_rcu(), then
>> migrated to another cpu. Then it calls kvfree_rcu_barrier() while another
>> unrelated kvfree_rcu(X, rhs) call on the previous cpu is for the same
>> kmem_cache (kvfree_rcu_barrier() is not only for cache destruction), fills
>> up the rcu_sheaf fully and is about to call_rcu() on it. And since that
>> sheaf also contains the object X, we should make sure that is flushed.
> 
> I was going to say "but we queue and wait for the flushing work to
> complete, so the sheaf containing object X should be flushed?"
> 
> But nah, that's true only if we see pcs->rcu_free != NULL in
> flush_all_rcu_sheaves().
> 
> You are right...
> 
> Hmm, maybe it's simpler to fix this by never skipping queueing the work
> even when pcs->rcu_free == NULL?

I guess it's simpler, yeah.
We might have to think of something better once all caches have sheaves;
queueing and waiting for work to finish on each cpu, repeated for each
kmem_cache, might be just too much?

>> > IOW if flush_all_rcu_sheaves() is called while __kfree_rcu_sheaf(s, X) was
>> > running on another CPU, we don't have to guarantee that
>> > flush_all_rcu_sheaves() returns after the object X is freed?
>> > 
>> >> >> But then rcu_barrier() itself probably won't mean we make sure such cpus
>> >> >> finished the local_locked section, if we didn't queue work on them. So maybe
>> >> >> we need synchronize_rcu()?
>> > 
>> > So... we don't need a synchronize_rcu() then?
>> > 
>> > Or my brain started malfunctioning again :D
> 



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-09-17 14:14                 ` Vlastimil Babka
@ 2025-09-18  8:09                   ` Vlastimil Babka
  2025-09-19  6:47                     ` Harry Yoo
  2025-09-25  4:35                     ` Suren Baghdasaryan
  0 siblings, 2 replies; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-18  8:09 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki,
	Sidhartha Kumar, linux-mm, linux-kernel, rcu, maple-tree,
	Paul E . McKenney

On 9/17/25 16:14, Vlastimil Babka wrote:
> On 9/17/25 15:34, Harry Yoo wrote:
>> On Wed, Sep 17, 2025 at 03:21:31PM +0200, Vlastimil Babka wrote:
>>> On 9/17/25 15:07, Harry Yoo wrote:
>>> > On Wed, Sep 17, 2025 at 02:05:49PM +0200, Vlastimil Babka wrote:
>>> >> On 9/17/25 13:32, Harry Yoo wrote:
>>> >> > On Wed, Sep 17, 2025 at 11:55:10AM +0200, Vlastimil Babka wrote:
>>> >> >> On 9/17/25 10:30, Harry Yoo wrote:
>>> >> >> > On Wed, Sep 10, 2025 at 10:01:06AM +0200, Vlastimil Babka wrote:
>>> >> >> >> +				sfw->skip = true;
>>> >> >> >> +				continue;
>>> >> >> >> +			}
>>> >> >> >>
>>> >> >> >> +			INIT_WORK(&sfw->work, flush_rcu_sheaf);
>>> >> >> >> +			sfw->skip = false;
>>> >> >> >> +			sfw->s = s;
>>> >> >> >> +			queue_work_on(cpu, flushwq, &sfw->work);
>>> >> >> >> +			flushed = true;
>>> >> >> >> +		}
>>> >> >> >> +
>>> >> >> >> +		for_each_online_cpu(cpu) {
>>> >> >> >> +			sfw = &per_cpu(slub_flush, cpu);
>>> >> >> >> +			if (sfw->skip)
>>> >> >> >> +				continue;
>>> >> >> >> +			flush_work(&sfw->work);
>>> >> >> >> +		}
>>> >> >> >> +
>>> >> >> >> +		mutex_unlock(&flush_lock);
>>> >> >> >> +	}
>>> >> >> >> +
>>> >> >> >> +	mutex_unlock(&slab_mutex);
>>> >> >> >> +	cpus_read_unlock();
>>> >> >> >> +
>>> >> >> >> +	if (flushed)
>>> >> >> >> +		rcu_barrier();
>>> >> >> > 
>>> >> >> > I think we need to call rcu_barrier() even if flushed == false?
>>> >> >> > 
>>> >> >> > Maybe a kvfree_rcu()'d object was already waiting for the rcu callback to
>>> >> >> > be processed before flush_all_rcu_sheaves() is called, and
>>> >> >> > in flush_all_rcu_sheaves() we skipped all (cache, cpu) pairs,
>>> >> >> > so flushed == false but the rcu callback isn't processed yet
>>> >> >> > by the end of the function?
>>> >> >> > 
>>> >> >> > That sounds like a very unlikely to happen in a realistic scenario,
>>> >> >> > but still possible...
>>> >> >> 
>>> >> >> Yes also good point, will flush unconditionally.
>>> >> >> 
>>> >> >> Maybe in __kfree_rcu_sheaf() I should also move the call_rcu(...) before
>>> >> >> local_unlock().
>>> >> >>
>>> >> >> So we don't end up seeing a NULL pcs->rcu_free in
>>> >> >> flush_all_rcu_sheaves() because __kfree_rcu_sheaf() already set it to NULL,
>>> >> >> but didn't yet do the call_rcu() as it got preempted after local_unlock().
>>> >> > 
>>> >> > Makes sense to me.
>>> > 
>>> > Wait, I'm confused.
>>> > 
>>> > I think the caller of kvfree_rcu_barrier() should make sure that it's invoked
>>> > only after a kvfree_rcu(X, rhs) call has returned, if the caller expects
>>> > the object X to be freed before kvfree_rcu_barrier() returns?
>>> 
>>> Hmm, the caller of kvfree_rcu(X, rhs) might have returned without filling up
>>> the rcu_sheaf fully and thus without submitting it to call_rcu(), then
>>> migrated to another cpu. Then it calls kvfree_rcu_barrier() while another
>>> unrelated kvfree_rcu(X, rhs) call on the previous cpu is for the same
>>> kmem_cache (kvfree_rcu_barrier() is not only for cache destruction), fills
>>> up the rcu_sheaf fully and is about to call_rcu() on it. And since that
>>> sheaf also contains the object X, we should make sure that is flushed.
>> 
>> I was going to say "but we queue and wait for the flushing work to
>> complete, so the sheaf containing object X should be flushed?"
>> 
>> But nah, that's true only if we see pcs->rcu_free != NULL in
>> flush_all_rcu_sheaves().
>> 
>> You are right...
>> 
>> Hmm, maybe it's simpler to fix this by never skipping queueing the work
>> even when pcs->rcu_free == NULL?
> 
> I guess it's simpler, yeah.

So what about this? The unconditional queueing should cover all races with
__kfree_rcu_sheaf(), so there's just an unconditional rcu_barrier() at the end.

From 0722b29fa1625b31c05d659d1d988ec882247b38 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Wed, 3 Sep 2025 14:59:46 +0200
Subject: [PATCH] slab: add sheaf support for batching kfree_rcu() operations

Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
For caches with sheaves, on each cpu maintain a rcu_free sheaf in
addition to main and spare sheaves.

kfree_rcu() operations will try to put objects on this sheaf. Once full,
the sheaf is detached and submitted to call_rcu() with a handler that
will try to put it in the barn, or flush to slab pages using bulk free,
when the barn is full. Then a new empty sheaf must be obtained to put
more objects there.

It's possible that no free sheaves are available to use for a new
rcu_free sheaf, and the allocation in kfree_rcu() context can only use
GFP_NOWAIT and thus may fail. In that case, fall back to the existing
kfree_rcu() implementation.

Expected advantages:
- batching the kfree_rcu() operations, that could eventually replace the
  existing batching
- sheaves can be reused for allocations via barn instead of being
  flushed to slabs, which is more efficient
  - this includes cases where only some cpus are allowed to process rcu
    callbacks (Android)

Possible disadvantage:
- objects might be waiting for more than their grace period (it is
  determined by the last object freed into the sheaf), increasing memory
  usage - but the existing batching does that too.

Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
implementation favors smaller memory footprint over performance.

Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
contexts where kfree_rcu() is called might not be compatible with taking
a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
spinlock - the current kfree_rcu() implementation avoids doing that.

Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
that have them. This is not a cheap operation, but the barrier usage is
rare - currently kmem_cache_destroy() or on module unload.

Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
count how many kfree_rcu() used the rcu_free sheaf successfully and how
many had to fall back to the existing implementation.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slab.h        |   3 +
 mm/slab_common.c |  26 +++++
 mm/slub.c        | 267 ++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 294 insertions(+), 2 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index 206987ce44a4..e82e51c44bd0 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -435,6 +435,9 @@ static inline bool is_kmalloc_normal(struct kmem_cache *s)
 	return !(s->flags & (SLAB_CACHE_DMA|SLAB_ACCOUNT|SLAB_RECLAIM_ACCOUNT));
 }
 
+bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj);
+void flush_all_rcu_sheaves(void);
+
 #define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | \
 			 SLAB_CACHE_DMA32 | SLAB_PANIC | \
 			 SLAB_TYPESAFE_BY_RCU | SLAB_DEBUG_OBJECTS | \
diff --git a/mm/slab_common.c b/mm/slab_common.c
index e2b197e47866..005a4319c06a 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1608,6 +1608,27 @@ static void kfree_rcu_work(struct work_struct *work)
 		kvfree_rcu_list(head);
 }
 
+static bool kfree_rcu_sheaf(void *obj)
+{
+	struct kmem_cache *s;
+	struct folio *folio;
+	struct slab *slab;
+
+	if (is_vmalloc_addr(obj))
+		return false;
+
+	folio = virt_to_folio(obj);
+	if (unlikely(!folio_test_slab(folio)))
+		return false;
+
+	slab = folio_slab(folio);
+	s = slab->slab_cache;
+	if (s->cpu_sheaves)
+		return __kfree_rcu_sheaf(s, obj);
+
+	return false;
+}
+
 static bool
 need_offload_krc(struct kfree_rcu_cpu *krcp)
 {
@@ -1952,6 +1973,9 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
 	if (!head)
 		might_sleep();
 
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT) && kfree_rcu_sheaf(ptr))
+		return;
+
 	// Queue the object but don't yet schedule the batch.
 	if (debug_rcu_head_queue(ptr)) {
 		// Probable double kfree_rcu(), just leak.
@@ -2026,6 +2050,8 @@ void kvfree_rcu_barrier(void)
 	bool queued;
 	int i, cpu;
 
+	flush_all_rcu_sheaves();
+
 	/*
 	 * Firstly we detach objects and queue them over an RCU-batch
 	 * for all CPUs. Finally queued works are flushed for each CPU.
diff --git a/mm/slub.c b/mm/slub.c
index cba188b7e04d..171273f90efd 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -367,6 +367,8 @@ enum stat_item {
 	ALLOC_FASTPATH,		/* Allocation from cpu slab */
 	ALLOC_SLOWPATH,		/* Allocation by getting a new cpu slab */
 	FREE_PCS,		/* Free to percpu sheaf */
+	FREE_RCU_SHEAF,		/* Free to rcu_free sheaf */
+	FREE_RCU_SHEAF_FAIL,	/* Failed to free to a rcu_free sheaf */
 	FREE_FASTPATH,		/* Free to cpu slab */
 	FREE_SLOWPATH,		/* Freeing not to cpu slab */
 	FREE_FROZEN,		/* Freeing to frozen slab */
@@ -461,6 +463,7 @@ struct slab_sheaf {
 		struct rcu_head rcu_head;
 		struct list_head barn_list;
 	};
+	struct kmem_cache *cache;
 	unsigned int size;
 	void *objects[];
 };
@@ -469,6 +472,7 @@ struct slub_percpu_sheaves {
 	local_trylock_t lock;
 	struct slab_sheaf *main; /* never NULL when unlocked */
 	struct slab_sheaf *spare; /* empty or full, may be NULL */
+	struct slab_sheaf *rcu_free; /* for batching kfree_rcu() */
 };
 
 /*
@@ -2531,6 +2535,8 @@ static struct slab_sheaf *alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp)
 	if (unlikely(!sheaf))
 		return NULL;
 
+	sheaf->cache = s;
+
 	stat(s, SHEAF_ALLOC);
 
 	return sheaf;
@@ -2655,6 +2661,43 @@ static void sheaf_flush_unused(struct kmem_cache *s, struct slab_sheaf *sheaf)
 	sheaf->size = 0;
 }
 
+static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
+				     struct slab_sheaf *sheaf)
+{
+	bool init = slab_want_init_on_free(s);
+	void **p = &sheaf->objects[0];
+	unsigned int i = 0;
+
+	while (i < sheaf->size) {
+		struct slab *slab = virt_to_slab(p[i]);
+
+		memcg_slab_free_hook(s, slab, p + i, 1);
+		alloc_tagging_slab_free_hook(s, slab, p + i, 1);
+
+		if (unlikely(!slab_free_hook(s, p[i], init, true))) {
+			p[i] = p[--sheaf->size];
+			continue;
+		}
+
+		i++;
+	}
+}
+
+static void rcu_free_sheaf_nobarn(struct rcu_head *head)
+{
+	struct slab_sheaf *sheaf;
+	struct kmem_cache *s;
+
+	sheaf = container_of(head, struct slab_sheaf, rcu_head);
+	s = sheaf->cache;
+
+	__rcu_free_sheaf_prepare(s, sheaf);
+
+	sheaf_flush_unused(s, sheaf);
+
+	free_empty_sheaf(s, sheaf);
+}
+
 /*
  * Caller needs to make sure migration is disabled in order to fully flush
  * single cpu's sheaves
@@ -2667,7 +2710,7 @@ static void sheaf_flush_unused(struct kmem_cache *s, struct slab_sheaf *sheaf)
 static void pcs_flush_all(struct kmem_cache *s)
 {
 	struct slub_percpu_sheaves *pcs;
-	struct slab_sheaf *spare;
+	struct slab_sheaf *spare, *rcu_free;
 
 	local_lock(&s->cpu_sheaves->lock);
 	pcs = this_cpu_ptr(s->cpu_sheaves);
@@ -2675,6 +2718,9 @@ static void pcs_flush_all(struct kmem_cache *s)
 	spare = pcs->spare;
 	pcs->spare = NULL;
 
+	rcu_free = pcs->rcu_free;
+	pcs->rcu_free = NULL;
+
 	local_unlock(&s->cpu_sheaves->lock);
 
 	if (spare) {
@@ -2682,6 +2728,9 @@ static void pcs_flush_all(struct kmem_cache *s)
 		free_empty_sheaf(s, spare);
 	}
 
+	if (rcu_free)
+		call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
+
 	sheaf_flush_main(s);
 }
 
@@ -2698,6 +2747,11 @@ static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
 		free_empty_sheaf(s, pcs->spare);
 		pcs->spare = NULL;
 	}
+
+	if (pcs->rcu_free) {
+		call_rcu(&pcs->rcu_free->rcu_head, rcu_free_sheaf_nobarn);
+		pcs->rcu_free = NULL;
+	}
 }
 
 static void pcs_destroy(struct kmem_cache *s)
@@ -2723,6 +2777,7 @@ static void pcs_destroy(struct kmem_cache *s)
 		 */
 
 		WARN_ON(pcs->spare);
+		WARN_ON(pcs->rcu_free);
 
 		if (!WARN_ON(pcs->main->size)) {
 			free_empty_sheaf(s, pcs->main);
@@ -3780,7 +3835,7 @@ static bool has_pcs_used(int cpu, struct kmem_cache *s)
 
 	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
 
-	return (pcs->spare || pcs->main->size);
+	return (pcs->spare || pcs->rcu_free || pcs->main->size);
 }
 
 /*
@@ -3840,6 +3895,77 @@ static void flush_all(struct kmem_cache *s)
 	cpus_read_unlock();
 }
 
+static void flush_rcu_sheaf(struct work_struct *w)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct slab_sheaf *rcu_free;
+	struct slub_flush_work *sfw;
+	struct kmem_cache *s;
+
+	sfw = container_of(w, struct slub_flush_work, work);
+	s = sfw->s;
+
+	local_lock(&s->cpu_sheaves->lock);
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	rcu_free = pcs->rcu_free;
+	pcs->rcu_free = NULL;
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+	if (rcu_free)
+		call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
+}
+
+
+/* needed for kvfree_rcu_barrier() */
+void flush_all_rcu_sheaves(void)
+{
+	struct slub_flush_work *sfw;
+	struct kmem_cache *s;
+	unsigned int cpu;
+
+	cpus_read_lock();
+	mutex_lock(&slab_mutex);
+
+	list_for_each_entry(s, &slab_caches, list) {
+		if (!s->cpu_sheaves)
+			continue;
+
+		mutex_lock(&flush_lock);
+
+		for_each_online_cpu(cpu) {
+			sfw = &per_cpu(slub_flush, cpu);
+
+			/*
+			 * we don't check if rcu_free sheaf exists - racing
+			 * __kfree_rcu_sheaf() might have just removed it.
+			 * by executing flush_rcu_sheaf() on the cpu we make
+			 * sure the __kfree_rcu_sheaf() finished its call_rcu()
+			 */
+
+			INIT_WORK(&sfw->work, flush_rcu_sheaf);
+			sfw->skip = false;
+			sfw->s = s;
+			queue_work_on(cpu, flushwq, &sfw->work);
+		}
+
+		for_each_online_cpu(cpu) {
+			sfw = &per_cpu(slub_flush, cpu);
+			if (sfw->skip)
+				continue;
+			flush_work(&sfw->work);
+		}
+
+		mutex_unlock(&flush_lock);
+	}
+
+	mutex_unlock(&slab_mutex);
+	cpus_read_unlock();
+
+	rcu_barrier();
+}
+
 /*
  * Use the cpu notifier to insure that the cpu slabs are flushed when
  * necessary.
@@ -5413,6 +5539,134 @@ bool free_to_pcs(struct kmem_cache *s, void *object)
 	return true;
 }
 
+static void rcu_free_sheaf(struct rcu_head *head)
+{
+	struct slab_sheaf *sheaf;
+	struct node_barn *barn;
+	struct kmem_cache *s;
+
+	sheaf = container_of(head, struct slab_sheaf, rcu_head);
+
+	s = sheaf->cache;
+
+	/*
+	 * This may remove some objects due to slab_free_hook() returning false,
+	 * so that the sheaf might no longer be completely full. But it's easier
+	 * to handle it as full (unless it became completely empty), as the code
+	 * handles it fine. The only downside is that sheaf will serve fewer
+	 * allocations when reused. It only happens due to debugging, which is a
+	 * performance hit anyway.
+	 */
+	__rcu_free_sheaf_prepare(s, sheaf);
+
+	barn = get_node(s, numa_mem_id())->barn;
+
+	/* due to slab_free_hook() */
+	if (unlikely(sheaf->size == 0))
+		goto empty;
+
+	/*
+	 * Checking nr_full/nr_empty outside lock avoids contention in case the
+	 * barn is at the respective limit. Due to the race we might go over the
+	 * limit but that should be rare and harmless.
+	 */
+
+	if (data_race(barn->nr_full) < MAX_FULL_SHEAVES) {
+		stat(s, BARN_PUT);
+		barn_put_full_sheaf(barn, sheaf);
+		return;
+	}
+
+	stat(s, BARN_PUT_FAIL);
+	sheaf_flush_unused(s, sheaf);
+
+empty:
+	if (data_race(barn->nr_empty) < MAX_EMPTY_SHEAVES) {
+		barn_put_empty_sheaf(barn, sheaf);
+		return;
+	}
+
+	free_empty_sheaf(s, sheaf);
+}
+
+bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct slab_sheaf *rcu_sheaf;
+
+	if (!local_trylock(&s->cpu_sheaves->lock))
+		goto fail;
+
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (unlikely(!pcs->rcu_free)) {
+
+		struct slab_sheaf *empty;
+		struct node_barn *barn;
+
+		if (pcs->spare && pcs->spare->size == 0) {
+			pcs->rcu_free = pcs->spare;
+			pcs->spare = NULL;
+			goto do_free;
+		}
+
+		barn = get_barn(s);
+
+		empty = barn_get_empty_sheaf(barn);
+
+		if (empty) {
+			pcs->rcu_free = empty;
+			goto do_free;
+		}
+
+		local_unlock(&s->cpu_sheaves->lock);
+
+		empty = alloc_empty_sheaf(s, GFP_NOWAIT);
+
+		if (!empty)
+			goto fail;
+
+		if (!local_trylock(&s->cpu_sheaves->lock)) {
+			barn_put_empty_sheaf(barn, empty);
+			goto fail;
+		}
+
+		pcs = this_cpu_ptr(s->cpu_sheaves);
+
+		if (unlikely(pcs->rcu_free))
+			barn_put_empty_sheaf(barn, empty);
+		else
+			pcs->rcu_free = empty;
+	}
+
+do_free:
+
+	rcu_sheaf = pcs->rcu_free;
+
+	rcu_sheaf->objects[rcu_sheaf->size++] = obj;
+
+	if (likely(rcu_sheaf->size < s->sheaf_capacity))
+		rcu_sheaf = NULL;
+	else
+		pcs->rcu_free = NULL;
+
+	/*
+	 * we flush before local_unlock to make sure a racing
+	 * flush_all_rcu_sheaves() doesn't miss this sheaf
+	 */
+	if (rcu_sheaf)
+		call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+	stat(s, FREE_RCU_SHEAF);
+	return true;
+
+fail:
+	stat(s, FREE_RCU_SHEAF_FAIL);
+	return false;
+}
+
 /*
  * Bulk free objects to the percpu sheaves.
  * Unlike free_to_pcs() this includes the calls to all necessary hooks
@@ -6909,6 +7163,11 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
 	struct kmem_cache_node *n;
 
 	flush_all_cpus_locked(s);
+
+	/* we might have rcu sheaves in flight */
+	if (s->cpu_sheaves)
+		rcu_barrier();
+
 	/* Attempt to free all objects */
 	for_each_kmem_cache_node(s, node, n) {
 		if (n->barn)
@@ -8284,6 +8543,8 @@ STAT_ATTR(ALLOC_PCS, alloc_cpu_sheaf);
 STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
 STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
 STAT_ATTR(FREE_PCS, free_cpu_sheaf);
+STAT_ATTR(FREE_RCU_SHEAF, free_rcu_sheaf);
+STAT_ATTR(FREE_RCU_SHEAF_FAIL, free_rcu_sheaf_fail);
 STAT_ATTR(FREE_FASTPATH, free_fastpath);
 STAT_ATTR(FREE_SLOWPATH, free_slowpath);
 STAT_ATTR(FREE_FROZEN, free_frozen);
@@ -8382,6 +8643,8 @@ static struct attribute *slab_attrs[] = {
 	&alloc_fastpath_attr.attr,
 	&alloc_slowpath_attr.attr,
 	&free_cpu_sheaf_attr.attr,
+	&free_rcu_sheaf_attr.attr,
+	&free_rcu_sheaf_fail_attr.attr,
 	&free_fastpath_attr.attr,
 	&free_slowpath_attr.attr,
 	&free_frozen_attr.attr,
-- 
2.51.0
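
For orientation, a minimal sketch of the call pattern this path is meant to
serve, seen from a cache user's side. It is illustrative only and not part of
the patch: the sheaf_capacity opt-in via struct kmem_cache_args comes from
other patches in the series, and the "foo" names are made up.

#include <linux/module.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
	int payload;
	struct rcu_head rcu;	/* needed for kfree_rcu(ptr, field) */
};

static struct kmem_cache *foo_cache;
static struct foo *obj;

static int __init foo_init(void)
{
	struct kmem_cache_args args = {
		/* opt into percpu sheaves; field added elsewhere in the series */
		.sheaf_capacity = 32,
	};

	foo_cache = kmem_cache_create("foo", sizeof(struct foo), &args, 0);
	if (!foo_cache)
		return -ENOMEM;

	obj = kmem_cache_alloc(foo_cache, GFP_KERNEL);
	if (!obj) {
		kmem_cache_destroy(foo_cache);
		return -ENOMEM;
	}
	return 0;
}

static void __exit foo_exit(void)
{
	/* may end up sitting in a not-yet-full percpu rcu_free sheaf */
	kfree_rcu(obj, rcu);

	/*
	 * Per the changelog, the kvfree_rcu_barrier() issued on cache
	 * destruction (or module unload) flushes all rcu_free sheaves,
	 * so obj is guaranteed to be freed before the cache goes away.
	 */
	kmem_cache_destroy(foo_cache);
}

module_init(foo_init);
module_exit(foo_exit);
MODULE_LICENSE("GPL");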





* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-09-18  8:09                   ` Vlastimil Babka
@ 2025-09-19  6:47                     ` Harry Yoo
  2025-09-19  7:02                       ` Vlastimil Babka
  2025-09-25  4:35                     ` Suren Baghdasaryan
  1 sibling, 1 reply; 95+ messages in thread
From: Harry Yoo @ 2025-09-19  6:47 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki,
	Sidhartha Kumar, linux-mm, linux-kernel, rcu, maple-tree,
	Paul E . McKenney

On Thu, Sep 18, 2025 at 10:09:34AM +0200, Vlastimil Babka wrote:
> On 9/17/25 16:14, Vlastimil Babka wrote:
> > On 9/17/25 15:34, Harry Yoo wrote:
> >> On Wed, Sep 17, 2025 at 03:21:31PM +0200, Vlastimil Babka wrote:
> >>> On 9/17/25 15:07, Harry Yoo wrote:
> >>> > On Wed, Sep 17, 2025 at 02:05:49PM +0200, Vlastimil Babka wrote:
> >>> >> On 9/17/25 13:32, Harry Yoo wrote:
> >>> >> > On Wed, Sep 17, 2025 at 11:55:10AM +0200, Vlastimil Babka wrote:
> >>> >> >> On 9/17/25 10:30, Harry Yoo wrote:
> >>> >> >> > On Wed, Sep 10, 2025 at 10:01:06AM +0200, Vlastimil Babka wrote:
> >>> >> >> >> +				sfw->skip = true;
> >>> >> >> >> +				continue;
> >>> >> >> >> +			}
> >>> >> >> >>
> >>> >> >> >> +			INIT_WORK(&sfw->work, flush_rcu_sheaf);
> >>> >> >> >> +			sfw->skip = false;
> >>> >> >> >> +			sfw->s = s;
> >>> >> >> >> +			queue_work_on(cpu, flushwq, &sfw->work);
> >>> >> >> >> +			flushed = true;
> >>> >> >> >> +		}
> >>> >> >> >> +
> >>> >> >> >> +		for_each_online_cpu(cpu) {
> >>> >> >> >> +			sfw = &per_cpu(slub_flush, cpu);
> >>> >> >> >> +			if (sfw->skip)
> >>> >> >> >> +				continue;
> >>> >> >> >> +			flush_work(&sfw->work);
> >>> >> >> >> +		}
> >>> >> >> >> +
> >>> >> >> >> +		mutex_unlock(&flush_lock);
> >>> >> >> >> +	}
> >>> >> >> >> +
> >>> >> >> >> +	mutex_unlock(&slab_mutex);
> >>> >> >> >> +	cpus_read_unlock();
> >>> >> >> >> +
> >>> >> >> >> +	if (flushed)
> >>> >> >> >> +		rcu_barrier();
> >>> >> >> > 
> >>> >> >> > I think we need to call rcu_barrier() even if flushed == false?
> >>> >> >> > 
> >>> >> >> > Maybe a kvfree_rcu()'d object was already waiting for the rcu callback to
> >>> >> >> > be processed before flush_all_rcu_sheaves() is called, and
> >>> >> >> > in flush_all_rcu_sheaves() we skipped all (cache, cpu) pairs,
> >>> >> >> > so flushed == false but the rcu callback isn't processed yet
> >>> >> >> > by the end of the function?
> >>> >> >> > 
> >>> >> >> > That sounds like a very unlikely to happen in a realistic scenario,
> >>> >> >> > but still possible...
> >>> >> >> 
> >>> >> >> Yes also good point, will flush unconditionally.
> >>> >> >> 
> >>> >> >> Maybe in __kfree_rcu_sheaf() I should also move the call_rcu(...) before
> >>> >> >> local_unlock().
> >>> >> >>
> >>> >> >> So we don't end up seeing a NULL pcs->rcu_free in
> >>> >> >> flush_all_rcu_sheaves() because __kfree_rcu_sheaf() already set it to NULL,
> >>> >> >> but didn't yet do the call_rcu() as it got preempted after local_unlock().
> >>> >> > 
> >>> >> > Makes sense to me.
> >>> > 
> >>> > Wait, I'm confused.
> >>> > 
> >>> > I think the caller of kvfree_rcu_barrier() should make sure that it's invoked
> >>> > only after a kvfree_rcu(X, rhs) call has returned, if the caller expects
> >>> > the object X to be freed before kvfree_rcu_barrier() returns?
> >>> 
> >>> Hmm, the caller of kvfree_rcu(X, rhs) might have returned without filling up
> >>> the rcu_sheaf fully and thus without submitting it to call_rcu(), then
> >>> migrated to another cpu. Then it calls kvfree_rcu_barrier() while another
> >>> unrelated kvfree_rcu(X, rhs) call on the previous cpu is for the same
> >>> kmem_cache (kvfree_rcu_barrier() is not only for cache destruction), fills
> >>> up the rcu_sheaf fully and is about to call_rcu() on it. And since that
> >>> sheaf also contains the object X, we should make sure that is flushed.
> >> 
> >> I was going to say "but we queue and wait for the flushing work to
> >> complete, so the sheaf containing object X should be flushed?"
> >> 
> >> But nah, that's true only if we see pcs->rcu_free != NULL in
> >> flush_all_rcu_sheaves().
> >> 
> >> You are right...
> >> 
> >> Hmm, maybe it's simpler to fix this by never skipping queueing the work
> >> even when pcs->rcu_sheaf == NULL?
> > 
> > I guess it's simpler, yeah.
> 
> So what about this? The unconditional queueing should cover all races with
> __kfree_rcu_sheaf() so there's just unconditional rcu_barrier() in the end.
> 
> From 0722b29fa1625b31c05d659d1d988ec882247b38 Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka <vbabka@suse.cz>
> Date: Wed, 3 Sep 2025 14:59:46 +0200
> Subject: [PATCH] slab: add sheaf support for batching kfree_rcu() operations
> 
> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
> For caches with sheaves, on each cpu maintain a rcu_free sheaf in
> addition to main and spare sheaves.
> 
> kfree_rcu() operations will try to put objects on this sheaf. Once full,
> the sheaf is detached and submitted to call_rcu() with a handler that
> will try to put it in the barn, or flush to slab pages using bulk free,
> when the barn is full. Then a new empty sheaf must be obtained to put
> more objects there.
> 
> It's possible that no free sheaves are available to use for a new
> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
> kfree_rcu() implementation.
> 
> Expected advantages:
> - batching the kfree_rcu() operations, that could eventually replace the
>   existing batching
> - sheaves can be reused for allocations via barn instead of being
>   flushed to slabs, which is more efficient
>   - this includes cases where only some cpus are allowed to process rcu
>     callbacks (Android)
> 
> Possible disadvantage:
> - objects might be waiting for more than their grace period (it is
>   determined by the last object freed into the sheaf), increasing memory
>   usage - but the existing batching does that too.
> 
> Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
> implementation favors smaller memory footprint over performance.
> 
> Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
> contexts where kfree_rcu() is called might not be compatible with taking
> a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
> spinlock - the current kfree_rcu() implementation avoids doing that.
> 
> Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
> that have them. This is not a cheap operation, but the barrier usage is
> rare - currently kmem_cache_destroy() or on module unload.
> 
> Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
> count how many kfree_rcu() used the rcu_free sheaf successfully and how
> many had to fall back to the existing implementation.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---

Looks good to me,
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

with a nit:

> +bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
> +{
> +	struct slub_percpu_sheaves *pcs;
> +	struct slab_sheaf *rcu_sheaf;
> +
> +	if (!local_trylock(&s->cpu_sheaves->lock))
> +		goto fail;
> +
> +	pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +	if (unlikely(!pcs->rcu_free)) {
> +
> +		struct slab_sheaf *empty;
> +		struct node_barn *barn;
> +
> +		if (pcs->spare && pcs->spare->size == 0) {
> +			pcs->rcu_free = pcs->spare;
> +			pcs->spare = NULL;
> +			goto do_free;
> +		}
> +
> +		barn = get_barn(s);
> +
> +		empty = barn_get_empty_sheaf(barn);
> +
> +		if (empty) {
> +			pcs->rcu_free = empty;
> +			goto do_free;
> +		}
> +
> +		local_unlock(&s->cpu_sheaves->lock);
> +
> +		empty = alloc_empty_sheaf(s, GFP_NOWAIT);
> +
> +		if (!empty)
> +			goto fail;
> +
> +		if (!local_trylock(&s->cpu_sheaves->lock)) {
> +			barn_put_empty_sheaf(barn, empty);
> +			goto fail;
> +		}
> +
> +		pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +		if (unlikely(pcs->rcu_free))
> +			barn_put_empty_sheaf(barn, empty);
> +		else
> +			pcs->rcu_free = empty;
> +	}
> +
> +do_free:
> +
> +	rcu_sheaf = pcs->rcu_free;
> +
> +	rcu_sheaf->objects[rcu_sheaf->size++] = obj;
> +
> +	if (likely(rcu_sheaf->size < s->sheaf_capacity))
> +		rcu_sheaf = NULL;
> +	else
> +		pcs->rcu_free = NULL;
> +
> +	/*
> +	 * we flush before local_unlock to make sure a racing
> +	 * flush_all_rcu_sheaves() doesn't miss this sheaf
> +	 */
> +	if (rcu_sheaf)
> +		call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);

nit: now we don't have to put this inside local_lock()~local_unlock()?

-- 
Cheers,
Harry / Hyeonggon

> +	local_unlock(&s->cpu_sheaves->lock);
> +
> +	stat(s, FREE_RCU_SHEAF);
> +	return true;
> +
> +fail:
> +	stat(s, FREE_RCU_SHEAF_FAIL);
> +	return false;
> +}



* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-09-19  6:47                     ` Harry Yoo
@ 2025-09-19  7:02                       ` Vlastimil Babka
  2025-09-19  8:59                         ` Harry Yoo
  0 siblings, 1 reply; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-19  7:02 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki,
	Sidhartha Kumar, linux-mm, linux-kernel, rcu, maple-tree,
	Paul E . McKenney

On 9/19/25 08:47, Harry Yoo wrote:
> On Thu, Sep 18, 2025 at 10:09:34AM +0200, Vlastimil Babka wrote:
>> On 9/17/25 16:14, Vlastimil Babka wrote:
>> > On 9/17/25 15:34, Harry Yoo wrote:
>> >> On Wed, Sep 17, 2025 at 03:21:31PM +0200, Vlastimil Babka wrote:
>> >>> On 9/17/25 15:07, Harry Yoo wrote:
>> >>> > On Wed, Sep 17, 2025 at 02:05:49PM +0200, Vlastimil Babka wrote:
>> >>> >> On 9/17/25 13:32, Harry Yoo wrote:
>> >>> >> > On Wed, Sep 17, 2025 at 11:55:10AM +0200, Vlastimil Babka wrote:
>> >>> >> >> On 9/17/25 10:30, Harry Yoo wrote:
>> >>> >> >> > On Wed, Sep 10, 2025 at 10:01:06AM +0200, Vlastimil Babka wrote:
>> >>> >> >> >> +				sfw->skip = true;
>> >>> >> >> >> +				continue;
>> >>> >> >> >> +			}
>> >>> >> >> >>
>> >>> >> >> >> +			INIT_WORK(&sfw->work, flush_rcu_sheaf);
>> >>> >> >> >> +			sfw->skip = false;
>> >>> >> >> >> +			sfw->s = s;
>> >>> >> >> >> +			queue_work_on(cpu, flushwq, &sfw->work);
>> >>> >> >> >> +			flushed = true;
>> >>> >> >> >> +		}
>> >>> >> >> >> +
>> >>> >> >> >> +		for_each_online_cpu(cpu) {
>> >>> >> >> >> +			sfw = &per_cpu(slub_flush, cpu);
>> >>> >> >> >> +			if (sfw->skip)
>> >>> >> >> >> +				continue;
>> >>> >> >> >> +			flush_work(&sfw->work);
>> >>> >> >> >> +		}
>> >>> >> >> >> +
>> >>> >> >> >> +		mutex_unlock(&flush_lock);
>> >>> >> >> >> +	}
>> >>> >> >> >> +
>> >>> >> >> >> +	mutex_unlock(&slab_mutex);
>> >>> >> >> >> +	cpus_read_unlock();
>> >>> >> >> >> +
>> >>> >> >> >> +	if (flushed)
>> >>> >> >> >> +		rcu_barrier();
>> >>> >> >> > 
>> >>> >> >> > I think we need to call rcu_barrier() even if flushed == false?
>> >>> >> >> > 
>> >>> >> >> > Maybe a kvfree_rcu()'d object was already waiting for the rcu callback to
>> >>> >> >> > be processed before flush_all_rcu_sheaves() is called, and
>> >>> >> >> > in flush_all_rcu_sheaves() we skipped all (cache, cpu) pairs,
>> >>> >> >> > so flushed == false but the rcu callback isn't processed yet
>> >>> >> >> > by the end of the function?
>> >>> >> >> > 
>> >>> >> >> > That sounds like a very unlikely to happen in a realistic scenario,
>> >>> >> >> > but still possible...
>> >>> >> >> 
>> >>> >> >> Yes also good point, will flush unconditionally.
>> >>> >> >> 
>> >>> >> >> Maybe in __kfree_rcu_sheaf() I should also move the call_rcu(...) before
>> >>> >> >> local_unlock().
>> >>> >> >>
>> >>> >> >> So we don't end up seeing a NULL pcs->rcu_free in
>> >>> >> >> flush_all_rcu_sheaves() because __kfree_rcu_sheaf() already set it to NULL,
>> >>> >> >> but didn't yet do the call_rcu() as it got preempted after local_unlock().
>> >>> >> > 
>> >>> >> > Makes sense to me.
>> >>> > 
>> >>> > Wait, I'm confused.
>> >>> > 
>> >>> > I think the caller of kvfree_rcu_barrier() should make sure that it's invoked
>> >>> > only after a kvfree_rcu(X, rhs) call has returned, if the caller expects
>> >>> > the object X to be freed before kvfree_rcu_barrier() returns?
>> >>> 
>> >>> Hmm, the caller of kvfree_rcu(X, rhs) might have returned without filling up
>> >>> the rcu_sheaf fully and thus without submitting it to call_rcu(), then
>> >>> migrated to another cpu. Then it calls kvfree_rcu_barrier() while another
>> >>> unrelated kvfree_rcu(X, rhs) call on the previous cpu is for the same
>> >>> kmem_cache (kvfree_rcu_barrier() is not only for cache destruction), fills
>> >>> up the rcu_sheaf fully and is about to call_rcu() on it. And since that
>> >>> sheaf also contains the object X, we should make sure that is flushed.
>> >> 
>> >> I was going to say "but we queue and wait for the flushing work to
>> >> complete, so the sheaf containing object X should be flushed?"
>> >> 
>> >> But nah, that's true only if we see pcs->rcu_free != NULL in
>> >> flush_all_rcu_sheaves().
>> >> 
>> >> You are right...
>> >> 
>> >> Hmm, maybe it's simpler to fix this by never skipping queueing the work
>> >> even when pcs->rcu_sheaf == NULL?
>> > 
>> > I guess it's simpler, yeah.
>> 
>> So what about this? The unconditional queueing should cover all races with
>> __kfree_rcu_sheaf() so there's just unconditional rcu_barrier() in the end.
>> 
>> From 0722b29fa1625b31c05d659d1d988ec882247b38 Mon Sep 17 00:00:00 2001
>> From: Vlastimil Babka <vbabka@suse.cz>
>> Date: Wed, 3 Sep 2025 14:59:46 +0200
>> Subject: [PATCH] slab: add sheaf support for batching kfree_rcu() operations
>> 
>> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
>> For caches with sheaves, on each cpu maintain a rcu_free sheaf in
>> addition to main and spare sheaves.
>> 
>> kfree_rcu() operations will try to put objects on this sheaf. Once full,
>> the sheaf is detached and submitted to call_rcu() with a handler that
>> will try to put it in the barn, or flush to slab pages using bulk free,
>> when the barn is full. Then a new empty sheaf must be obtained to put
>> more objects there.
>> 
>> It's possible that no free sheaves are available to use for a new
>> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
>> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
>> kfree_rcu() implementation.
>> 
>> Expected advantages:
>> - batching the kfree_rcu() operations, that could eventually replace the
>>   existing batching
>> - sheaves can be reused for allocations via barn instead of being
>>   flushed to slabs, which is more efficient
>>   - this includes cases where only some cpus are allowed to process rcu
>>     callbacks (Android)
>> 
>> Possible disadvantage:
>> - objects might be waiting for more than their grace period (it is
>>   determined by the last object freed into the sheaf), increasing memory
>>   usage - but the existing batching does that too.
>> 
>> Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
>> implementation favors smaller memory footprint over performance.
>> 
>> Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
>> contexts where kfree_rcu() is called might not be compatible with taking
>> a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
>> spinlock - the current kfree_rcu() implementation avoids doing that.
>> 
>> Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
>> that have them. This is not a cheap operation, but the barrier usage is
>> rare - currently kmem_cache_destroy() or on module unload.
>> 
>> Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
>> count how many kfree_rcu() used the rcu_free sheaf successfully and how
>> many had to fall back to the existing implementation.
>> 
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
> 
> Looks good to me,
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

Thanks.

>> +do_free:
>> +
>> +	rcu_sheaf = pcs->rcu_free;
>> +
>> +	rcu_sheaf->objects[rcu_sheaf->size++] = obj;
>> +
>> +	if (likely(rcu_sheaf->size < s->sheaf_capacity))
>> +		rcu_sheaf = NULL;
>> +	else
>> +		pcs->rcu_free = NULL;
>> +
>> +	/*
>> +	 * we flush before local_unlock to make sure a racing
>> +	 * flush_all_rcu_sheaves() doesn't miss this sheaf
>> +	 */
>> +	if (rcu_sheaf)
>> +		call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);
> 
> nit: now we don't have to put this inside local_lock()~local_unlock()?

I think we still need to? AFAICS what I wrote before is still true:

The caller of kvfree_rcu(X, rhs) might have returned without filling up
the rcu_sheaf fully and thus without submitting it to call_rcu(), then
migrated to another cpu. Then it calls kvfree_rcu_barrier() while another
unrelated kvfree_rcu(X, rhs) call on the previous cpu is for the same
kmem_cache (kvfree_rcu_barrier() is not only for cache destruction), fills
up the rcu_sheaf fully and is about to call_rcu() on it.

If it were allowed to local_unlock() before doing the call_rcu(), it could
local_unlock(), get preempted, and our flush workqueue handler would only see
that there's no rcu_free sheaf and do nothing.
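
Spelled out as a timeline - an illustrative sketch only (simplified, names as
in the patch) of what could happen if the call_rcu() were issued only after
local_unlock():

  CPU0: __kfree_rcu_sheaf()            CPU1: kvfree_rcu_barrier()
  ----------------------------------   -------------------------------------
  local_trylock(&s->cpu_sheaves->lock)
  /* rcu_free sheaf becomes full */
  pcs->rcu_free = NULL
  local_unlock(&s->cpu_sheaves->lock)
  /* preempted before call_rcu() */
                                       flush_all_rcu_sheaves():
                                         queue_work_on(cpu0, flushwq, &sfw->work)
                                         flush_work(): flush_rcu_sheaf() runs on
                                           cpu0, sees pcs->rcu_free == NULL
                                       rcu_barrier()  /* sheaf not queued yet */
  call_rcu(&rcu_sheaf->rcu_head,
           rcu_free_sheaf)             /* objects freed only after the barrier */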

If it must call_rcu() before local_unlock(), our flush workqueue handler
will not execute on the cpu until it performs the call_rcu() and
local_unlock(), because it can't preempt that section (!RT) or will have to
wait on the local_lock() in flush_rcu_sheaf() (RT) - here it's important that
it takes the lock unconditionally.

Or am I missing something?



* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-09-19  7:02                       ` Vlastimil Babka
@ 2025-09-19  8:59                         ` Harry Yoo
  0 siblings, 0 replies; 95+ messages in thread
From: Harry Yoo @ 2025-09-19  8:59 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki,
	Sidhartha Kumar, linux-mm, linux-kernel, rcu, maple-tree,
	Paul E . McKenney

On Fri, Sep 19, 2025 at 09:02:22AM +0200, Vlastimil Babka wrote:
> On 9/19/25 08:47, Harry Yoo wrote:
> > On Thu, Sep 18, 2025 at 10:09:34AM +0200, Vlastimil Babka wrote:
> >> On 9/17/25 16:14, Vlastimil Babka wrote:
> >> > On 9/17/25 15:34, Harry Yoo wrote:
> >> >> On Wed, Sep 17, 2025 at 03:21:31PM +0200, Vlastimil Babka wrote:
> >> >>> On 9/17/25 15:07, Harry Yoo wrote:
> >> >>> > On Wed, Sep 17, 2025 at 02:05:49PM +0200, Vlastimil Babka wrote:
> >> >>> >> On 9/17/25 13:32, Harry Yoo wrote:
> >> >>> >> > On Wed, Sep 17, 2025 at 11:55:10AM +0200, Vlastimil Babka wrote:
> >> >>> >> >> On 9/17/25 10:30, Harry Yoo wrote:
> >> >>> >> >> > On Wed, Sep 10, 2025 at 10:01:06AM +0200, Vlastimil Babka wrote:
> >> >>> >> >> >> +				sfw->skip = true;
> >> >>> >> >> >> +				continue;
> >> >>> >> >> >> +			}
> >> >>> >> >> >>
> >> >>> >> >> >> +			INIT_WORK(&sfw->work, flush_rcu_sheaf);
> >> >>> >> >> >> +			sfw->skip = false;
> >> >>> >> >> >> +			sfw->s = s;
> >> >>> >> >> >> +			queue_work_on(cpu, flushwq, &sfw->work);
> >> >>> >> >> >> +			flushed = true;
> >> >>> >> >> >> +		}
> >> >>> >> >> >> +
> >> >>> >> >> >> +		for_each_online_cpu(cpu) {
> >> >>> >> >> >> +			sfw = &per_cpu(slub_flush, cpu);
> >> >>> >> >> >> +			if (sfw->skip)
> >> >>> >> >> >> +				continue;
> >> >>> >> >> >> +			flush_work(&sfw->work);
> >> >>> >> >> >> +		}
> >> >>> >> >> >> +
> >> >>> >> >> >> +		mutex_unlock(&flush_lock);
> >> >>> >> >> >> +	}
> >> >>> >> >> >> +
> >> >>> >> >> >> +	mutex_unlock(&slab_mutex);
> >> >>> >> >> >> +	cpus_read_unlock();
> >> >>> >> >> >> +
> >> >>> >> >> >> +	if (flushed)
> >> >>> >> >> >> +		rcu_barrier();
> >> >>> >> >> > 
> >> >>> >> >> > I think we need to call rcu_barrier() even if flushed == false?
> >> >>> >> >> > 
> >> >>> >> >> > Maybe a kvfree_rcu()'d object was already waiting for the rcu callback to
> >> >>> >> >> > be processed before flush_all_rcu_sheaves() is called, and
> >> >>> >> >> > in flush_all_rcu_sheaves() we skipped all (cache, cpu) pairs,
> >> >>> >> >> > so flushed == false but the rcu callback isn't processed yet
> >> >>> >> >> > by the end of the function?
> >> >>> >> >> > 
> >> >>> >> >> > That sounds like a very unlikely to happen in a realistic scenario,
> >> >>> >> >> > but still possible...
> >> >>> >> >> 
> >> >>> >> >> Yes also good point, will flush unconditionally.
> >> >>> >> >> 
> >> >>> >> >> Maybe in __kfree_rcu_sheaf() I should also move the call_rcu(...) before
> >> >>> >> >> local_unlock().
> >> >>> >> >>
> >> >>> >> >> So we don't end up seeing a NULL pcs->rcu_free in
> >> >>> >> >> flush_all_rcu_sheaves() because __kfree_rcu_sheaf() already set it to NULL,
> >> >>> >> >> but didn't yet do the call_rcu() as it got preempted after local_unlock().
> >> >>> >> > 
> >> >>> >> > Makes sense to me.
> >> >>> > 
> >> >>> > Wait, I'm confused.
> >> >>> > 
> >> >>> > I think the caller of kvfree_rcu_barrier() should make sure that it's invoked
> >> >>> > only after a kvfree_rcu(X, rhs) call has returned, if the caller expects
> >> >>> > the object X to be freed before kvfree_rcu_barrier() returns?
> >> >>> 
> >> >>> Hmm, the caller of kvfree_rcu(X, rhs) might have returned without filling up
> >> >>> the rcu_sheaf fully and thus without submitting it to call_rcu(), then
> >> >>> migrated to another cpu. Then it calls kvfree_rcu_barrier() while another
> >> >>> unrelated kvfree_rcu(X, rhs) call on the previous cpu is for the same
> >> >>> kmem_cache (kvfree_rcu_barrier() is not only for cache destruction), fills
> >> >>> up the rcu_sheaf fully and is about to call_rcu() on it. And since that
> >> >>> sheaf also contains the object X, we should make sure that is flushed.
> >> >> 
> >> >> I was going to say "but we queue and wait for the flushing work to
> >> >> complete, so the sheaf containing object X should be flushed?"
> >> >> 
> >> >> But nah, that's true only if we see pcs->rcu_free != NULL in
> >> >> flush_all_rcu_sheaves().
> >> >> 
> >> >> You are right...
> >> >> 
> >> >> Hmm, maybe it's simpler to fix this by never skipping queueing the work
> >> >> even when pcs->rcu_sheaf == NULL?
> >> > 
> >> > I guess it's simpler, yeah.
> >> 
> >> So what about this? The unconditional queueing should cover all races with
> >> __kfree_rcu_sheaf() so there's just unconditional rcu_barrier() in the end.
> >> 
> >> From 0722b29fa1625b31c05d659d1d988ec882247b38 Mon Sep 17 00:00:00 2001
> >> From: Vlastimil Babka <vbabka@suse.cz>
> >> Date: Wed, 3 Sep 2025 14:59:46 +0200
> >> Subject: [PATCH] slab: add sheaf support for batching kfree_rcu() operations
> >> 
> >> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
> >> For caches with sheaves, on each cpu maintain a rcu_free sheaf in
> >> addition to main and spare sheaves.
> >> 
> >> kfree_rcu() operations will try to put objects on this sheaf. Once full,
> >> the sheaf is detached and submitted to call_rcu() with a handler that
> >> will try to put it in the barn, or flush to slab pages using bulk free,
> >> when the barn is full. Then a new empty sheaf must be obtained to put
> >> more objects there.
> >> 
> >> It's possible that no free sheaves are available to use for a new
> >> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
> >> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
> >> kfree_rcu() implementation.
> >> 
> >> Expected advantages:
> >> - batching the kfree_rcu() operations, that could eventually replace the
> >>   existing batching
> >> - sheaves can be reused for allocations via barn instead of being
> >>   flushed to slabs, which is more efficient
> >>   - this includes cases where only some cpus are allowed to process rcu
> >>     callbacks (Android)
> >> 
> >> Possible disadvantage:
> >> - objects might be waiting for more than their grace period (it is
> >>   determined by the last object freed into the sheaf), increasing memory
> >>   usage - but the existing batching does that too.
> >> 
> >> Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
> >> implementation favors smaller memory footprint over performance.
> >> 
> >> Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
> >> contexts where kfree_rcu() is called might not be compatible with taking
> >> a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
> >> spinlock - the current kfree_rcu() implementation avoids doing that.
> >> 
> >> Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
> >> that have them. This is not a cheap operation, but the barrier usage is
> >> rare - currently kmem_cache_destroy() or on module unload.
> >> 
> >> Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
> >> count how many kfree_rcu() used the rcu_free sheaf successfully and how
> >> many had to fall back to the existing implementation.
> >> 
> >> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> >> ---
> > 
> > Looks good to me,
> > Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> 
> Thanks.
> 
> >> +do_free:
> >> +
> >> +	rcu_sheaf = pcs->rcu_free;
> >> +
> >> +	rcu_sheaf->objects[rcu_sheaf->size++] = obj;
> >> +
> >> +	if (likely(rcu_sheaf->size < s->sheaf_capacity))
> >> +		rcu_sheaf = NULL;
> >> +	else
> >> +		pcs->rcu_free = NULL;
> >> +
> >> +	/*
> >> +	 * we flush before local_unlock to make sure a racing
> >> +	 * flush_all_rcu_sheaves() doesn't miss this sheaf
> >> +	 */
> >> +	if (rcu_sheaf)
> >> +		call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);
> > 
> > nit: now we don't have to put this inside local_lock()~local_unlock()?
> 
> I think we still need to? AFAICS what I wrote before is still true:
> 
> The caller of kvfree_rcu(X, rhs) might have returned without filling up
> the rcu_sheaf fully and thus without submitting it to call_rcu(), then
> migrated to another cpu. Then it calls kvfree_rcu_barrier() while another
> unrelated kvfree_rcu(X, rhs) call on the previous cpu is for the same
> kmem_cache (kvfree_rcu_barrier() is not only for cache destruction), fills
> up the rcu_sheaf fully and is about to call_rcu() on it.
>
> If it were allowed to local_unlock() before doing the call_rcu(), it could
> local_unlock(), get preempted, and our flush workqueue handler would only see
> that there's no rcu_free sheaf and do nothing.

Oops, you're right. So even if a previous kvfree_rcu() has returned
before kvfree_rcu_barrier() is called, a later kvfree_rcu() call can
make the sheaf invisible to the flush workqueue handler if it does the
call_rcu() outside the critical section, because it can be preempted by
the workqueue handler after local_unlock() but before calling
call_rcu().

> If it must call_rcu() before local_unlock(), our flush workqueue handler
> will not execute on the cpu until it performs the call_rcu() and
> local_unlock(), because it can't preempt that section (!RT) or will have to
> wait on the local_lock() in flush_rcu_sheaf() (RT) - here it's important that
> it takes the lock unconditionally.

Right.

My nit was wrong and it looks good to me then!

-- 
Cheers,
Harry / Hyeonggon



* Re: [PATCH v8 01/23] locking/local_lock: Expose dep_map in local_trylock_t.
  2025-09-10  8:01 ` [PATCH v8 01/23] locking/local_lock: Expose dep_map in local_trylock_t Vlastimil Babka
@ 2025-09-24 16:49   ` Suren Baghdasaryan
  0 siblings, 0 replies; 95+ messages in thread
From: Suren Baghdasaryan @ 2025-09-24 16:49 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, Alexei Starovoitov,
	Sebastian Andrzej Siewior

On Wed, Sep 10, 2025 at 1:01 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> lockdep_is_held() macro assumes that "struct lockdep_map dep_map;"
> is a top level field of any lock that participates in LOCKDEP.
> Make it so for local_trylock_t.
>
> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  include/linux/local_lock_internal.h | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
> index d80b5306a2c0ccf95a3405b6b947b5f1f9a3bd38..949de37700dbc10feafc06d0b52382cf2e00c694 100644
> --- a/include/linux/local_lock_internal.h
> +++ b/include/linux/local_lock_internal.h
> @@ -17,7 +17,10 @@ typedef struct {
>
>  /* local_trylock() and local_trylock_irqsave() only work with local_trylock_t */
>  typedef struct {
> -       local_lock_t    llock;
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> +       struct lockdep_map      dep_map;
> +       struct task_struct      *owner;
> +#endif
>         u8              acquired;
>  } local_trylock_t;
>
> @@ -31,7 +34,7 @@ typedef struct {
>         .owner = NULL,
>
>  # define LOCAL_TRYLOCK_DEBUG_INIT(lockname)            \
> -       .llock = { LOCAL_LOCK_DEBUG_INIT((lockname).llock) },
> +       LOCAL_LOCK_DEBUG_INIT(lockname)
>
>  static inline void local_lock_acquire(local_lock_t *l)
>  {
> @@ -81,7 +84,7 @@ do {                                                          \
>         local_lock_debug_init(lock);                            \
>  } while (0)
>
> -#define __local_trylock_init(lock) __local_lock_init(lock.llock)
> +#define __local_trylock_init(lock) __local_lock_init((local_lock_t *)lock)
>
>  #define __spinlock_nested_bh_init(lock)                                \
>  do {                                                           \
>
> --
> 2.51.0
>
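
For reference, what the flattened dep_map enables is annotations like this
(illustrative snippet only; the percpu local_trylock_t named 'lock' is the one
added to the slub percpu sheaves struct later in the series):

	/* works only if dep_map is a top-level member of local_trylock_t */
	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));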



* Re: [PATCH v8 02/23] slab: simplify init_kmem_cache_nodes() error handling
  2025-09-10  8:01 ` [PATCH v8 02/23] slab: simplify init_kmem_cache_nodes() error handling Vlastimil Babka
@ 2025-09-24 16:52   ` Suren Baghdasaryan
  0 siblings, 0 replies; 95+ messages in thread
From: Suren Baghdasaryan @ 2025-09-24 16:52 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree

On Wed, Sep 10, 2025 at 1:01 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> We don't need to call free_kmem_cache_nodes() immediately when failing
> to allocate a kmem_cache_node, because when we return 0,
> do_kmem_cache_create() calls __kmem_cache_release() which also performs
> free_kmem_cache_nodes().
>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  mm/slub.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 30003763d224c2704a4b93082b8b47af12dcffc5..9f671ec76131c4b0b28d5d568aa45842b5efb6d4 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5669,10 +5669,8 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
>                 n = kmem_cache_alloc_node(kmem_cache_node,
>                                                 GFP_KERNEL, node);
>
> -               if (!n) {
> -                       free_kmem_cache_nodes(s);
> +               if (!n)
>                         return 0;
> -               }
>
>                 init_kmem_cache_node(n);
>                 s->node[node] = n;
>
> --
> 2.51.0
>



* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-09-18  8:09                   ` Vlastimil Babka
  2025-09-19  6:47                     ` Harry Yoo
@ 2025-09-25  4:35                     ` Suren Baghdasaryan
  2025-09-25  8:52                       ` Harry Yoo
  2025-09-26 10:08                       ` Vlastimil Babka
  1 sibling, 2 replies; 95+ messages in thread
From: Suren Baghdasaryan @ 2025-09-25  4:35 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Harry Yoo, Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Uladzislau Rezki, Sidhartha Kumar, linux-mm,
	linux-kernel, rcu, maple-tree, Paul E . McKenney

On Thu, Sep 18, 2025 at 1:09 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 9/17/25 16:14, Vlastimil Babka wrote:
> > On 9/17/25 15:34, Harry Yoo wrote:
> >> On Wed, Sep 17, 2025 at 03:21:31PM +0200, Vlastimil Babka wrote:
> >>> On 9/17/25 15:07, Harry Yoo wrote:
> >>> > On Wed, Sep 17, 2025 at 02:05:49PM +0200, Vlastimil Babka wrote:
> >>> >> On 9/17/25 13:32, Harry Yoo wrote:
> >>> >> > On Wed, Sep 17, 2025 at 11:55:10AM +0200, Vlastimil Babka wrote:
> >>> >> >> On 9/17/25 10:30, Harry Yoo wrote:
> >>> >> >> > On Wed, Sep 10, 2025 at 10:01:06AM +0200, Vlastimil Babka wrote:
> >>> >> >> >> +                          sfw->skip = true;
> >>> >> >> >> +                          continue;
> >>> >> >> >> +                  }
> >>> >> >> >>
> >>> >> >> >> +                  INIT_WORK(&sfw->work, flush_rcu_sheaf);
> >>> >> >> >> +                  sfw->skip = false;
> >>> >> >> >> +                  sfw->s = s;
> >>> >> >> >> +                  queue_work_on(cpu, flushwq, &sfw->work);
> >>> >> >> >> +                  flushed = true;
> >>> >> >> >> +          }
> >>> >> >> >> +
> >>> >> >> >> +          for_each_online_cpu(cpu) {
> >>> >> >> >> +                  sfw = &per_cpu(slub_flush, cpu);
> >>> >> >> >> +                  if (sfw->skip)
> >>> >> >> >> +                          continue;
> >>> >> >> >> +                  flush_work(&sfw->work);
> >>> >> >> >> +          }
> >>> >> >> >> +
> >>> >> >> >> +          mutex_unlock(&flush_lock);
> >>> >> >> >> +  }
> >>> >> >> >> +
> >>> >> >> >> +  mutex_unlock(&slab_mutex);
> >>> >> >> >> +  cpus_read_unlock();
> >>> >> >> >> +
> >>> >> >> >> +  if (flushed)
> >>> >> >> >> +          rcu_barrier();
> >>> >> >> >
> >>> >> >> > I think we need to call rcu_barrier() even if flushed == false?
> >>> >> >> >
> >>> >> >> > Maybe a kvfree_rcu()'d object was already waiting for the rcu callback to
> >>> >> >> > be processed before flush_all_rcu_sheaves() is called, and
> >>> >> >> > in flush_all_rcu_sheaves() we skipped all (cache, cpu) pairs,
> >>> >> >> > so flushed == false but the rcu callback isn't processed yet
> >>> >> >> > by the end of the function?
> >>> >> >> >
> >>> >> >> > That sounds like a very unlikely to happen in a realistic scenario,
> >>> >> >> > but still possible...
> >>> >> >>
> >>> >> >> Yes also good point, will flush unconditionally.
> >>> >> >>
> >>> >> >> Maybe in __kfree_rcu_sheaf() I should also move the call_rcu(...) before
> >>> >> >> local_unlock().
> >>> >> >>
> >>> >> >> So we don't end up seeing a NULL pcs->rcu_free in
> >>> >> >> flush_all_rcu_sheaves() because __kfree_rcu_sheaf() already set it to NULL,
> >>> >> >> but didn't yet do the call_rcu() as it got preempted after local_unlock().
> >>> >> >
> >>> >> > Makes sense to me.
> >>> >
> >>> > Wait, I'm confused.
> >>> >
> >>> > I think the caller of kvfree_rcu_barrier() should make sure that it's invoked
> >>> > only after a kvfree_rcu(X, rhs) call has returned, if the caller expects
> >>> > the object X to be freed before kvfree_rcu_barrier() returns?
> >>>
> >>> Hmm, the caller of kvfree_rcu(X, rhs) might have returned without filling up
> >>> the rcu_sheaf fully and thus without submitting it to call_rcu(), then
> >>> migrated to another cpu. Then it calls kvfree_rcu_barrier() while another
> >>> unrelated kvfree_rcu(X, rhs) call on the previous cpu is for the same
> >>> kmem_cache (kvfree_rcu_barrier() is not only for cache destruction), fills
> >>> up the rcu_sheaf fully and is about to call_rcu() on it. And since that
> >>> sheaf also contains the object X, we should make sure that is flushed.
> >>
> >> I was going to say "but we queue and wait for the flushing work to
> >> complete, so the sheaf containing object X should be flushed?"
> >>
> >> But nah, that's true only if we see pcs->rcu_free != NULL in
> >> flush_all_rcu_sheaves().
> >>
> >> You are right...
> >>
> >> Hmm, maybe it's simpler to fix this by never skipping queueing the work
> >> even when pcs->rcu_sheaf == NULL?
> >
> > I guess it's simpler, yeah.
>
> So what about this? The unconditional queueing should cover all races with
> __kfree_rcu_sheaf() so there's just unconditional rcu_barrier() in the end.
>
> From 0722b29fa1625b31c05d659d1d988ec882247b38 Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka <vbabka@suse.cz>
> Date: Wed, 3 Sep 2025 14:59:46 +0200
> Subject: [PATCH] slab: add sheaf support for batching kfree_rcu() operations
>
> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
> For caches with sheaves, on each cpu maintain a rcu_free sheaf in
> addition to main and spare sheaves.
>
> kfree_rcu() operations will try to put objects on this sheaf. Once full,
> the sheaf is detached and submitted to call_rcu() with a handler that
> will try to put it in the barn, or flush to slab pages using bulk free,
> when the barn is full. Then a new empty sheaf must be obtained to put
> more objects there.
>
> It's possible that no free sheaves are available to use for a new
> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
> kfree_rcu() implementation.
>
> Expected advantages:
> - batching the kfree_rcu() operations, that could eventually replace the
>   existing batching
> - sheaves can be reused for allocations via barn instead of being
>   flushed to slabs, which is more efficient
>   - this includes cases where only some cpus are allowed to process rcu
>     callbacks (Android)

nit: I would say it's more CONFIG_RCU_NOCB_CPU related. Android is
just an instance of that.

>
> Possible disadvantage:
> - objects might be waiting for more than their grace period (it is
>   determined by the last object freed into the sheaf), increasing memory
>   usage - but the existing batching does that too.
>
> Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
> implementation favors smaller memory footprint over performance.
>
> Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
> contexts where kfree_rcu() is called might not be compatible with taking
> a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
> spinlock - the current kfree_rcu() implementation avoids doing that.
>
> Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
> that have them. This is not a cheap operation, but the barrier usage is
> rare - currently kmem_cache_destroy() or on module unload.
>
> Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
> count how many kfree_rcu() used the rcu_free sheaf successfully and how
> many had to fall back to the existing implementation.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/slab.h        |   3 +
>  mm/slab_common.c |  26 +++++
>  mm/slub.c        | 267 ++++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 294 insertions(+), 2 deletions(-)
>
> diff --git a/mm/slab.h b/mm/slab.h
> index 206987ce44a4..e82e51c44bd0 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -435,6 +435,9 @@ static inline bool is_kmalloc_normal(struct kmem_cache *s)
>         return !(s->flags & (SLAB_CACHE_DMA|SLAB_ACCOUNT|SLAB_RECLAIM_ACCOUNT));
>  }
>
> +bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj);
> +void flush_all_rcu_sheaves(void);
> +
>  #define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | \
>                          SLAB_CACHE_DMA32 | SLAB_PANIC | \
>                          SLAB_TYPESAFE_BY_RCU | SLAB_DEBUG_OBJECTS | \
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index e2b197e47866..005a4319c06a 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1608,6 +1608,27 @@ static void kfree_rcu_work(struct work_struct *work)
>                 kvfree_rcu_list(head);
>  }
>
> +static bool kfree_rcu_sheaf(void *obj)
> +{
> +       struct kmem_cache *s;
> +       struct folio *folio;
> +       struct slab *slab;
> +
> +       if (is_vmalloc_addr(obj))
> +               return false;
> +
> +       folio = virt_to_folio(obj);
> +       if (unlikely(!folio_test_slab(folio)))
> +               return false;
> +
> +       slab = folio_slab(folio);
> +       s = slab->slab_cache;
> +       if (s->cpu_sheaves)
> +               return __kfree_rcu_sheaf(s, obj);
> +
> +       return false;
> +}
> +
>  static bool
>  need_offload_krc(struct kfree_rcu_cpu *krcp)
>  {
> @@ -1952,6 +1973,9 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
>         if (!head)
>                 might_sleep();
>
> +       if (!IS_ENABLED(CONFIG_PREEMPT_RT) && kfree_rcu_sheaf(ptr))
> +               return;
> +
>         // Queue the object but don't yet schedule the batch.
>         if (debug_rcu_head_queue(ptr)) {
>                 // Probable double kfree_rcu(), just leak.
> @@ -2026,6 +2050,8 @@ void kvfree_rcu_barrier(void)
>         bool queued;
>         int i, cpu;
>
> +       flush_all_rcu_sheaves();
> +
>         /*
>          * Firstly we detach objects and queue them over an RCU-batch
>          * for all CPUs. Finally queued works are flushed for each CPU.
> diff --git a/mm/slub.c b/mm/slub.c
> index cba188b7e04d..171273f90efd 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -367,6 +367,8 @@ enum stat_item {
>         ALLOC_FASTPATH,         /* Allocation from cpu slab */
>         ALLOC_SLOWPATH,         /* Allocation by getting a new cpu slab */
>         FREE_PCS,               /* Free to percpu sheaf */
> +       FREE_RCU_SHEAF,         /* Free to rcu_free sheaf */
> +       FREE_RCU_SHEAF_FAIL,    /* Failed to free to a rcu_free sheaf */
>         FREE_FASTPATH,          /* Free to cpu slab */
>         FREE_SLOWPATH,          /* Freeing not to cpu slab */
>         FREE_FROZEN,            /* Freeing to frozen slab */
> @@ -461,6 +463,7 @@ struct slab_sheaf {
>                 struct rcu_head rcu_head;
>                 struct list_head barn_list;
>         };
> +       struct kmem_cache *cache;
>         unsigned int size;
>         void *objects[];
>  };
> @@ -469,6 +472,7 @@ struct slub_percpu_sheaves {
>         local_trylock_t lock;
>         struct slab_sheaf *main; /* never NULL when unlocked */
>         struct slab_sheaf *spare; /* empty or full, may be NULL */
> +       struct slab_sheaf *rcu_free; /* for batching kfree_rcu() */
>  };
>
>  /*
> @@ -2531,6 +2535,8 @@ static struct slab_sheaf *alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp)
>         if (unlikely(!sheaf))
>                 return NULL;
>
> +       sheaf->cache = s;
> +
>         stat(s, SHEAF_ALLOC);
>
>         return sheaf;
> @@ -2655,6 +2661,43 @@ static void sheaf_flush_unused(struct kmem_cache *s, struct slab_sheaf *sheaf)
>         sheaf->size = 0;
>  }
>
> +static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
> +                                    struct slab_sheaf *sheaf)
> +{
> +       bool init = slab_want_init_on_free(s);
> +       void **p = &sheaf->objects[0];
> +       unsigned int i = 0;
> +
> +       while (i < sheaf->size) {
> +               struct slab *slab = virt_to_slab(p[i]);
> +
> +               memcg_slab_free_hook(s, slab, p + i, 1);
> +               alloc_tagging_slab_free_hook(s, slab, p + i, 1);
> +
> +               if (unlikely(!slab_free_hook(s, p[i], init, true))) {
> +                       p[i] = p[--sheaf->size];
> +                       continue;
> +               }
> +
> +               i++;
> +       }
> +}
> +
> +static void rcu_free_sheaf_nobarn(struct rcu_head *head)
> +{
> +       struct slab_sheaf *sheaf;
> +       struct kmem_cache *s;
> +
> +       sheaf = container_of(head, struct slab_sheaf, rcu_head);
> +       s = sheaf->cache;
> +
> +       __rcu_free_sheaf_prepare(s, sheaf);
> +
> +       sheaf_flush_unused(s, sheaf);
> +
> +       free_empty_sheaf(s, sheaf);
> +}
> +
>  /*
>   * Caller needs to make sure migration is disabled in order to fully flush
>   * single cpu's sheaves
> @@ -2667,7 +2710,7 @@ static void sheaf_flush_unused(struct kmem_cache *s, struct slab_sheaf *sheaf)
>  static void pcs_flush_all(struct kmem_cache *s)
>  {
>         struct slub_percpu_sheaves *pcs;
> -       struct slab_sheaf *spare;
> +       struct slab_sheaf *spare, *rcu_free;
>
>         local_lock(&s->cpu_sheaves->lock);
>         pcs = this_cpu_ptr(s->cpu_sheaves);
> @@ -2675,6 +2718,9 @@ static void pcs_flush_all(struct kmem_cache *s)
>         spare = pcs->spare;
>         pcs->spare = NULL;
>
> +       rcu_free = pcs->rcu_free;
> +       pcs->rcu_free = NULL;
> +
>         local_unlock(&s->cpu_sheaves->lock);
>
>         if (spare) {
> @@ -2682,6 +2728,9 @@ static void pcs_flush_all(struct kmem_cache *s)
>                 free_empty_sheaf(s, spare);
>         }
>
> +       if (rcu_free)
> +               call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
> +
>         sheaf_flush_main(s);
>  }
>
> @@ -2698,6 +2747,11 @@ static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
>                 free_empty_sheaf(s, pcs->spare);
>                 pcs->spare = NULL;
>         }
> +
> +       if (pcs->rcu_free) {
> +               call_rcu(&pcs->rcu_free->rcu_head, rcu_free_sheaf_nobarn);
> +               pcs->rcu_free = NULL;
> +       }
>  }
>
>  static void pcs_destroy(struct kmem_cache *s)
> @@ -2723,6 +2777,7 @@ static void pcs_destroy(struct kmem_cache *s)
>                  */
>
>                 WARN_ON(pcs->spare);
> +               WARN_ON(pcs->rcu_free);
>
>                 if (!WARN_ON(pcs->main->size)) {
>                         free_empty_sheaf(s, pcs->main);
> @@ -3780,7 +3835,7 @@ static bool has_pcs_used(int cpu, struct kmem_cache *s)
>
>         pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
>
> -       return (pcs->spare || pcs->main->size);
> +       return (pcs->spare || pcs->rcu_free || pcs->main->size);
>  }
>
>  /*
> @@ -3840,6 +3895,77 @@ static void flush_all(struct kmem_cache *s)
>         cpus_read_unlock();
>  }
>
> +static void flush_rcu_sheaf(struct work_struct *w)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +       struct slab_sheaf *rcu_free;
> +       struct slub_flush_work *sfw;
> +       struct kmem_cache *s;
> +
> +       sfw = container_of(w, struct slub_flush_work, work);
> +       s = sfw->s;
> +
> +       local_lock(&s->cpu_sheaves->lock);
> +       pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +       rcu_free = pcs->rcu_free;
> +       pcs->rcu_free = NULL;
> +
> +       local_unlock(&s->cpu_sheaves->lock);
> +
> +       if (rcu_free)
> +               call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
> +}
> +
> +
> +/* needed for kvfree_rcu_barrier() */
> +void flush_all_rcu_sheaves(void)
> +{
> +       struct slub_flush_work *sfw;
> +       struct kmem_cache *s;
> +       unsigned int cpu;
> +
> +       cpus_read_lock();
> +       mutex_lock(&slab_mutex);
> +
> +       list_for_each_entry(s, &slab_caches, list) {
> +               if (!s->cpu_sheaves)
> +                       continue;
> +
> +               mutex_lock(&flush_lock);
> +
> +               for_each_online_cpu(cpu) {
> +                       sfw = &per_cpu(slub_flush, cpu);
> +
> +                       /*
> +                        * we don't check if rcu_free sheaf exists - racing
> +                        * __kfree_rcu_sheaf() might have just removed it.
> +                        * by executing flush_rcu_sheaf() on the cpu we make
> +                        * sure the __kfree_rcu_sheaf() finished its call_rcu()
> +                        */
> +
> +                       INIT_WORK(&sfw->work, flush_rcu_sheaf);
> +                       sfw->skip = false;

I think you don't need this sfw->skip flag since you never skip anymore, right?

> +                       sfw->s = s;
> +                       queue_work_on(cpu, flushwq, &sfw->work);
> +               }
> +
> +               for_each_online_cpu(cpu) {
> +                       sfw = &per_cpu(slub_flush, cpu);
> +                       if (sfw->skip)
> +                               continue;
> +                       flush_work(&sfw->work);

I'm sure I'm missing something, but why can't we execute call_rcu()
from here instead of queueing the work which does call_rcu() and then
flushing all the queued work? I'm sure you have a good reason that
I'm missing.

> +               }
> +
> +               mutex_unlock(&flush_lock);
> +       }
> +
> +       mutex_unlock(&slab_mutex);
> +       cpus_read_unlock();
> +
> +       rcu_barrier();
> +}
> +
>  /*
>   * Use the cpu notifier to insure that the cpu slabs are flushed when
>   * necessary.
> @@ -5413,6 +5539,134 @@ bool free_to_pcs(struct kmem_cache *s, void *object)
>         return true;
>  }
>
> +static void rcu_free_sheaf(struct rcu_head *head)
> +{
> +       struct slab_sheaf *sheaf;
> +       struct node_barn *barn;
> +       struct kmem_cache *s;
> +
> +       sheaf = container_of(head, struct slab_sheaf, rcu_head);
> +
> +       s = sheaf->cache;
> +
> +       /*
> +        * This may remove some objects due to slab_free_hook() returning false,
> +        * so that the sheaf might no longer be completely full. But it's easier
> +        * to handle it as full (unless it became completely empty), as the code
> +        * handles it fine. The only downside is that sheaf will serve fewer
> +        * allocations when reused. It only happens due to debugging, which is a
> +        * performance hit anyway.
> +        */
> +       __rcu_free_sheaf_prepare(s, sheaf);
> +
> +       barn = get_node(s, numa_mem_id())->barn;
> +
> +       /* due to slab_free_hook() */
> +       if (unlikely(sheaf->size == 0))
> +               goto empty;
> +
> +       /*
> +        * Checking nr_full/nr_empty outside lock avoids contention in case the
> +        * barn is at the respective limit. Due to the race we might go over the
> +        * limit but that should be rare and harmless.
> +        */
> +
> +       if (data_race(barn->nr_full) < MAX_FULL_SHEAVES) {
> +               stat(s, BARN_PUT);
> +               barn_put_full_sheaf(barn, sheaf);
> +               return;
> +       }
> +
> +       stat(s, BARN_PUT_FAIL);
> +       sheaf_flush_unused(s, sheaf);
> +
> +empty:
> +       if (data_race(barn->nr_empty) < MAX_EMPTY_SHEAVES) {
> +               barn_put_empty_sheaf(barn, sheaf);
> +               return;
> +       }
> +
> +       free_empty_sheaf(s, sheaf);
> +}
> +
> +bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +       struct slab_sheaf *rcu_sheaf;
> +
> +       if (!local_trylock(&s->cpu_sheaves->lock))
> +               goto fail;
> +
> +       pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +       if (unlikely(!pcs->rcu_free)) {
> +
> +               struct slab_sheaf *empty;
> +               struct node_barn *barn;
> +
> +               if (pcs->spare && pcs->spare->size == 0) {
> +                       pcs->rcu_free = pcs->spare;
> +                       pcs->spare = NULL;
> +                       goto do_free;
> +               }
> +
> +               barn = get_barn(s);
> +
> +               empty = barn_get_empty_sheaf(barn);
> +
> +               if (empty) {
> +                       pcs->rcu_free = empty;
> +                       goto do_free;
> +               }
> +
> +               local_unlock(&s->cpu_sheaves->lock);
> +
> +               empty = alloc_empty_sheaf(s, GFP_NOWAIT);
> +
> +               if (!empty)
> +                       goto fail;
> +
> +               if (!local_trylock(&s->cpu_sheaves->lock)) {
> +                       barn_put_empty_sheaf(barn, empty);
> +                       goto fail;
> +               }
> +
> +               pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +               if (unlikely(pcs->rcu_free))
> +                       barn_put_empty_sheaf(barn, empty);
> +               else
> +                       pcs->rcu_free = empty;
> +       }
> +
> +do_free:
> +
> +       rcu_sheaf = pcs->rcu_free;
> +
> +       rcu_sheaf->objects[rcu_sheaf->size++] = obj;

nit: The above would result in an OOB write if we ever reached here with
a full rcu_sheaf (rcu_sheaf->size == s->sheaf_capacity), but I think
that's impossible. You always start with an empty rcu_sheaf, and objects
are added only here, with a check for a full rcu_sheaf right after. I
think a short comment clarifying that would be nice.
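
E.g. something like this (suggested wording only):

	/*
	 * pcs->rcu_free is always installed empty and is detached (and
	 * submitted to call_rcu()) as soon as it becomes full below, so
	 * there is always room for at least one more object here.
	 */
	rcu_sheaf->objects[rcu_sheaf->size++] = obj;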

> +
> +       if (likely(rcu_sheaf->size < s->sheaf_capacity))
> +               rcu_sheaf = NULL;
> +       else
> +               pcs->rcu_free = NULL;
> +
> +       /*
> +        * we flush before local_unlock to make sure a racing
> +        * flush_all_rcu_sheaves() doesn't miss this sheaf
> +        */
> +       if (rcu_sheaf)
> +               call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);
> +
> +       local_unlock(&s->cpu_sheaves->lock);
> +
> +       stat(s, FREE_RCU_SHEAF);
> +       return true;
> +
> +fail:
> +       stat(s, FREE_RCU_SHEAF_FAIL);
> +       return false;
> +}
> +
>  /*
>   * Bulk free objects to the percpu sheaves.
>   * Unlike free_to_pcs() this includes the calls to all necessary hooks
> @@ -6909,6 +7163,11 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
>         struct kmem_cache_node *n;
>
>         flush_all_cpus_locked(s);
> +
> +       /* we might have rcu sheaves in flight */
> +       if (s->cpu_sheaves)
> +               rcu_barrier();
> +
>         /* Attempt to free all objects */
>         for_each_kmem_cache_node(s, node, n) {
>                 if (n->barn)
> @@ -8284,6 +8543,8 @@ STAT_ATTR(ALLOC_PCS, alloc_cpu_sheaf);
>  STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
>  STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
>  STAT_ATTR(FREE_PCS, free_cpu_sheaf);
> +STAT_ATTR(FREE_RCU_SHEAF, free_rcu_sheaf);
> +STAT_ATTR(FREE_RCU_SHEAF_FAIL, free_rcu_sheaf_fail);
>  STAT_ATTR(FREE_FASTPATH, free_fastpath);
>  STAT_ATTR(FREE_SLOWPATH, free_slowpath);
>  STAT_ATTR(FREE_FROZEN, free_frozen);
> @@ -8382,6 +8643,8 @@ static struct attribute *slab_attrs[] = {
>         &alloc_fastpath_attr.attr,
>         &alloc_slowpath_attr.attr,
>         &free_cpu_sheaf_attr.attr,
> +       &free_rcu_sheaf_attr.attr,
> +       &free_rcu_sheaf_fail_attr.attr,
>         &free_fastpath_attr.attr,
>         &free_slowpath_attr.attr,
>         &free_frozen_attr.attr,
> --
> 2.51.0
>
>



* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-09-25  4:35                     ` Suren Baghdasaryan
@ 2025-09-25  8:52                       ` Harry Yoo
  2025-09-25 13:38                         ` Suren Baghdasaryan
  2025-09-26 10:08                       ` Vlastimil Babka
  1 sibling, 1 reply; 95+ messages in thread
From: Harry Yoo @ 2025-09-25  8:52 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Vlastimil Babka, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki,
	Sidhartha Kumar, linux-mm, linux-kernel, rcu, maple-tree,
	Paul E . McKenney

On Wed, Sep 24, 2025 at 09:35:05PM -0700, Suren Baghdasaryan wrote:
> On Thu, Sep 18, 2025 at 1:09 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 9/17/25 16:14, Vlastimil Babka wrote:
> > > On 9/17/25 15:34, Harry Yoo wrote:
> > >> On Wed, Sep 17, 2025 at 03:21:31PM +0200, Vlastimil Babka wrote:
> > >>> On 9/17/25 15:07, Harry Yoo wrote:
> > >>> > On Wed, Sep 17, 2025 at 02:05:49PM +0200, Vlastimil Babka wrote:
> > >>> >> On 9/17/25 13:32, Harry Yoo wrote:
> > >>> >> > On Wed, Sep 17, 2025 at 11:55:10AM +0200, Vlastimil Babka wrote:
> > >>> >> >> On 9/17/25 10:30, Harry Yoo wrote:
> > >>> >> >> > On Wed, Sep 10, 2025 at 10:01:06AM +0200, Vlastimil Babka wrote:
> > >>> >> >> >> +                          sfw->skip = true;
> > >>> >> >> >> +                          continue;
> > >>> >> >> >> +                  }
> > >>> >> >> >>
> > >>> >> >> >> +                  INIT_WORK(&sfw->work, flush_rcu_sheaf);
> > >>> >> >> >> +                  sfw->skip = false;
> > >>> >> >> >> +                  sfw->s = s;
> > >>> >> >> >> +                  queue_work_on(cpu, flushwq, &sfw->work);
> > >>> >> >> >> +                  flushed = true;
> > >>> >> >> >> +          }
> > >>> >> >> >> +
> > >>> >> >> >> +          for_each_online_cpu(cpu) {
> > >>> >> >> >> +                  sfw = &per_cpu(slub_flush, cpu);
> > >>> >> >> >> +                  if (sfw->skip)
> > >>> >> >> >> +                          continue;
> > >>> >> >> >> +                  flush_work(&sfw->work);
> > >>> >> >> >> +          }
> > >>> >> >> >> +
> > >>> >> >> >> +          mutex_unlock(&flush_lock);
> > >>> >> >> >> +  }
> > >>> >> >> >> +
> > >>> >> >> >> +  mutex_unlock(&slab_mutex);
> > >>> >> >> >> +  cpus_read_unlock();
> > >>> >> >> >> +
> > >>> >> >> >> +  if (flushed)
> > >>> >> >> >> +          rcu_barrier();
> > >>> >> >> >
> > >>> >> >> > I think we need to call rcu_barrier() even if flushed == false?
> > >>> >> >> >
> > >>> >> >> > Maybe a kvfree_rcu()'d object was already waiting for the rcu callback to
> > >>> >> >> > be processed before flush_all_rcu_sheaves() is called, and
> > >>> >> >> > in flush_all_rcu_sheaves() we skipped all (cache, cpu) pairs,
> > >>> >> >> > so flushed == false but the rcu callback isn't processed yet
> > >>> >> >> > by the end of the function?
> > >>> >> >> >
> > >>> >> >> > That sounds like a very unlikely to happen in a realistic scenario,
> > >>> >> >> > but still possible...
> > >>> >> >>
> > >>> >> >> Yes also good point, will flush unconditionally.
> > >>> >> >>
> > >>> >> >> Maybe in __kfree_rcu_sheaf() I should also move the call_rcu(...) before
> > >>> >> >> local_unlock().
> > >>> >> >>
> > >>> >> >> So we don't end up seeing a NULL pcs->rcu_free in
> > >>> >> >> flush_all_rcu_sheaves() because __kfree_rcu_sheaf() already set it to NULL,
> > >>> >> >> but didn't yet do the call_rcu() as it got preempted after local_unlock().
> > >>> >> >
> > >>> >> > Makes sense to me.
> > >>> >
> > >>> > Wait, I'm confused.
> > >>> >
> > >>> > I think the caller of kvfree_rcu_barrier() should make sure that it's invoked
> > >>> > only after a kvfree_rcu(X, rhs) call has returned, if the caller expects
> > >>> > the object X to be freed before kvfree_rcu_barrier() returns?
> > >>>
> > >>> Hmm, the caller of kvfree_rcu(X, rhs) might have returned without filling up
> > >>> the rcu_sheaf fully and thus without submitting it to call_rcu(), then
> > >>> migrated to another cpu. Then it calls kvfree_rcu_barrier() while another
> > >>> unrelated kvfree_rcu(X, rhs) call on the previous cpu is for the same
> > >>> kmem_cache (kvfree_rcu_barrier() is not only for cache destruction), fills
> > >>> up the rcu_sheaf fully and is about to call_rcu() on it. And since that
> > >>> sheaf also contains the object X, we should make sure that is flushed.
> > >>
> > >> I was going to say "but we queue and wait for the flushing work to
> > >> complete, so the sheaf containing object X should be flushed?"
> > >>
> > >> But nah, that's true only if we see pcs->rcu_free != NULL in
> > >> flush_all_rcu_sheaves().
> > >>
> > >> You are right...
> > >>
> > >> Hmm, maybe it's simpler to fix this by never skipping queueing the work
> > >> even when pcs->rcu_sheaf == NULL?
> > >
> > > I guess it's simpler, yeah.
> >
> > So what about this? The unconditional queueing should cover all races with
> > __kfree_rcu_sheaf() so there's just unconditional rcu_barrier() in the end.
> >
> > From 0722b29fa1625b31c05d659d1d988ec882247b38 Mon Sep 17 00:00:00 2001
> > From: Vlastimil Babka <vbabka@suse.cz>
> > Date: Wed, 3 Sep 2025 14:59:46 +0200
> > Subject: [PATCH] slab: add sheaf support for batching kfree_rcu() operations
> >
> > Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
> > For caches with sheaves, on each cpu maintain a rcu_free sheaf in
> > addition to main and spare sheaves.
> >
> > kfree_rcu() operations will try to put objects on this sheaf. Once full,
> > the sheaf is detached and submitted to call_rcu() with a handler that
> > will try to put it in the barn, or flush to slab pages using bulk free,
> > when the barn is full. Then a new empty sheaf must be obtained to put
> > more objects there.
> >
> > It's possible that no free sheaves are available to use for a new
> > rcu_free sheaf, and the allocation in kfree_rcu() context can only use
> > GFP_NOWAIT and thus may fail. In that case, fall back to the existing
> > kfree_rcu() implementation.
> >
> > Expected advantages:
> > - batching the kfree_rcu() operations, that could eventually replace the
> >   existing batching
> > - sheaves can be reused for allocations via barn instead of being
> >   flushed to slabs, which is more efficient
> >   - this includes cases where only some cpus are allowed to process rcu
> >     callbacks (Android)
> 
> nit: I would say it's more CONFIG_RCU_NOCB_CPU related. Android is
> just an instance of that.
> 
> >
> > Possible disadvantage:
> > - objects might be waiting for more than their grace period (it is
> >   determined by the last object freed into the sheaf), increasing memory
> >   usage - but the existing batching does that too.
> >
> > Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
> > implementation favors smaller memory footprint over performance.
> >
> > Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
> > contexts where kfree_rcu() is called might not be compatible with taking
> > a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
> > spinlock - the current kfree_rcu() implementation avoids doing that.
> >
> > Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
> > that have them. This is not a cheap operation, but the barrier usage is
> > rare - currently kmem_cache_destroy() or on module unload.
> >
> > Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
> > count how many kfree_rcu() used the rcu_free sheaf successfully and how
> > many had to fall back to the existing implementation.
> >
> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> > ---
> >  mm/slab.h        |   3 +
> >  mm/slab_common.c |  26 +++++
> >  mm/slub.c        | 267 ++++++++++++++++++++++++++++++++++++++++++++++-
> >  3 files changed, 294 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index cba188b7e04d..171273f90efd 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c

[...snip...]

> > @@ -3840,6 +3895,77 @@ static void flush_all(struct kmem_cache *s)
> >         cpus_read_unlock();
> >  }
> >
> > +static void flush_rcu_sheaf(struct work_struct *w)
> > +{
> > +       struct slub_percpu_sheaves *pcs;
> > +       struct slab_sheaf *rcu_free;
> > +       struct slub_flush_work *sfw;
> > +       struct kmem_cache *s;
> > +
> > +       sfw = container_of(w, struct slub_flush_work, work);
> > +       s = sfw->s;
> > +
> > +       local_lock(&s->cpu_sheaves->lock);
> > +       pcs = this_cpu_ptr(s->cpu_sheaves);
> > +
> > +       rcu_free = pcs->rcu_free;
> > +       pcs->rcu_free = NULL;
> > +
> > +       local_unlock(&s->cpu_sheaves->lock);
> > +
> > +       if (rcu_free)
> > +               call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
> > +}
> > +
> > +
> > +/* needed for kvfree_rcu_barrier() */
> > +void flush_all_rcu_sheaves(void)
> > +{
> > +       struct slub_flush_work *sfw;
> > +       struct kmem_cache *s;
> > +       unsigned int cpu;
> > +
> > +       cpus_read_lock();
> > +       mutex_lock(&slab_mutex);
> > +
> > +       list_for_each_entry(s, &slab_caches, list) {
> > +               if (!s->cpu_sheaves)
> > +                       continue;
> > +
> > +               mutex_lock(&flush_lock);
> > +
> > +               for_each_online_cpu(cpu) {
> > +                       sfw = &per_cpu(slub_flush, cpu);
> > +
> > +                       /*
> > +                        * we don't check if rcu_free sheaf exists - racing
> > +                        * __kfree_rcu_sheaf() might have just removed it.
> > +                        * by executing flush_rcu_sheaf() on the cpu we make
> > +                        * sure the __kfree_rcu_sheaf() finished its call_rcu()
> > +                        */
> > +
> > +                       INIT_WORK(&sfw->work, flush_rcu_sheaf);
> > +                       sfw->skip = false;
> 
> I think you don't need this sfw->skip flag since you never skip anymore, right?

Yes, at least in flush_all_rcu_sheaves().
I'm fine with or without sfw->skip in this function.

> > +                       sfw->s = s;
> > +                       queue_work_on(cpu, flushwq, &sfw->work);
> > +               }
> > +
> > +               for_each_online_cpu(cpu) {
> > +                       sfw = &per_cpu(slub_flush, cpu);
> > +                       if (sfw->skip)
> > +                               continue;
> > +                       flush_work(&sfw->work);
> 
> I'm sure I'm missing something but why can't we execute call_rcu()
> from here instead of queuing the work which does call_rcu() and then
> flushing all the queued work? I'm sure you have a good reason which
> I'm missing.

Because a local lock can only be taken by its own CPU, you can't detach the
rcu_free sheaf remotely and call call_rcu() on it. That's why the work is
queued on each CPU, ensuring the rcu_free sheaf is flushed by its local CPU.
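
To spell it out (this only restates what the patch above already does), the
detach has to happen in the worker that queue_work_on() pins to each cpu:

	/* runs on @cpu via queue_work_on(cpu, flushwq, &sfw->work) */
	local_lock(&s->cpu_sheaves->lock);
	pcs = this_cpu_ptr(s->cpu_sheaves);
	rcu_free = pcs->rcu_free;
	pcs->rcu_free = NULL;
	local_unlock(&s->cpu_sheaves->lock);

	if (rcu_free)
		call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);

Doing the same directly from flush_all_rcu_sheaves() would touch another
CPU's pcs without its local_lock.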

> > +               }
> > +
> > +               mutex_unlock(&flush_lock);
> > +       }
> > +
> > +       mutex_unlock(&slab_mutex);
> > +       cpus_read_unlock();
> > +
> > +       rcu_barrier();
> > +}
> > +
> >  /*
> >   * Use the cpu notifier to insure that the cpu slabs are flushed when
> >   * necessary.
> > +bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
> > +{
> > +       struct slub_percpu_sheaves *pcs;
> > +       struct slab_sheaf *rcu_sheaf;
> > +
> > +       if (!local_trylock(&s->cpu_sheaves->lock))
> > +               goto fail;
> > +
> > +       pcs = this_cpu_ptr(s->cpu_sheaves);
> > +
> > +       if (unlikely(!pcs->rcu_free)) {
> > +
> > +               struct slab_sheaf *empty;
> > +               struct node_barn *barn;
> > +
> > +               if (pcs->spare && pcs->spare->size == 0) {
> > +                       pcs->rcu_free = pcs->spare;
> > +                       pcs->spare = NULL;
> > +                       goto do_free;
> > +               }
> > +
> > +               barn = get_barn(s);
> > +
> > +               empty = barn_get_empty_sheaf(barn);
> > +
> > +               if (empty) {
> > +                       pcs->rcu_free = empty;
> > +                       goto do_free;
> > +               }
> > +
> > +               local_unlock(&s->cpu_sheaves->lock);
> > +
> > +               empty = alloc_empty_sheaf(s, GFP_NOWAIT);
> > +
> > +               if (!empty)
> > +                       goto fail;
> > +
> > +               if (!local_trylock(&s->cpu_sheaves->lock)) {
> > +                       barn_put_empty_sheaf(barn, empty);
> > +                       goto fail;
> > +               }
> > +
> > +               pcs = this_cpu_ptr(s->cpu_sheaves);
> > +
> > +               if (unlikely(pcs->rcu_free))
> > +                       barn_put_empty_sheaf(barn, empty);
> > +               else
> > +                       pcs->rcu_free = empty;
> > +       }
> > +
> > +do_free:
> > +
> > +       rcu_sheaf = pcs->rcu_free;
> > +
> > +       rcu_sheaf->objects[rcu_sheaf->size++] = obj;
> 
> nit: The above would result in OOB write if we ever reached here with
> a full rcu_sheaf (rcu_sheaf->size == rcu_sheaf->sheaf_capacity) but I
> think it's impossible. You always start with an empty rcu_sheaf and
> objects are added only here with a following check for a full
> rcu_sheaf. I think a short comment clarifying that would be nice.

Sounds good to me.

-- 
Cheers,
Harry / Hyeonggon

> > +
> > +       if (likely(rcu_sheaf->size < s->sheaf_capacity))
> > +               rcu_sheaf = NULL;
> > +       else
> > +               pcs->rcu_free = NULL;
> > +
> > +       /*
> > +        * we flush before local_unlock to make sure a racing
> > +        * flush_all_rcu_sheaves() doesn't miss this sheaf
> > +        */
> > +       if (rcu_sheaf)
> > +               call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);
> > +
> > +       local_unlock(&s->cpu_sheaves->lock);
> > +
> > +       stat(s, FREE_RCU_SHEAF);
> > +       return true;
> > +
> > +fail:
> > +       stat(s, FREE_RCU_SHEAF_FAIL);
> > +       return false;
> > +}


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-09-25  8:52                       ` Harry Yoo
@ 2025-09-25 13:38                         ` Suren Baghdasaryan
  0 siblings, 0 replies; 95+ messages in thread
From: Suren Baghdasaryan @ 2025-09-25 13:38 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Vlastimil Babka, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki,
	Sidhartha Kumar, linux-mm, linux-kernel, rcu, maple-tree,
	Paul E . McKenney

On Thu, Sep 25, 2025 at 1:52 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Wed, Sep 24, 2025 at 09:35:05PM -0700, Suren Baghdasaryan wrote:
> > On Thu, Sep 18, 2025 at 1:09 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > >
> > > On 9/17/25 16:14, Vlastimil Babka wrote:
> > > > On 9/17/25 15:34, Harry Yoo wrote:
> > > >> On Wed, Sep 17, 2025 at 03:21:31PM +0200, Vlastimil Babka wrote:
> > > >>> On 9/17/25 15:07, Harry Yoo wrote:
> > > >>> > On Wed, Sep 17, 2025 at 02:05:49PM +0200, Vlastimil Babka wrote:
> > > >>> >> On 9/17/25 13:32, Harry Yoo wrote:
> > > >>> >> > On Wed, Sep 17, 2025 at 11:55:10AM +0200, Vlastimil Babka wrote:
> > > >>> >> >> On 9/17/25 10:30, Harry Yoo wrote:
> > > >>> >> >> > On Wed, Sep 10, 2025 at 10:01:06AM +0200, Vlastimil Babka wrote:
> > > >>> >> >> >> +                          sfw->skip = true;
> > > >>> >> >> >> +                          continue;
> > > >>> >> >> >> +                  }
> > > >>> >> >> >>
> > > >>> >> >> >> +                  INIT_WORK(&sfw->work, flush_rcu_sheaf);
> > > >>> >> >> >> +                  sfw->skip = false;
> > > >>> >> >> >> +                  sfw->s = s;
> > > >>> >> >> >> +                  queue_work_on(cpu, flushwq, &sfw->work);
> > > >>> >> >> >> +                  flushed = true;
> > > >>> >> >> >> +          }
> > > >>> >> >> >> +
> > > >>> >> >> >> +          for_each_online_cpu(cpu) {
> > > >>> >> >> >> +                  sfw = &per_cpu(slub_flush, cpu);
> > > >>> >> >> >> +                  if (sfw->skip)
> > > >>> >> >> >> +                          continue;
> > > >>> >> >> >> +                  flush_work(&sfw->work);
> > > >>> >> >> >> +          }
> > > >>> >> >> >> +
> > > >>> >> >> >> +          mutex_unlock(&flush_lock);
> > > >>> >> >> >> +  }
> > > >>> >> >> >> +
> > > >>> >> >> >> +  mutex_unlock(&slab_mutex);
> > > >>> >> >> >> +  cpus_read_unlock();
> > > >>> >> >> >> +
> > > >>> >> >> >> +  if (flushed)
> > > >>> >> >> >> +          rcu_barrier();
> > > >>> >> >> >
> > > >>> >> >> > I think we need to call rcu_barrier() even if flushed == false?
> > > >>> >> >> >
> > > >>> >> >> > Maybe a kvfree_rcu()'d object was already waiting for the rcu callback to
> > > >>> >> >> > be processed before flush_all_rcu_sheaves() is called, and
> > > >>> >> >> > in flush_all_rcu_sheaves() we skipped all (cache, cpu) pairs,
> > > >>> >> >> > so flushed == false but the rcu callback isn't processed yet
> > > >>> >> >> > by the end of the function?
> > > >>> >> >> >
> > > >>> >> >> > That sounds like a very unlikely to happen in a realistic scenario,
> > > >>> >> >> > but still possible...
> > > >>> >> >>
> > > >>> >> >> Yes also good point, will flush unconditionally.
> > > >>> >> >>
> > > >>> >> >> Maybe in __kfree_rcu_sheaf() I should also move the call_rcu(...) before
> > > >>> >> >> local_unlock().
> > > >>> >> >>
> > > >>> >> >> So we don't end up seeing a NULL pcs->rcu_free in
> > > >>> >> >> flush_all_rcu_sheaves() because __kfree_rcu_sheaf() already set it to NULL,
> > > >>> >> >> but didn't yet do the call_rcu() as it got preempted after local_unlock().
> > > >>> >> >
> > > >>> >> > Makes sense to me.
> > > >>> >
> > > >>> > Wait, I'm confused.
> > > >>> >
> > > >>> > I think the caller of kvfree_rcu_barrier() should make sure that it's invoked
> > > >>> > only after a kvfree_rcu(X, rhs) call has returned, if the caller expects
> > > >>> > the object X to be freed before kvfree_rcu_barrier() returns?
> > > >>>
> > > >>> Hmm, the caller of kvfree_rcu(X, rhs) might have returned without filling up
> > > >>> the rcu_sheaf fully and thus without submitting it to call_rcu(), then
> > > >>> migrated to another cpu. Then it calls kvfree_rcu_barrier() while another
> > > >>> unrelated kvfree_rcu(X, rhs) call on the previous cpu is for the same
> > > >>> kmem_cache (kvfree_rcu_barrier() is not only for cache destruction), fills
> > > >>> up the rcu_sheaf fully and is about to call_rcu() on it. And since that
> > > >>> sheaf also contains the object X, we should make sure that is flushed.
> > > >>
> > > >> I was going to say "but we queue and wait for the flushing work to
> > > >> complete, so the sheaf containing object X should be flushed?"
> > > >>
> > > >> But nah, that's true only if we see pcs->rcu_free != NULL in
> > > >> flush_all_rcu_sheaves().
> > > >>
> > > >> You are right...
> > > >>
> > > >> Hmm, maybe it's simpler to fix this by never skipping queueing the work
> > > >> even when pcs->rcu_sheaf == NULL?
> > > >
> > > > I guess it's simpler, yeah.
> > >
> > > So what about this? The unconditional queueing should cover all races with
> > > __kfree_rcu_sheaf() so there's just unconditional rcu_barrier() in the end.
> > >
> > > From 0722b29fa1625b31c05d659d1d988ec882247b38 Mon Sep 17 00:00:00 2001
> > > From: Vlastimil Babka <vbabka@suse.cz>
> > > Date: Wed, 3 Sep 2025 14:59:46 +0200
> > > Subject: [PATCH] slab: add sheaf support for batching kfree_rcu() operations
> > >
> > > Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
> > > For caches with sheaves, on each cpu maintain a rcu_free sheaf in
> > > addition to main and spare sheaves.
> > >
> > > kfree_rcu() operations will try to put objects on this sheaf. Once full,
> > > the sheaf is detached and submitted to call_rcu() with a handler that
> > > will try to put it in the barn, or flush to slab pages using bulk free,
> > > when the barn is full. Then a new empty sheaf must be obtained to put
> > > more objects there.
> > >
> > > It's possible that no free sheaves are available to use for a new
> > > rcu_free sheaf, and the allocation in kfree_rcu() context can only use
> > > GFP_NOWAIT and thus may fail. In that case, fall back to the existing
> > > kfree_rcu() implementation.
> > >
> > > Expected advantages:
> > > - batching the kfree_rcu() operations, that could eventually replace the
> > >   existing batching
> > > - sheaves can be reused for allocations via barn instead of being
> > >   flushed to slabs, which is more efficient
> > >   - this includes cases where only some cpus are allowed to process rcu
> > >     callbacks (Android)
> >
> > nit: I would say it's more CONFIG_RCU_NOCB_CPU related. Android is
> > just an instance of that.
> >
> > >
> > > Possible disadvantage:
> > > - objects might be waiting for more than their grace period (it is
> > >   determined by the last object freed into the sheaf), increasing memory
> > >   usage - but the existing batching does that too.
> > >
> > > Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
> > > implementation favors smaller memory footprint over performance.
> > >
> > > Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
> > > contexts where kfree_rcu() is called might not be compatible with taking
> > > a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
> > > spinlock - the current kfree_rcu() implementation avoids doing that.
> > >
> > > Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
> > > that have them. This is not a cheap operation, but the barrier usage is
> > > rare - currently kmem_cache_destroy() or on module unload.
> > >
> > > Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
> > > count how many kfree_rcu() used the rcu_free sheaf successfully and how
> > > many had to fall back to the existing implementation.
> > >
> > > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> > > ---
> > >  mm/slab.h        |   3 +
> > >  mm/slab_common.c |  26 +++++
> > >  mm/slub.c        | 267 ++++++++++++++++++++++++++++++++++++++++++++++-
> > >  3 files changed, 294 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/mm/slub.c b/mm/slub.c
> > > index cba188b7e04d..171273f90efd 100644
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
>
> [...snip...]
>
> > > @@ -3840,6 +3895,77 @@ static void flush_all(struct kmem_cache *s)
> > >         cpus_read_unlock();
> > >  }
> > >
> > > +static void flush_rcu_sheaf(struct work_struct *w)
> > > +{
> > > +       struct slub_percpu_sheaves *pcs;
> > > +       struct slab_sheaf *rcu_free;
> > > +       struct slub_flush_work *sfw;
> > > +       struct kmem_cache *s;
> > > +
> > > +       sfw = container_of(w, struct slub_flush_work, work);
> > > +       s = sfw->s;
> > > +
> > > +       local_lock(&s->cpu_sheaves->lock);
> > > +       pcs = this_cpu_ptr(s->cpu_sheaves);
> > > +
> > > +       rcu_free = pcs->rcu_free;
> > > +       pcs->rcu_free = NULL;
> > > +
> > > +       local_unlock(&s->cpu_sheaves->lock);
> > > +
> > > +       if (rcu_free)
> > > +               call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
> > > +}
> > > +
> > > +
> > > +/* needed for kvfree_rcu_barrier() */
> > > +void flush_all_rcu_sheaves(void)
> > > +{
> > > +       struct slub_flush_work *sfw;
> > > +       struct kmem_cache *s;
> > > +       unsigned int cpu;
> > > +
> > > +       cpus_read_lock();
> > > +       mutex_lock(&slab_mutex);
> > > +
> > > +       list_for_each_entry(s, &slab_caches, list) {
> > > +               if (!s->cpu_sheaves)
> > > +                       continue;
> > > +
> > > +               mutex_lock(&flush_lock);
> > > +
> > > +               for_each_online_cpu(cpu) {
> > > +                       sfw = &per_cpu(slub_flush, cpu);
> > > +
> > > +                       /*
> > > +                        * we don't check if rcu_free sheaf exists - racing
> > > +                        * __kfree_rcu_sheaf() might have just removed it.
> > > +                        * by executing flush_rcu_sheaf() on the cpu we make
> > > +                        * sure the __kfree_rcu_sheaf() finished its call_rcu()
> > > +                        */
> > > +
> > > +                       INIT_WORK(&sfw->work, flush_rcu_sheaf);
> > > +                       sfw->skip = false;
> >
> > I think you don't need this sfw->skip flag since you never skip anymore, right?
>
> Yes, at least in flush_all_rcu_sheaves().
> I'm fine with or without sfw->skip in this function.
>
> > > +                       sfw->s = s;
> > > +                       queue_work_on(cpu, flushwq, &sfw->work);
> > > +               }
> > > +
> > > +               for_each_online_cpu(cpu) {
> > > +                       sfw = &per_cpu(slub_flush, cpu);
> > > +                       if (sfw->skip)
> > > +                               continue;
> > > +                       flush_work(&sfw->work);
> >
> > I'm sure I'm missing something but why can't we execute call_rcu()
> > from here instead of queuing the work which does call_rcu() and then
> > flushing all the queued work? I'm sure you have a good reason which
> > I'm missing.
>
> Because a local lock can only be taken by its own CPU, you can't detach the
> rcu_free sheaf remotely and call call_rcu() on it. That's why the work is
> queued on each CPU, ensuring the rcu_free sheaf is flushed by its local CPU.

Ah, yes, of course. I knew it was something obvious but my brain was
too tired. Thanks for the explanation, Harry!

>
> > > +               }
> > > +
> > > +               mutex_unlock(&flush_lock);
> > > +       }
> > > +
> > > +       mutex_unlock(&slab_mutex);
> > > +       cpus_read_unlock();
> > > +
> > > +       rcu_barrier();
> > > +}
> > > +
> > >  /*
> > >   * Use the cpu notifier to insure that the cpu slabs are flushed when
> > >   * necessary.
> > > +bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
> > > +{
> > > +       struct slub_percpu_sheaves *pcs;
> > > +       struct slab_sheaf *rcu_sheaf;
> > > +
> > > +       if (!local_trylock(&s->cpu_sheaves->lock))
> > > +               goto fail;
> > > +
> > > +       pcs = this_cpu_ptr(s->cpu_sheaves);
> > > +
> > > +       if (unlikely(!pcs->rcu_free)) {
> > > +
> > > +               struct slab_sheaf *empty;
> > > +               struct node_barn *barn;
> > > +
> > > +               if (pcs->spare && pcs->spare->size == 0) {
> > > +                       pcs->rcu_free = pcs->spare;
> > > +                       pcs->spare = NULL;
> > > +                       goto do_free;
> > > +               }
> > > +
> > > +               barn = get_barn(s);
> > > +
> > > +               empty = barn_get_empty_sheaf(barn);
> > > +
> > > +               if (empty) {
> > > +                       pcs->rcu_free = empty;
> > > +                       goto do_free;
> > > +               }
> > > +
> > > +               local_unlock(&s->cpu_sheaves->lock);
> > > +
> > > +               empty = alloc_empty_sheaf(s, GFP_NOWAIT);
> > > +
> > > +               if (!empty)
> > > +                       goto fail;
> > > +
> > > +               if (!local_trylock(&s->cpu_sheaves->lock)) {
> > > +                       barn_put_empty_sheaf(barn, empty);
> > > +                       goto fail;
> > > +               }
> > > +
> > > +               pcs = this_cpu_ptr(s->cpu_sheaves);
> > > +
> > > +               if (unlikely(pcs->rcu_free))
> > > +                       barn_put_empty_sheaf(barn, empty);
> > > +               else
> > > +                       pcs->rcu_free = empty;
> > > +       }
> > > +
> > > +do_free:
> > > +
> > > +       rcu_sheaf = pcs->rcu_free;
> > > +
> > > +       rcu_sheaf->objects[rcu_sheaf->size++] = obj;
> >
> > nit: The above would result in OOB write if we ever reached here with
> > a full rcu_sheaf (rcu_sheaf->size == rcu_sheaf->sheaf_capacity) but I
> > think it's impossible. You always start with an empty rcu_sheaf and
> > objects are added only here with a following check for a full
> > rcu_sheaf. I think a short comment clarifying that would be nice.
>
> Sounds good to me.
>
> --
> Cheers,
> Harry / Hyeonggon
>
> > > +
> > > +       if (likely(rcu_sheaf->size < s->sheaf_capacity))
> > > +               rcu_sheaf = NULL;
> > > +       else
> > > +               pcs->rcu_free = NULL;
> > > +
> > > +       /*
> > > +        * we flush before local_unlock to make sure a racing
> > > +        * flush_all_rcu_sheaves() doesn't miss this sheaf
> > > +        */
> > > +       if (rcu_sheaf)
> > > +               call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);
> > > +
> > > +       local_unlock(&s->cpu_sheaves->lock);
> > > +
> > > +       stat(s, FREE_RCU_SHEAF);
> > > +       return true;
> > > +
> > > +fail:
> > > +       stat(s, FREE_RCU_SHEAF_FAIL);
> > > +       return false;
> > > +}


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 07/23] slab: skip percpu sheaves for remote object freeing
  2025-09-10  8:01 ` [PATCH v8 07/23] slab: skip percpu sheaves for remote object freeing Vlastimil Babka
@ 2025-09-25 16:14   ` Suren Baghdasaryan
  0 siblings, 0 replies; 95+ messages in thread
From: Suren Baghdasaryan @ 2025-09-25 16:14 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree

On Wed, Sep 10, 2025 at 1:01 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> Since we don't control the NUMA locality of objects in percpu sheaves,
> allocations with node restrictions bypass them. Allocations without
> restrictions may however still expect to get local objects with high
> probability, and the introduction of sheaves can decrease it due to
> freed objects from a remote node ending up in percpu sheaves.
>
> The fraction of such remote frees seems low (5% on an 8-node machine)
> but it can be expected that some cache or workload specific corner cases
> exist. We can either conclude that this is not a problem due to the low
> fraction, or we can make remote frees bypass percpu sheaves and go
> directly to their slabs. This will make the remote frees more expensive,
> but if if's only a small fraction, most frees will still benefit from

s/if's/it's

> the lower overhead of percpu sheaves.
>
> This patch thus makes remote object freeing bypass percpu sheaves,
> including bulk freeing, and kfree_rcu() via the rcu_free sheaf. However
> it's not intended to be 100% guarantee that percpu sheaves will only
> contain local objects. The refill from slabs does not provide that
> guarantee in the first place, and there might be cpu migrations
> happening when we need to unlock the local_lock. Avoiding all that could
> be possible but complicated so we can leave it for later investigation
> whether it would be worth it. It can be expected that the more selective
> freeing will itself prevent accumulation of remote objects in percpu
> sheaves so any such violations would have only short-term effects.
>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  mm/slab_common.c |  7 +++++--
>  mm/slub.c        | 42 ++++++++++++++++++++++++++++++++++++------
>  2 files changed, 41 insertions(+), 8 deletions(-)
>
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 005a4319c06a01d2b616a75396fcc43766a62ddb..b6601e0fe598e24bd8d456dce4fc82c65b342bfd 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1623,8 +1623,11 @@ static bool kfree_rcu_sheaf(void *obj)
>
>         slab = folio_slab(folio);
>         s = slab->slab_cache;
> -       if (s->cpu_sheaves)
> -               return __kfree_rcu_sheaf(s, obj);
> +       if (s->cpu_sheaves) {
> +               if (likely(!IS_ENABLED(CONFIG_NUMA) ||
> +                          slab_nid(slab) == numa_mem_id()))
> +                       return __kfree_rcu_sheaf(s, obj);
> +       }
>
>         return false;
>  }
> diff --git a/mm/slub.c b/mm/slub.c
> index 35274ce4e709c9da7ac8f9006c824f28709e923d..9699d048b2cd08ee75c4cc3d1e460868704520b1 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -472,6 +472,7 @@ struct slab_sheaf {
>         };
>         struct kmem_cache *cache;
>         unsigned int size;
> +       int node; /* only used for rcu_sheaf */
>         void *objects[];
>  };
>
> @@ -5828,7 +5829,7 @@ static void rcu_free_sheaf(struct rcu_head *head)
>          */
>         __rcu_free_sheaf_prepare(s, sheaf);
>
> -       barn = get_node(s, numa_mem_id())->barn;
> +       barn = get_node(s, sheaf->node)->barn;
>
>         /* due to slab_free_hook() */
>         if (unlikely(sheaf->size == 0))
> @@ -5914,10 +5915,12 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
>
>         rcu_sheaf->objects[rcu_sheaf->size++] = obj;
>
> -       if (likely(rcu_sheaf->size < s->sheaf_capacity))
> +       if (likely(rcu_sheaf->size < s->sheaf_capacity)) {
>                 rcu_sheaf = NULL;
> -       else
> +       } else {
>                 pcs->rcu_free = NULL;
> +               rcu_sheaf->node = numa_mem_id();
> +       }
>
>         local_unlock(&s->cpu_sheaves->lock);
>
> @@ -5944,7 +5947,11 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
>         bool init = slab_want_init_on_free(s);
>         unsigned int batch, i = 0;
>         struct node_barn *barn;
> +       void *remote_objects[PCS_BATCH_MAX];
> +       unsigned int remote_nr = 0;
> +       int node = numa_mem_id();
>
> +next_remote_batch:
>         while (i < size) {
>                 struct slab *slab = virt_to_slab(p[i]);
>
> @@ -5954,7 +5961,15 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
>                 if (unlikely(!slab_free_hook(s, p[i], init, false))) {
>                         p[i] = p[--size];
>                         if (!size)
> -                               return;
> +                               goto flush_remote;
> +                       continue;
> +               }
> +
> +               if (unlikely(IS_ENABLED(CONFIG_NUMA) && slab_nid(slab) != node)) {
> +                       remote_objects[remote_nr] = p[i];
> +                       p[i] = p[--size];
> +                       if (++remote_nr >= PCS_BATCH_MAX)
> +                               goto flush_remote;
>                         continue;
>                 }
>
> @@ -6024,6 +6039,15 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
>          */
>  fallback:
>         __kmem_cache_free_bulk(s, size, p);
> +
> +flush_remote:
> +       if (remote_nr) {
> +               __kmem_cache_free_bulk(s, remote_nr, &remote_objects[0]);
> +               if (i < size) {
> +                       remote_nr = 0;
> +                       goto next_remote_batch;
> +               }
> +       }
>  }
>
>  #ifndef CONFIG_SLUB_TINY
> @@ -6115,8 +6139,14 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
>         if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
>                 return;
>
> -       if (!s->cpu_sheaves || !free_to_pcs(s, object))
> -               do_slab_free(s, slab, object, object, 1, addr);
> +       if (s->cpu_sheaves && likely(!IS_ENABLED(CONFIG_NUMA) ||
> +                                    slab_nid(slab) == numa_mem_id())) {
> +               if (likely(free_to_pcs(s, object))) {

nit: no need for curly braces here.
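
i.e. it could simply be:

	if (likely(free_to_pcs(s, object)))
		return;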

> +                       return;
> +               }
> +       }
> +
> +       do_slab_free(s, slab, object, object, 1, addr);
>  }
>
>  #ifdef CONFIG_MEMCG
>
> --
> 2.51.0
>


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 08/23] slab: allow NUMA restricted allocations to use percpu sheaves
  2025-09-10  8:01 ` [PATCH v8 08/23] slab: allow NUMA restricted allocations to use percpu sheaves Vlastimil Babka
@ 2025-09-25 16:27   ` Suren Baghdasaryan
  0 siblings, 0 replies; 95+ messages in thread
From: Suren Baghdasaryan @ 2025-09-25 16:27 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree

On Wed, Sep 10, 2025 at 1:01 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> Currently allocations asking for a specific node explicitly or via
> mempolicy in strict_numa mode bypass percpu sheaves. Since sheaves
> contain mostly local objects, we can try allocating from them if the
> local node happens to be the requested node or allowed by the mempolicy.
> If we find the object from percpu sheaves is not from the expected node,
> we skip the sheaves - this should be rare.
>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  mm/slub.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 46 insertions(+), 7 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 9699d048b2cd08ee75c4cc3d1e460868704520b1..3746c0229cc2f9658a589416c63c21fbf2850c44 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4888,18 +4888,43 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
>  }
>
>  static __fastpath_inline
> -void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
> +void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp, int node)
>  {
>         struct slub_percpu_sheaves *pcs;
> +       bool node_requested;
>         void *object;
>
>  #ifdef CONFIG_NUMA
> -       if (static_branch_unlikely(&strict_numa)) {
> -               if (current->mempolicy)
> -                       return NULL;
> +       if (static_branch_unlikely(&strict_numa) &&
> +                        node == NUMA_NO_NODE) {
> +
> +               struct mempolicy *mpol = current->mempolicy;
> +
> +               if (mpol) {
> +                       /*
> +                        * Special BIND rule support. If the local node
> +                        * is in permitted set then do not redirect
> +                        * to a particular node.
> +                        * Otherwise we apply the memory policy to get
> +                        * the node we need to allocate on.
> +                        */
> +                       if (mpol->mode != MPOL_BIND ||
> +                                       !node_isset(numa_mem_id(), mpol->nodes))
> +
> +                               node = mempolicy_slab_node();
> +               }
>         }
>  #endif
>
> +       node_requested = IS_ENABLED(CONFIG_NUMA) && node != NUMA_NO_NODE;
> +
> +       /*
> +        * We assume the percpu sheaves contain only local objects although it's
> +        * not completely guaranteed, so we verify later.
> +        */
> +       if (unlikely(node_requested && node != numa_mem_id()))
> +               return NULL;
> +
>         if (!local_trylock(&s->cpu_sheaves->lock))
>                 return NULL;
>
> @@ -4911,7 +4936,21 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
>                         return NULL;
>         }
>
> -       object = pcs->main->objects[--pcs->main->size];
> +       object = pcs->main->objects[pcs->main->size - 1];
> +
> +       if (unlikely(node_requested)) {
> +               /*
> +                * Verify that the object was from the node we want. This could
> +                * be false because of cpu migration during an unlocked part of
> +                * the current allocation or previous freeing process.
> +                */
> +               if (folio_nid(virt_to_folio(object)) != node) {
> +                       local_unlock(&s->cpu_sheaves->lock);
> +                       return NULL;
> +               }
> +       }
> +
> +       pcs->main->size--;
>
>         local_unlock(&s->cpu_sheaves->lock);
>
> @@ -5011,8 +5050,8 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
>         if (unlikely(object))
>                 goto out;
>
> -       if (s->cpu_sheaves && node == NUMA_NO_NODE)
> -               object = alloc_from_pcs(s, gfpflags);
> +       if (s->cpu_sheaves)
> +               object = alloc_from_pcs(s, gfpflags, node);
>
>         if (!object)
>                 object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
>
> --
> 2.51.0
>


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 11/23] maple_tree: Drop bulk insert support
  2025-09-10  8:01 ` [PATCH v8 11/23] maple_tree: Drop bulk insert support Vlastimil Babka
@ 2025-09-25 16:38   ` Suren Baghdasaryan
  0 siblings, 0 replies; 95+ messages in thread
From: Suren Baghdasaryan @ 2025-09-25 16:38 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree

On Wed, Sep 10, 2025 at 1:01 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> From: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>
> Bulk insert mode was added to facilitate forking faster, but forking now
> uses __mt_dup() to duplicate the tree.
>
> The addition of sheaves has made the bulk allocations difficult to
> maintain, since the expected entries would be preallocated into the maple
> state.  A big part of the maple state node allocation was the ability to
> push nodes back onto the state for later use, which was essential to the
> bulk insert algorithm.
>
> Remove mas_expected_entries() and mas_destroy_rebalance() functions as
> well as the MA_STATE_BULK and MA_STATE_REBALANCE maple state flags since
> there are no users anymore.  Drop the associated testing as well.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  lib/maple_tree.c                 | 270 +--------------------------------------
>  lib/test_maple_tree.c            | 137 --------------------
>  tools/testing/radix-tree/maple.c |  36 ------
>  3 files changed, 4 insertions(+), 439 deletions(-)

Awesome!

>
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index 38fb68c082915211c80f473d313159599fe97e2c..4f0e30b57b0cef9e5cf791f3f64f5898752db402 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -83,13 +83,9 @@
>
>  /*
>   * Maple state flags
> - * * MA_STATE_BULK             - Bulk insert mode
> - * * MA_STATE_REBALANCE                - Indicate a rebalance during bulk insert
>   * * MA_STATE_PREALLOC         - Preallocated nodes, WARN_ON allocation
>   */
> -#define MA_STATE_BULK          1
> -#define MA_STATE_REBALANCE     2
> -#define MA_STATE_PREALLOC      4
> +#define MA_STATE_PREALLOC      1
>
>  #define ma_parent_ptr(x) ((struct maple_pnode *)(x))
>  #define mas_tree_parent(x) ((unsigned long)(x->tree) | MA_ROOT_PARENT)
> @@ -1031,24 +1027,6 @@ static inline void mas_descend(struct ma_state *mas)
>         mas->node = mas_slot(mas, slots, mas->offset);
>  }
>
> -/*
> - * mte_set_gap() - Set a maple node gap.
> - * @mn: The encoded maple node
> - * @gap: The offset of the gap to set
> - * @val: The gap value
> - */
> -static inline void mte_set_gap(const struct maple_enode *mn,
> -                                unsigned char gap, unsigned long val)
> -{
> -       switch (mte_node_type(mn)) {
> -       default:
> -               break;
> -       case maple_arange_64:
> -               mte_to_node(mn)->ma64.gap[gap] = val;
> -               break;
> -       }
> -}
> -
>  /*
>   * mas_ascend() - Walk up a level of the tree.
>   * @mas: The maple state
> @@ -1878,21 +1856,7 @@ static inline int mab_calc_split(struct ma_state *mas,
>          * end on a NULL entry, with the exception of the left-most leaf.  The
>          * limitation means that the split of a node must be checked for this condition
>          * and be able to put more data in one direction or the other.
> -        */
> -       if (unlikely((mas->mas_flags & MA_STATE_BULK))) {
> -               *mid_split = 0;
> -               split = b_end - mt_min_slots[bn->type];
> -
> -               if (!ma_is_leaf(bn->type))
> -                       return split;
> -
> -               mas->mas_flags |= MA_STATE_REBALANCE;
> -               if (!bn->slot[split])
> -                       split--;
> -               return split;
> -       }
> -
> -       /*
> +        *
>          * Although extremely rare, it is possible to enter what is known as the 3-way
>          * split scenario.  The 3-way split comes about by means of a store of a range
>          * that overwrites the end and beginning of two full nodes.  The result is a set
> @@ -2039,27 +2003,6 @@ static inline void mab_mas_cp(struct maple_big_node *b_node,
>         }
>  }
>
> -/*
> - * mas_bulk_rebalance() - Rebalance the end of a tree after a bulk insert.
> - * @mas: The maple state
> - * @end: The maple node end
> - * @mt: The maple node type
> - */
> -static inline void mas_bulk_rebalance(struct ma_state *mas, unsigned char end,
> -                                     enum maple_type mt)
> -{
> -       if (!(mas->mas_flags & MA_STATE_BULK))
> -               return;
> -
> -       if (mte_is_root(mas->node))
> -               return;
> -
> -       if (end > mt_min_slots[mt]) {
> -               mas->mas_flags &= ~MA_STATE_REBALANCE;
> -               return;
> -       }
> -}
> -
>  /*
>   * mas_store_b_node() - Store an @entry into the b_node while also copying the
>   * data from a maple encoded node.
> @@ -2109,9 +2052,6 @@ static noinline_for_kasan void mas_store_b_node(struct ma_wr_state *wr_mas,
>         /* Handle new range ending before old range ends */
>         piv = mas_safe_pivot(mas, wr_mas->pivots, offset_end, wr_mas->type);
>         if (piv > mas->last) {
> -               if (piv == ULONG_MAX)
> -                       mas_bulk_rebalance(mas, b_node->b_end, wr_mas->type);
> -
>                 if (offset_end != slot)
>                         wr_mas->content = mas_slot_locked(mas, wr_mas->slots,
>                                                           offset_end);
> @@ -3011,126 +2951,6 @@ static inline void mas_rebalance(struct ma_state *mas,
>         return mas_spanning_rebalance(mas, &mast, empty_count);
>  }
>
> -/*
> - * mas_destroy_rebalance() - Rebalance left-most node while destroying the maple
> - * state.
> - * @mas: The maple state
> - * @end: The end of the left-most node.
> - *
> - * During a mass-insert event (such as forking), it may be necessary to
> - * rebalance the left-most node when it is not sufficient.
> - */
> -static inline void mas_destroy_rebalance(struct ma_state *mas, unsigned char end)
> -{
> -       enum maple_type mt = mte_node_type(mas->node);
> -       struct maple_node reuse, *newnode, *parent, *new_left, *left, *node;
> -       struct maple_enode *eparent, *old_eparent;
> -       unsigned char offset, tmp, split = mt_slots[mt] / 2;
> -       void __rcu **l_slots, **slots;
> -       unsigned long *l_pivs, *pivs, gap;
> -       bool in_rcu = mt_in_rcu(mas->tree);
> -       unsigned char new_height = mas_mt_height(mas);
> -
> -       MA_STATE(l_mas, mas->tree, mas->index, mas->last);
> -
> -       l_mas = *mas;
> -       mas_prev_sibling(&l_mas);
> -
> -       /* set up node. */
> -       if (in_rcu) {
> -               newnode = mas_pop_node(mas);
> -       } else {
> -               newnode = &reuse;
> -       }
> -
> -       node = mas_mn(mas);
> -       newnode->parent = node->parent;
> -       slots = ma_slots(newnode, mt);
> -       pivs = ma_pivots(newnode, mt);
> -       left = mas_mn(&l_mas);
> -       l_slots = ma_slots(left, mt);
> -       l_pivs = ma_pivots(left, mt);
> -       if (!l_slots[split])
> -               split++;
> -       tmp = mas_data_end(&l_mas) - split;
> -
> -       memcpy(slots, l_slots + split + 1, sizeof(void *) * tmp);
> -       memcpy(pivs, l_pivs + split + 1, sizeof(unsigned long) * tmp);
> -       pivs[tmp] = l_mas.max;
> -       memcpy(slots + tmp, ma_slots(node, mt), sizeof(void *) * end);
> -       memcpy(pivs + tmp, ma_pivots(node, mt), sizeof(unsigned long) * end);
> -
> -       l_mas.max = l_pivs[split];
> -       mas->min = l_mas.max + 1;
> -       old_eparent = mt_mk_node(mte_parent(l_mas.node),
> -                            mas_parent_type(&l_mas, l_mas.node));
> -       tmp += end;
> -       if (!in_rcu) {
> -               unsigned char max_p = mt_pivots[mt];
> -               unsigned char max_s = mt_slots[mt];
> -
> -               if (tmp < max_p)
> -                       memset(pivs + tmp, 0,
> -                              sizeof(unsigned long) * (max_p - tmp));
> -
> -               if (tmp < mt_slots[mt])
> -                       memset(slots + tmp, 0, sizeof(void *) * (max_s - tmp));
> -
> -               memcpy(node, newnode, sizeof(struct maple_node));
> -               ma_set_meta(node, mt, 0, tmp - 1);
> -               mte_set_pivot(old_eparent, mte_parent_slot(l_mas.node),
> -                             l_pivs[split]);
> -
> -               /* Remove data from l_pivs. */
> -               tmp = split + 1;
> -               memset(l_pivs + tmp, 0, sizeof(unsigned long) * (max_p - tmp));
> -               memset(l_slots + tmp, 0, sizeof(void *) * (max_s - tmp));
> -               ma_set_meta(left, mt, 0, split);
> -               eparent = old_eparent;
> -
> -               goto done;
> -       }
> -
> -       /* RCU requires replacing both l_mas, mas, and parent. */
> -       mas->node = mt_mk_node(newnode, mt);
> -       ma_set_meta(newnode, mt, 0, tmp);
> -
> -       new_left = mas_pop_node(mas);
> -       new_left->parent = left->parent;
> -       mt = mte_node_type(l_mas.node);
> -       slots = ma_slots(new_left, mt);
> -       pivs = ma_pivots(new_left, mt);
> -       memcpy(slots, l_slots, sizeof(void *) * split);
> -       memcpy(pivs, l_pivs, sizeof(unsigned long) * split);
> -       ma_set_meta(new_left, mt, 0, split);
> -       l_mas.node = mt_mk_node(new_left, mt);
> -
> -       /* replace parent. */
> -       offset = mte_parent_slot(mas->node);
> -       mt = mas_parent_type(&l_mas, l_mas.node);
> -       parent = mas_pop_node(mas);
> -       slots = ma_slots(parent, mt);
> -       pivs = ma_pivots(parent, mt);
> -       memcpy(parent, mte_to_node(old_eparent), sizeof(struct maple_node));
> -       rcu_assign_pointer(slots[offset], mas->node);
> -       rcu_assign_pointer(slots[offset - 1], l_mas.node);
> -       pivs[offset - 1] = l_mas.max;
> -       eparent = mt_mk_node(parent, mt);
> -done:
> -       gap = mas_leaf_max_gap(mas);
> -       mte_set_gap(eparent, mte_parent_slot(mas->node), gap);
> -       gap = mas_leaf_max_gap(&l_mas);
> -       mte_set_gap(eparent, mte_parent_slot(l_mas.node), gap);
> -       mas_ascend(mas);
> -
> -       if (in_rcu) {
> -               mas_replace_node(mas, old_eparent, new_height);
> -               mas_adopt_children(mas, mas->node);
> -       }
> -
> -       mas_update_gap(mas);
> -}
> -
>  /*
>   * mas_split_final_node() - Split the final node in a subtree operation.
>   * @mast: the maple subtree state
> @@ -3837,8 +3657,6 @@ static inline void mas_wr_node_store(struct ma_wr_state *wr_mas,
>
>         if (mas->last == wr_mas->end_piv)
>                 offset_end++; /* don't copy this offset */
> -       else if (unlikely(wr_mas->r_max == ULONG_MAX))
> -               mas_bulk_rebalance(mas, mas->end, wr_mas->type);
>
>         /* set up node. */
>         if (in_rcu) {
> @@ -4255,7 +4073,7 @@ static inline enum store_type mas_wr_store_type(struct ma_wr_state *wr_mas)
>         new_end = mas_wr_new_end(wr_mas);
>         /* Potential spanning rebalance collapsing a node */
>         if (new_end < mt_min_slots[wr_mas->type]) {
> -               if (!mte_is_root(mas->node) && !(mas->mas_flags & MA_STATE_BULK))
> +               if (!mte_is_root(mas->node))
>                         return  wr_rebalance;
>                 return wr_node_store;
>         }
> @@ -5562,25 +5380,7 @@ void mas_destroy(struct ma_state *mas)
>         struct maple_alloc *node;
>         unsigned long total;
>
> -       /*
> -        * When using mas_for_each() to insert an expected number of elements,
> -        * it is possible that the number inserted is less than the expected
> -        * number.  To fix an invalid final node, a check is performed here to
> -        * rebalance the previous node with the final node.
> -        */
> -       if (mas->mas_flags & MA_STATE_REBALANCE) {
> -               unsigned char end;
> -               if (mas_is_err(mas))
> -                       mas_reset(mas);
> -               mas_start(mas);
> -               mtree_range_walk(mas);
> -               end = mas->end + 1;
> -               if (end < mt_min_slot_count(mas->node) - 1)
> -                       mas_destroy_rebalance(mas, end);
> -
> -               mas->mas_flags &= ~MA_STATE_REBALANCE;
> -       }
> -       mas->mas_flags &= ~(MA_STATE_BULK|MA_STATE_PREALLOC);
> +       mas->mas_flags &= ~MA_STATE_PREALLOC;
>
>         total = mas_allocated(mas);
>         while (total) {
> @@ -5600,68 +5400,6 @@ void mas_destroy(struct ma_state *mas)
>  }
>  EXPORT_SYMBOL_GPL(mas_destroy);
>
> -/*
> - * mas_expected_entries() - Set the expected number of entries that will be inserted.
> - * @mas: The maple state
> - * @nr_entries: The number of expected entries.
> - *
> - * This will attempt to pre-allocate enough nodes to store the expected number
> - * of entries.  The allocations will occur using the bulk allocator interface
> - * for speed.  Please call mas_destroy() on the @mas after inserting the entries
> - * to ensure any unused nodes are freed.
> - *
> - * Return: 0 on success, -ENOMEM if memory could not be allocated.
> - */
> -int mas_expected_entries(struct ma_state *mas, unsigned long nr_entries)
> -{
> -       int nonleaf_cap = MAPLE_ARANGE64_SLOTS - 2;
> -       struct maple_enode *enode = mas->node;
> -       int nr_nodes;
> -       int ret;
> -
> -       /*
> -        * Sometimes it is necessary to duplicate a tree to a new tree, such as
> -        * forking a process and duplicating the VMAs from one tree to a new
> -        * tree.  When such a situation arises, it is known that the new tree is
> -        * not going to be used until the entire tree is populated.  For
> -        * performance reasons, it is best to use a bulk load with RCU disabled.
> -        * This allows for optimistic splitting that favours the left and reuse
> -        * of nodes during the operation.
> -        */
> -
> -       /* Optimize splitting for bulk insert in-order */
> -       mas->mas_flags |= MA_STATE_BULK;
> -
> -       /*
> -        * Avoid overflow, assume a gap between each entry and a trailing null.
> -        * If this is wrong, it just means allocation can happen during
> -        * insertion of entries.
> -        */
> -       nr_nodes = max(nr_entries, nr_entries * 2 + 1);
> -       if (!mt_is_alloc(mas->tree))
> -               nonleaf_cap = MAPLE_RANGE64_SLOTS - 2;
> -
> -       /* Leaves; reduce slots to keep space for expansion */
> -       nr_nodes = DIV_ROUND_UP(nr_nodes, MAPLE_RANGE64_SLOTS - 2);
> -       /* Internal nodes */
> -       nr_nodes += DIV_ROUND_UP(nr_nodes, nonleaf_cap);
> -       /* Add working room for split (2 nodes) + new parents */
> -       mas_node_count_gfp(mas, nr_nodes + 3, GFP_KERNEL);
> -
> -       /* Detect if allocations run out */
> -       mas->mas_flags |= MA_STATE_PREALLOC;
> -
> -       if (!mas_is_err(mas))
> -               return 0;
> -
> -       ret = xa_err(mas->node);
> -       mas->node = enode;
> -       mas_destroy(mas);
> -       return ret;
> -
> -}
> -EXPORT_SYMBOL_GPL(mas_expected_entries);
> -
>  static void mas_may_activate(struct ma_state *mas)
>  {
>         if (!mas->node) {
> diff --git a/lib/test_maple_tree.c b/lib/test_maple_tree.c
> index cb3936595b0d56a9682ff100eba54693a1427829..14fbbee32046a13d54d60dcac2b45be2bd190ac4 100644
> --- a/lib/test_maple_tree.c
> +++ b/lib/test_maple_tree.c
> @@ -2746,139 +2746,6 @@ static noinline void __init check_fuzzer(struct maple_tree *mt)
>         mtree_test_erase(mt, ULONG_MAX - 10);
>  }
>
> -/* duplicate the tree with a specific gap */
> -static noinline void __init check_dup_gaps(struct maple_tree *mt,
> -                                   unsigned long nr_entries, bool zero_start,
> -                                   unsigned long gap)
> -{
> -       unsigned long i = 0;
> -       struct maple_tree newmt;
> -       int ret;
> -       void *tmp;
> -       MA_STATE(mas, mt, 0, 0);
> -       MA_STATE(newmas, &newmt, 0, 0);
> -       struct rw_semaphore newmt_lock;
> -
> -       init_rwsem(&newmt_lock);
> -       mt_set_external_lock(&newmt, &newmt_lock);
> -
> -       if (!zero_start)
> -               i = 1;
> -
> -       mt_zero_nr_tallocated();
> -       for (; i <= nr_entries; i++)
> -               mtree_store_range(mt, i*10, (i+1)*10 - gap,
> -                                 xa_mk_value(i), GFP_KERNEL);
> -
> -       mt_init_flags(&newmt, MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN);
> -       mt_set_non_kernel(99999);
> -       down_write(&newmt_lock);
> -       ret = mas_expected_entries(&newmas, nr_entries);
> -       mt_set_non_kernel(0);
> -       MT_BUG_ON(mt, ret != 0);
> -
> -       rcu_read_lock();
> -       mas_for_each(&mas, tmp, ULONG_MAX) {
> -               newmas.index = mas.index;
> -               newmas.last = mas.last;
> -               mas_store(&newmas, tmp);
> -       }
> -       rcu_read_unlock();
> -       mas_destroy(&newmas);
> -
> -       __mt_destroy(&newmt);
> -       up_write(&newmt_lock);
> -}
> -
> -/* Duplicate many sizes of trees.  Mainly to test expected entry values */
> -static noinline void __init check_dup(struct maple_tree *mt)
> -{
> -       int i;
> -       int big_start = 100010;
> -
> -       /* Check with a value at zero */
> -       for (i = 10; i < 1000; i++) {
> -               mt_init_flags(mt, MT_FLAGS_ALLOC_RANGE);
> -               check_dup_gaps(mt, i, true, 5);
> -               mtree_destroy(mt);
> -               rcu_barrier();
> -       }
> -
> -       cond_resched();
> -       mt_cache_shrink();
> -       /* Check with a value at zero, no gap */
> -       for (i = 1000; i < 2000; i++) {
> -               mt_init_flags(mt, MT_FLAGS_ALLOC_RANGE);
> -               check_dup_gaps(mt, i, true, 0);
> -               mtree_destroy(mt);
> -               rcu_barrier();
> -       }
> -
> -       cond_resched();
> -       mt_cache_shrink();
> -       /* Check with a value at zero and unreasonably large */
> -       for (i = big_start; i < big_start + 10; i++) {
> -               mt_init_flags(mt, MT_FLAGS_ALLOC_RANGE);
> -               check_dup_gaps(mt, i, true, 5);
> -               mtree_destroy(mt);
> -               rcu_barrier();
> -       }
> -
> -       cond_resched();
> -       mt_cache_shrink();
> -       /* Small to medium size not starting at zero*/
> -       for (i = 200; i < 1000; i++) {
> -               mt_init_flags(mt, MT_FLAGS_ALLOC_RANGE);
> -               check_dup_gaps(mt, i, false, 5);
> -               mtree_destroy(mt);
> -               rcu_barrier();
> -       }
> -
> -       cond_resched();
> -       mt_cache_shrink();
> -       /* Unreasonably large not starting at zero*/
> -       for (i = big_start; i < big_start + 10; i++) {
> -               mt_init_flags(mt, MT_FLAGS_ALLOC_RANGE);
> -               check_dup_gaps(mt, i, false, 5);
> -               mtree_destroy(mt);
> -               rcu_barrier();
> -               cond_resched();
> -               mt_cache_shrink();
> -       }
> -
> -       /* Check non-allocation tree not starting at zero */
> -       for (i = 1500; i < 3000; i++) {
> -               mt_init_flags(mt, 0);
> -               check_dup_gaps(mt, i, false, 5);
> -               mtree_destroy(mt);
> -               rcu_barrier();
> -               cond_resched();
> -               if (i % 2 == 0)
> -                       mt_cache_shrink();
> -       }
> -
> -       mt_cache_shrink();
> -       /* Check non-allocation tree starting at zero */
> -       for (i = 200; i < 1000; i++) {
> -               mt_init_flags(mt, 0);
> -               check_dup_gaps(mt, i, true, 5);
> -               mtree_destroy(mt);
> -               rcu_barrier();
> -               cond_resched();
> -       }
> -
> -       mt_cache_shrink();
> -       /* Unreasonably large */
> -       for (i = big_start + 5; i < big_start + 10; i++) {
> -               mt_init_flags(mt, 0);
> -               check_dup_gaps(mt, i, true, 5);
> -               mtree_destroy(mt);
> -               rcu_barrier();
> -               mt_cache_shrink();
> -               cond_resched();
> -       }
> -}
> -
>  static noinline void __init check_bnode_min_spanning(struct maple_tree *mt)
>  {
>         int i = 50;
> @@ -4077,10 +3944,6 @@ static int __init maple_tree_seed(void)
>         check_fuzzer(&tree);
>         mtree_destroy(&tree);
>
> -       mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
> -       check_dup(&tree);
> -       mtree_destroy(&tree);
> -
>         mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
>         check_bnode_min_spanning(&tree);
>         mtree_destroy(&tree);
> diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
> index 172700fb7784d29f9403003b4484a5ebd7aa316b..c0543060dae2510477963331fb0ccdffd78ea965 100644
> --- a/tools/testing/radix-tree/maple.c
> +++ b/tools/testing/radix-tree/maple.c
> @@ -35455,17 +35455,6 @@ static void check_dfs_preorder(struct maple_tree *mt)
>         MT_BUG_ON(mt, count != e);
>         mtree_destroy(mt);
>
> -       mt_init_flags(mt, MT_FLAGS_ALLOC_RANGE);
> -       mas_reset(&mas);
> -       mt_zero_nr_tallocated();
> -       mt_set_non_kernel(200);
> -       mas_expected_entries(&mas, max);
> -       for (count = 0; count <= max; count++) {
> -               mas.index = mas.last = count;
> -               mas_store(&mas, xa_mk_value(count));
> -               MT_BUG_ON(mt, mas_is_err(&mas));
> -       }
> -       mas_destroy(&mas);
>         rcu_barrier();
>         /*
>          * pr_info(" ->seq test of 0-%lu %luK in %d active (%d total)\n",
> @@ -36454,27 +36443,6 @@ static inline int check_vma_modification(struct maple_tree *mt)
>         return 0;
>  }
>
> -/*
> - * test to check that bulk stores do not use wr_rebalance as the store
> - * type.
> - */
> -static inline void check_bulk_rebalance(struct maple_tree *mt)
> -{
> -       MA_STATE(mas, mt, ULONG_MAX, ULONG_MAX);
> -       int max = 10;
> -
> -       build_full_tree(mt, 0, 2);
> -
> -       /* erase every entry in the tree */
> -       do {
> -               /* set up bulk store mode */
> -               mas_expected_entries(&mas, max);
> -               mas_erase(&mas);
> -               MT_BUG_ON(mt, mas.store_type == wr_rebalance);
> -       } while (mas_prev(&mas, 0) != NULL);
> -
> -       mas_destroy(&mas);
> -}
>
>  void farmer_tests(void)
>  {
> @@ -36487,10 +36455,6 @@ void farmer_tests(void)
>         check_vma_modification(&tree);
>         mtree_destroy(&tree);
>
> -       mt_init(&tree);
> -       check_bulk_rebalance(&tree);
> -       mtree_destroy(&tree);
> -
>         tree.ma_root = xa_mk_value(0);
>         mt_dump(&tree, mt_dump_dec);
>
>
> --
> 2.51.0
>


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 12/23] tools/testing/vma: Implement vm_refcnt reset
  2025-09-10  8:01 ` [PATCH v8 12/23] tools/testing/vma: Implement vm_refcnt reset Vlastimil Babka
@ 2025-09-25 16:38   ` Suren Baghdasaryan
  0 siblings, 0 replies; 95+ messages in thread
From: Suren Baghdasaryan @ 2025-09-25 16:38 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree

On Wed, Sep 10, 2025 at 1:01 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> From: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>
> Add the reset of the ref count in vma_lock_init().  This is needed if
> the vma memory is not zeroed on allocation.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>



> ---
>  tools/testing/vma/vma_internal.h | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index f8cf5b184d5b51dd627ff440943a7af3c549f482..6b6e2b05918c9f95b537f26e20a943b34082825a 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -1373,6 +1373,8 @@ static inline void ksm_exit(struct mm_struct *mm)
>
>  static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt)
>  {
> +       if (reset_refcnt)
> +               refcount_set(&vma->vm_refcnt, 0);
>  }
>
>  static inline void vma_numab_state_init(struct vm_area_struct *vma)
>
> --
> 2.51.0
>


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-09-25  4:35                     ` Suren Baghdasaryan
  2025-09-25  8:52                       ` Harry Yoo
@ 2025-09-26 10:08                       ` Vlastimil Babka
  2025-09-26 15:41                         ` Suren Baghdasaryan
  1 sibling, 1 reply; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-26 10:08 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Harry Yoo, Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Uladzislau Rezki, Sidhartha Kumar, linux-mm,
	linux-kernel, rcu, maple-tree, Paul E . McKenney

On 9/25/25 06:35, Suren Baghdasaryan wrote:
> On Thu, Sep 18, 2025 at 1:09 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> Expected advantages:
>> - batching the kfree_rcu() operations, that could eventually replace the
>>   existing batching
>> - sheaves can be reused for allocations via barn instead of being
>>   flushed to slabs, which is more efficient
>>   - this includes cases where only some cpus are allowed to process rcu
>>     callbacks (Android)
> 
> nit: I would say it's more CONFIG_RCU_NOCB_CPU related. Android is
> just an instance of that.

OK changed that.

Changes due to your other suggestions:

diff --git a/mm/slub.c b/mm/slub.c
index 8220ce095970..fec0cdc7ef37 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3945,15 +3945,12 @@ void flush_all_rcu_sheaves(void)
                         */
 
                        INIT_WORK(&sfw->work, flush_rcu_sheaf);
-                       sfw->skip = false;
                        sfw->s = s;
                        queue_work_on(cpu, flushwq, &sfw->work);
                }
 
                for_each_online_cpu(cpu) {
                        sfw = &per_cpu(slub_flush, cpu);
-                       if (sfw->skip)
-                               continue;
                        flush_work(&sfw->work);
                }
 
@@ -5643,6 +5640,10 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
 
        rcu_sheaf = pcs->rcu_free;
 
+       /*
+        * Since we flush immediately when size reaches capacity, we never reach
+        * this with size already at capacity, so no OOB write is possible.
+        */
        rcu_sheaf->objects[rcu_sheaf->size++] = obj;
 
        if (likely(rcu_sheaf->size < s->sheaf_capacity))



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-09-26 10:08                       ` Vlastimil Babka
@ 2025-09-26 15:41                         ` Suren Baghdasaryan
  0 siblings, 0 replies; 95+ messages in thread
From: Suren Baghdasaryan @ 2025-09-26 15:41 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Harry Yoo, Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Uladzislau Rezki, Sidhartha Kumar, linux-mm,
	linux-kernel, rcu, maple-tree, Paul E . McKenney

On Fri, Sep 26, 2025 at 3:08 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 9/25/25 06:35, Suren Baghdasaryan wrote:
> > On Thu, Sep 18, 2025 at 1:09 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> Expected advantages:
> >> - batching the kfree_rcu() operations, that could eventually replace the
> >>   existing batching
> >> - sheaves can be reused for allocations via barn instead of being
> >>   flushed to slabs, which is more efficient
> >>   - this includes cases where only some cpus are allowed to process rcu
> >>     callbacks (Android)
> >
> > nit: I would say it's more CONFIG_RCU_NOCB_CPU related. Android is
> > just an instance of that.
>
> OK changed that.
>
> Changes due to your other suggestions:
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 8220ce095970..fec0cdc7ef37 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3945,15 +3945,12 @@ void flush_all_rcu_sheaves(void)
>                          */
>
>                         INIT_WORK(&sfw->work, flush_rcu_sheaf);
> -                       sfw->skip = false;
>                         sfw->s = s;
>                         queue_work_on(cpu, flushwq, &sfw->work);
>                 }
>
>                 for_each_online_cpu(cpu) {
>                         sfw = &per_cpu(slub_flush, cpu);
> -                       if (sfw->skip)
> -                               continue;
>                         flush_work(&sfw->work);
>                 }
>
> @@ -5643,6 +5640,10 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
>
>         rcu_sheaf = pcs->rcu_free;
>
> +       /*
> +        * Since we flush immediately when size reaches capacity, we never reach
> +        * this with size already at capacity, so no OOB write is possible.
> +        */

Perfect!

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

>         rcu_sheaf->objects[rcu_sheaf->size++] = obj;
>
>         if (likely(rcu_sheaf->size < s->sheaf_capacity))
>


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 13/23] tools/testing: Add support for changes to slab for sheaves
  2025-09-10  8:01 ` [PATCH v8 13/23] tools/testing: Add support for changes to slab for sheaves Vlastimil Babka
@ 2025-09-26 23:28   ` Suren Baghdasaryan
  0 siblings, 0 replies; 95+ messages in thread
From: Suren Baghdasaryan @ 2025-09-26 23:28 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree

On Wed, Sep 10, 2025 at 1:01 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> The slab changes for sheaves require more effort in the testing code.
> Unite all the kmem_cache work into the tools/include slab header for
> both the vma and maple tree testing.
>
> The vma test code also requires importing more #defines to allow for
> seamless use of the shared kmem_cache code.
>
> This adds the pthread header to the slab header in the tools directory
> to allow for the pthread_mutex in linux.c.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

The patch does several things and could be split in 3 (code
refactoring, kmem_cache_create change, new definitions like
_slab_flag_bits) but I don't think it's worth a respin.
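
As an aside for anyone following the _Generic dispatch in the new header:
both calling conventions keep compiling. A minimal usage sketch (cache
names, sizes and flags below are made up for illustration, not taken from
the patch):

	struct kmem_cache_args args = {
		.align		= 64,
		.sheaf_capacity	= 32,
	};
	/* new style: pass a struct kmem_cache_args pointer */
	struct kmem_cache *a = kmem_cache_create("test_cache_a", 256,
						 &args, SLAB_PANIC);
	/* legacy style: align/flags/ctor go through the default: arm */
	struct kmem_cache *b = kmem_cache_create("test_cache_b", 128, 64,
						 SLAB_PANIC, NULL);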

> ---
>  tools/include/linux/slab.h        | 137 ++++++++++++++++++++++++++++++++++++--
>  tools/testing/shared/linux.c      |  26 ++------
>  tools/testing/shared/maple-shim.c |   1 +
>  tools/testing/vma/vma_internal.h  |  92 +------------------------
>  4 files changed, 142 insertions(+), 114 deletions(-)
>
> diff --git a/tools/include/linux/slab.h b/tools/include/linux/slab.h
> index c87051e2b26f5a7fee0362697fae067076b8e84d..c5c5cc6db5668be2cc94c29065ccfa7ca7b4bb08 100644
> --- a/tools/include/linux/slab.h
> +++ b/tools/include/linux/slab.h
> @@ -4,11 +4,31 @@
>
>  #include <linux/types.h>
>  #include <linux/gfp.h>
> +#include <pthread.h>
>
> -#define SLAB_PANIC 2
>  #define SLAB_RECLAIM_ACCOUNT    0x00020000UL            /* Objects are reclaimable */
>
>  #define kzalloc_node(size, flags, node) kmalloc(size, flags)
> +enum _slab_flag_bits {
> +       _SLAB_KMALLOC,
> +       _SLAB_HWCACHE_ALIGN,
> +       _SLAB_PANIC,
> +       _SLAB_TYPESAFE_BY_RCU,
> +       _SLAB_ACCOUNT,
> +       _SLAB_FLAGS_LAST_BIT
> +};
> +
> +#define __SLAB_FLAG_BIT(nr)    ((unsigned int __force)(1U << (nr)))
> +#define __SLAB_FLAG_UNUSED     ((unsigned int __force)(0U))
> +
> +#define SLAB_HWCACHE_ALIGN     __SLAB_FLAG_BIT(_SLAB_HWCACHE_ALIGN)
> +#define SLAB_PANIC             __SLAB_FLAG_BIT(_SLAB_PANIC)
> +#define SLAB_TYPESAFE_BY_RCU   __SLAB_FLAG_BIT(_SLAB_TYPESAFE_BY_RCU)
> +#ifdef CONFIG_MEMCG
> +# define SLAB_ACCOUNT          __SLAB_FLAG_BIT(_SLAB_ACCOUNT)
> +#else
> +# define SLAB_ACCOUNT          __SLAB_FLAG_UNUSED
> +#endif
>
>  void *kmalloc(size_t size, gfp_t gfp);
>  void kfree(void *p);
> @@ -23,6 +43,86 @@ enum slab_state {
>         FULL
>  };
>
> +struct kmem_cache {
> +       pthread_mutex_t lock;
> +       unsigned int size;
> +       unsigned int align;
> +       unsigned int sheaf_capacity;
> +       int nr_objs;
> +       void *objs;
> +       void (*ctor)(void *);
> +       bool non_kernel_enabled;
> +       unsigned int non_kernel;
> +       unsigned long nr_allocated;
> +       unsigned long nr_tallocated;
> +       bool exec_callback;
> +       void (*callback)(void *);
> +       void *private;
> +};
> +
> +struct kmem_cache_args {
> +       /**
> +        * @align: The required alignment for the objects.
> +        *
> +        * %0 means no specific alignment is requested.
> +        */
> +       unsigned int align;
> +       /**
> +        * @sheaf_capacity: The maximum size of the sheaf.
> +        */
> +       unsigned int sheaf_capacity;
> +       /**
> +        * @useroffset: Usercopy region offset.
> +        *
> +        * %0 is a valid offset, when @usersize is non-%0
> +        */
> +       unsigned int useroffset;
> +       /**
> +        * @usersize: Usercopy region size.
> +        *
> +        * %0 means no usercopy region is specified.
> +        */
> +       unsigned int usersize;
> +       /**
> +        * @freeptr_offset: Custom offset for the free pointer
> +        * in &SLAB_TYPESAFE_BY_RCU caches
> +        *
> +        * By default &SLAB_TYPESAFE_BY_RCU caches place the free pointer
> +        * outside of the object. This might cause the object to grow in size.
> +        * Cache creators that have a reason to avoid this can specify a custom
> +        * free pointer offset in their struct where the free pointer will be
> +        * placed.
> +        *
> +        * Note that placing the free pointer inside the object requires the
> +        * caller to ensure that no fields are invalidated that are required to
> +        * guard against object recycling (See &SLAB_TYPESAFE_BY_RCU for
> +        * details).
> +        *
> +        * Using %0 as a value for @freeptr_offset is valid. If @freeptr_offset
> +        * is specified, %use_freeptr_offset must be set %true.
> +        *
> +        * Note that @ctor currently isn't supported with custom free pointers
> +        * as a @ctor requires an external free pointer.
> +        */
> +       unsigned int freeptr_offset;
> +       /**
> +        * @use_freeptr_offset: Whether a @freeptr_offset is used.
> +        */
> +       bool use_freeptr_offset;
> +       /**
> +        * @ctor: A constructor for the objects.
> +        *
> +        * The constructor is invoked for each object in a newly allocated slab
> +        * page. It is the cache user's responsibility to free object in the
> +        * same state as after calling the constructor, or deal appropriately
> +        * with any differences between a freshly constructed and a reallocated
> +        * object.
> +        *
> +        * %NULL means no constructor.
> +        */
> +       void (*ctor)(void *);
> +};
> +
>  static inline void *kzalloc(size_t size, gfp_t gfp)
>  {
>         return kmalloc(size, gfp | __GFP_ZERO);
> @@ -37,9 +137,38 @@ static inline void *kmem_cache_alloc(struct kmem_cache *cachep, int flags)
>  }
>  void kmem_cache_free(struct kmem_cache *cachep, void *objp);
>
> -struct kmem_cache *kmem_cache_create(const char *name, unsigned int size,
> -                       unsigned int align, unsigned int flags,
> -                       void (*ctor)(void *));
> +
> +struct kmem_cache *
> +__kmem_cache_create_args(const char *name, unsigned int size,
> +               struct kmem_cache_args *args, unsigned int flags);
> +
> +/* If NULL is passed for @args, use this variant with default arguments. */
> +static inline struct kmem_cache *
> +__kmem_cache_default_args(const char *name, unsigned int size,
> +               struct kmem_cache_args *args, unsigned int flags)
> +{
> +       struct kmem_cache_args kmem_default_args = {};
> +
> +       return __kmem_cache_create_args(name, size, &kmem_default_args, flags);
> +}
> +
> +static inline struct kmem_cache *
> +__kmem_cache_create(const char *name, unsigned int size, unsigned int align,
> +               unsigned int flags, void (*ctor)(void *))
> +{
> +       struct kmem_cache_args kmem_args = {
> +               .align  = align,
> +               .ctor   = ctor,
> +       };
> +
> +       return __kmem_cache_create_args(name, size, &kmem_args, flags);
> +}
> +
> +#define kmem_cache_create(__name, __object_size, __args, ...)           \
> +       _Generic((__args),                                              \
> +               struct kmem_cache_args *: __kmem_cache_create_args,     \
> +               void *: __kmem_cache_default_args,                      \
> +               default: __kmem_cache_create)(__name, __object_size, __args, __VA_ARGS__)
>
>  void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t size, void **list);
>  int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
> diff --git a/tools/testing/shared/linux.c b/tools/testing/shared/linux.c
> index 0f97fb0d19e19c327aa4843a35b45cc086f4f366..97b8412ccbb6d222604c7b397c53c65618d8d51b 100644
> --- a/tools/testing/shared/linux.c
> +++ b/tools/testing/shared/linux.c
> @@ -16,21 +16,6 @@ int nr_allocated;
>  int preempt_count;
>  int test_verbose;
>
> -struct kmem_cache {
> -       pthread_mutex_t lock;
> -       unsigned int size;
> -       unsigned int align;
> -       int nr_objs;
> -       void *objs;
> -       void (*ctor)(void *);
> -       unsigned int non_kernel;
> -       unsigned long nr_allocated;
> -       unsigned long nr_tallocated;
> -       bool exec_callback;
> -       void (*callback)(void *);
> -       void *private;
> -};
> -
>  void kmem_cache_set_callback(struct kmem_cache *cachep, void (*callback)(void *))
>  {
>         cachep->callback = callback;
> @@ -234,23 +219,26 @@ int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
>  }
>
>  struct kmem_cache *
> -kmem_cache_create(const char *name, unsigned int size, unsigned int align,
> -               unsigned int flags, void (*ctor)(void *))
> +__kmem_cache_create_args(const char *name, unsigned int size,
> +                         struct kmem_cache_args *args,
> +                         unsigned int flags)
>  {
>         struct kmem_cache *ret = malloc(sizeof(*ret));
>
>         pthread_mutex_init(&ret->lock, NULL);
>         ret->size = size;
> -       ret->align = align;
> +       ret->align = args->align;
> +       ret->sheaf_capacity = args->sheaf_capacity;
>         ret->nr_objs = 0;
>         ret->nr_allocated = 0;
>         ret->nr_tallocated = 0;
>         ret->objs = NULL;
> -       ret->ctor = ctor;
> +       ret->ctor = args->ctor;
>         ret->non_kernel = 0;
>         ret->exec_callback = false;
>         ret->callback = NULL;
>         ret->private = NULL;
> +
>         return ret;
>  }
>
> diff --git a/tools/testing/shared/maple-shim.c b/tools/testing/shared/maple-shim.c
> index 640df76f483e09f3b6f85612786060dd273e2362..9d7b743415660305416e972fa75b56824211b0eb 100644
> --- a/tools/testing/shared/maple-shim.c
> +++ b/tools/testing/shared/maple-shim.c
> @@ -3,5 +3,6 @@
>  /* Very simple shim around the maple tree. */
>
>  #include "maple-shared.h"
> +#include <linux/slab.h>
>
>  #include "../../../lib/maple_tree.c"
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index 6b6e2b05918c9f95b537f26e20a943b34082825a..d5b87fa6a133f6d676488de2538c509e0f0e1d54 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -26,6 +26,7 @@
>  #include <linux/mm.h>
>  #include <linux/rbtree.h>
>  #include <linux/refcount.h>
> +#include <linux/slab.h>
>
>  extern unsigned long stack_guard_gap;
>  #ifdef CONFIG_MMU
> @@ -509,65 +510,6 @@ struct pagetable_move_control {
>                 .len_in = len_,                                         \
>         }
>
> -struct kmem_cache_args {
> -       /**
> -        * @align: The required alignment for the objects.
> -        *
> -        * %0 means no specific alignment is requested.
> -        */
> -       unsigned int align;
> -       /**
> -        * @useroffset: Usercopy region offset.
> -        *
> -        * %0 is a valid offset, when @usersize is non-%0
> -        */
> -       unsigned int useroffset;
> -       /**
> -        * @usersize: Usercopy region size.
> -        *
> -        * %0 means no usercopy region is specified.
> -        */
> -       unsigned int usersize;
> -       /**
> -        * @freeptr_offset: Custom offset for the free pointer
> -        * in &SLAB_TYPESAFE_BY_RCU caches
> -        *
> -        * By default &SLAB_TYPESAFE_BY_RCU caches place the free pointer
> -        * outside of the object. This might cause the object to grow in size.
> -        * Cache creators that have a reason to avoid this can specify a custom
> -        * free pointer offset in their struct where the free pointer will be
> -        * placed.
> -        *
> -        * Note that placing the free pointer inside the object requires the
> -        * caller to ensure that no fields are invalidated that are required to
> -        * guard against object recycling (See &SLAB_TYPESAFE_BY_RCU for
> -        * details).
> -        *
> -        * Using %0 as a value for @freeptr_offset is valid. If @freeptr_offset
> -        * is specified, %use_freeptr_offset must be set %true.
> -        *
> -        * Note that @ctor currently isn't supported with custom free pointers
> -        * as a @ctor requires an external free pointer.
> -        */
> -       unsigned int freeptr_offset;
> -       /**
> -        * @use_freeptr_offset: Whether a @freeptr_offset is used.
> -        */
> -       bool use_freeptr_offset;
> -       /**
> -        * @ctor: A constructor for the objects.
> -        *
> -        * The constructor is invoked for each object in a newly allocated slab
> -        * page. It is the cache user's responsibility to free object in the
> -        * same state as after calling the constructor, or deal appropriately
> -        * with any differences between a freshly constructed and a reallocated
> -        * object.
> -        *
> -        * %NULL means no constructor.
> -        */
> -       void (*ctor)(void *);
> -};
> -
>  static inline void vma_iter_invalidate(struct vma_iterator *vmi)
>  {
>         mas_pause(&vmi->mas);
> @@ -652,38 +594,6 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
>         vma->vm_lock_seq = UINT_MAX;
>  }
>
> -struct kmem_cache {
> -       const char *name;
> -       size_t object_size;
> -       struct kmem_cache_args *args;
> -};
> -
> -static inline struct kmem_cache *__kmem_cache_create(const char *name,
> -                                                    size_t object_size,
> -                                                    struct kmem_cache_args *args)
> -{
> -       struct kmem_cache *ret = malloc(sizeof(struct kmem_cache));
> -
> -       ret->name = name;
> -       ret->object_size = object_size;
> -       ret->args = args;
> -
> -       return ret;
> -}
> -
> -#define kmem_cache_create(__name, __object_size, __args, ...)           \
> -       __kmem_cache_create((__name), (__object_size), (__args))
> -
> -static inline void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
> -{
> -       return calloc(1, s->object_size);
> -}
> -
> -static inline void kmem_cache_free(struct kmem_cache *s, void *x)
> -{
> -       free(x);
> -}
> -
>  /*
>   * These are defined in vma.h, but sadly vm_stat_account() is referenced by
>   * kernel/fork.c, so we have to these broadly available there, and temporarily
>
> --
> 2.51.0
>


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 16/23] tools/testing: include maple-shim.c in maple.c
  2025-09-10  8:01 ` [PATCH v8 16/23] tools/testing: include maple-shim.c in maple.c Vlastimil Babka
@ 2025-09-26 23:45   ` Suren Baghdasaryan
  0 siblings, 0 replies; 95+ messages in thread
From: Suren Baghdasaryan @ 2025-09-26 23:45 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree

On Wed, Sep 10, 2025 at 1:01 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> There's some duplicated code, and we are about to add more functionality
> to maple-shared.h that will need to be available in the userspace maple
> test, so include it via maple-shim.c
>
> Co-developed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  tools/testing/radix-tree/maple.c | 12 +++---------
>  1 file changed, 3 insertions(+), 9 deletions(-)
>
> diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
> index c0543060dae2510477963331fb0ccdffd78ea965..4a35e1e7c64b7ce347cbd1693beeaacb0c4c330e 100644
> --- a/tools/testing/radix-tree/maple.c
> +++ b/tools/testing/radix-tree/maple.c
> @@ -8,14 +8,6 @@
>   * difficult to handle in kernel tests.
>   */
>
> -#define CONFIG_DEBUG_MAPLE_TREE
> -#define CONFIG_MAPLE_SEARCH
> -#define MAPLE_32BIT (MAPLE_NODE_SLOTS > 31)
> -#include "test.h"
> -#include <stdlib.h>
> -#include <time.h>
> -#include <linux/init.h>
> -
>  #define module_init(x)
>  #define module_exit(x)
>  #define MODULE_AUTHOR(x)
> @@ -23,7 +15,9 @@
>  #define MODULE_LICENSE(x)
>  #define dump_stack()   assert(0)
>
> -#include "../../../lib/maple_tree.c"
> +#include "test.h"
> +
> +#include "../shared/maple-shim.c"
>  #include "../../../lib/test_maple_tree.c"
>
>  #define RCU_RANGE_COUNT 1000
>
> --
> 2.51.0
>


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 17/23] testing/radix-tree/maple: Hack around kfree_rcu not existing
  2025-09-10  8:01 ` [PATCH v8 17/23] testing/radix-tree/maple: Hack around kfree_rcu not existing Vlastimil Babka
@ 2025-09-26 23:53   ` Suren Baghdasaryan
  0 siblings, 0 replies; 95+ messages in thread
From: Suren Baghdasaryan @ 2025-09-26 23:53 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, Pedro Falcato

On Wed, Sep 10, 2025 at 1:01 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> From: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>
> liburcu doesn't have kfree_rcu (or anything similar). Despite that, we
> can hack around it in a trivial fashion, by adding a wrapper.
>
> The wrapper only works for maple_nodes because we cannot get the
> kmem_cache pointer any other way in the test code.
>
> Link: https://lore.kernel.org/all/20250812162124.59417-1-pfalcato@suse.de/
> Suggested-by: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

With one nit below:

> ---
>  tools/testing/shared/maple-shared.h | 11 +++++++++++
>  tools/testing/shared/maple-shim.c   |  6 ++++++
>  2 files changed, 17 insertions(+)
>
> diff --git a/tools/testing/shared/maple-shared.h b/tools/testing/shared/maple-shared.h
> index dc4d30f3860b9bd23b4177c7d7926ac686887815..2a1e9a8594a2834326cd9374738b2a2c7c3f9f7c 100644
> --- a/tools/testing/shared/maple-shared.h
> +++ b/tools/testing/shared/maple-shared.h
> @@ -10,4 +10,15 @@
>  #include <time.h>
>  #include "linux/init.h"
>
> +void maple_rcu_cb(struct rcu_head *head);
> +#define rcu_cb         maple_rcu_cb
> +
> +#define kfree_rcu(_struct, _memb)              \
> +do {                                            \
> +    typeof(_struct) _p_struct = (_struct);      \

Maybe add an assertion that (typeof(_struct) == typeof(struct
maple_node)) to make sure kfree_rcu() is not used for anything else in
the tests?
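
Untested sketch of what I mean, assuming BUILD_BUG_ON() and
__builtin_types_compatible_p() are usable from the shared test headers:

#define kfree_rcu(_struct, _memb)					\
do {									\
	typeof(_struct) _p_struct = (_struct);				\
									\
	/* the wrapper only knows how to free maple nodes */		\
	BUILD_BUG_ON(!__builtin_types_compatible_p(typeof(_p_struct),	\
						   struct maple_node *)); \
	call_rcu(&((_p_struct)->_memb), rcu_cb);			\
} while (0)
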

> +                                                \
> +    call_rcu(&((_p_struct)->_memb), rcu_cb);    \
> +} while(0);
> +
> +
>  #endif /* __MAPLE_SHARED_H__ */
> diff --git a/tools/testing/shared/maple-shim.c b/tools/testing/shared/maple-shim.c
> index 9d7b743415660305416e972fa75b56824211b0eb..16252ee616c0489c80490ff25b8d255427bf9fdc 100644
> --- a/tools/testing/shared/maple-shim.c
> +++ b/tools/testing/shared/maple-shim.c
> @@ -6,3 +6,9 @@
>  #include <linux/slab.h>
>
>  #include "../../../lib/maple_tree.c"
> +
> +void maple_rcu_cb(struct rcu_head *head) {
> +       struct maple_node *node = container_of(head, struct maple_node, rcu);
> +
> +       kmem_cache_free(maple_node_cache, node);
> +}
>
> --
> 2.51.0
>


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 18/23] maple_tree: Use kfree_rcu in ma_free_rcu
  2025-09-17 11:46   ` Harry Yoo
@ 2025-09-27  0:05     ` Suren Baghdasaryan
  0 siblings, 0 replies; 95+ messages in thread
From: Suren Baghdasaryan @ 2025-09-27  0:05 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Vlastimil Babka, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki,
	Sidhartha Kumar, linux-mm, linux-kernel, rcu, maple-tree,
	Pedro Falcato

On Wed, Sep 17, 2025 at 4:46 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Wed, Sep 10, 2025 at 10:01:20AM +0200, Vlastimil Babka wrote:
> > From: Pedro Falcato <pfalcato@suse.de>
> >
> > kfree_rcu is an optimized version of call_rcu + kfree. It used to not be
> > possible to call it on non-kmalloc objects, but this restriction has been
> > lifted since SLOB was dropped from the kernel, and since commit
> > 6c6c47b063b5 ("mm, slab: call kvfree_rcu_barrier() from kmem_cache_destroy()").
> >
> > Thus, replace call_rcu + mt_free_rcu with kfree_rcu.
> >
> > Signed-off-by: Pedro Falcato <pfalcato@suse.de>
> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> > ---
>
> Looks good to me,
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>
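
(For context, the change essentially boils down to replacing the
open-coded callback; roughly, not quoting the patch verbatim:

	/* before */
	call_rcu(&node->rcu, mt_free_rcu);

	/* after */
	kfree_rcu(node, rcu);

where node is a struct maple_node * and rcu is its struct rcu_head
member.)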

>
> --
> Cheers,
> Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 19/23] maple_tree: Replace mt_free_one() with kfree()
  2025-09-10  8:01 ` [PATCH v8 19/23] maple_tree: Replace mt_free_one() with kfree() Vlastimil Babka
@ 2025-09-27  0:06   ` Suren Baghdasaryan
  0 siblings, 0 replies; 95+ messages in thread
From: Suren Baghdasaryan @ 2025-09-27  0:06 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, Pedro Falcato

On Wed, Sep 10, 2025 at 1:01 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> From: Pedro Falcato <pfalcato@suse.de>
>
> kfree() is a little shorter and works with kmem_cache_alloc'd pointers
> too. Also lets us remove one more helper.
>
> Signed-off-by: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  lib/maple_tree.c | 13 ++++---------
>  1 file changed, 4 insertions(+), 9 deletions(-)
>
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index c706e2e48f884fd156e25be2b17eb5e154774db7..0439aaacf6cb1f39d0d23af2e2a5af1d27ab32be 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -177,11 +177,6 @@ static inline int mt_alloc_bulk(gfp_t gfp, size_t size, void **nodes)
>         return kmem_cache_alloc_bulk(maple_node_cache, gfp, size, nodes);
>  }
>
> -static inline void mt_free_one(struct maple_node *node)
> -{
> -       kmem_cache_free(maple_node_cache, node);
> -}
> -
>  static inline void mt_free_bulk(size_t size, void __rcu **nodes)
>  {
>         kmem_cache_free_bulk(maple_node_cache, size, (void **)nodes);
> @@ -5092,7 +5087,7 @@ static void mt_free_walk(struct rcu_head *head)
>         mt_free_bulk(node->slot_len, slots);
>
>  free_leaf:
> -       mt_free_one(node);
> +       kfree(node);
>  }
>
>  static inline void __rcu **mte_destroy_descend(struct maple_enode **enode,
> @@ -5176,7 +5171,7 @@ static void mt_destroy_walk(struct maple_enode *enode, struct maple_tree *mt,
>
>  free_leaf:
>         if (free)
> -               mt_free_one(node);
> +               kfree(node);
>         else
>                 mt_clear_meta(mt, node, node->type);
>  }
> @@ -5385,7 +5380,7 @@ void mas_destroy(struct ma_state *mas)
>                         mt_free_bulk(count, (void __rcu **)&node->slot[1]);
>                         total -= count;
>                 }
> -               mt_free_one(ma_mnode_ptr(node));
> +               kfree(ma_mnode_ptr(node));
>                 total--;
>         }
>
> @@ -6373,7 +6368,7 @@ static void mas_dup_free(struct ma_state *mas)
>         }
>
>         node = mte_to_node(mas->node);
> -       mt_free_one(node);
> +       kfree(node);
>  }
>
>  /*
>
> --
> 2.51.0
>


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 20/23] tools/testing: Add support for prefilled slab sheafs
  2025-09-10  8:01 ` [PATCH v8 20/23] tools/testing: Add support for prefilled slab sheafs Vlastimil Babka
@ 2025-09-27  0:28   ` Suren Baghdasaryan
  0 siblings, 0 replies; 95+ messages in thread
From: Suren Baghdasaryan @ 2025-09-27  0:28 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree

On Wed, Sep 10, 2025 at 1:01 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> From: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>
> Add the prefilled sheaf structs to the slab header and the associated
> functions to the testing/shared/linux.c file.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>
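
For anyone reading along, a minimal usage sketch of the testing-side API
added below (cache pointer, counts and error handling are illustrative
only):

	static int sheaf_usage_example(struct kmem_cache *cachep)
	{
		struct slab_sheaf *sheaf;
		void *obj;

		/* grab a sheaf guaranteed to hold at least 8 objects */
		sheaf = kmem_cache_prefill_sheaf(cachep, GFP_KERNEL, 8);
		if (!sheaf)
			return -ENOMEM;

		/* cheap allocation taken from the prefilled sheaf */
		obj = kmem_cache_alloc_from_sheaf(cachep, GFP_KERNEL, sheaf);
		if (obj)
			kmem_cache_free(cachep, obj);

		/* hand back whatever is left unused */
		kmem_cache_return_sheaf(cachep, GFP_KERNEL, sheaf);
		return 0;
	}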

> ---
>  tools/include/linux/slab.h   | 28 ++++++++++++++
>  tools/testing/shared/linux.c | 89 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 117 insertions(+)
>
> diff --git a/tools/include/linux/slab.h b/tools/include/linux/slab.h
> index c5c5cc6db5668be2cc94c29065ccfa7ca7b4bb08..94937a699402bd1f31887dfb52b6fd0a3c986f43 100644
> --- a/tools/include/linux/slab.h
> +++ b/tools/include/linux/slab.h
> @@ -123,6 +123,18 @@ struct kmem_cache_args {
>         void (*ctor)(void *);
>  };
>
> +struct slab_sheaf {
> +       union {
> +               struct list_head barn_list;
> +               /* only used for prefilled sheafs */
> +               unsigned int capacity;
> +       };
> +       struct kmem_cache *cache;
> +       unsigned int size;
> +       int node; /* only used for rcu_sheaf */
> +       void *objects[];
> +};
> +
>  static inline void *kzalloc(size_t size, gfp_t gfp)
>  {
>         return kmalloc(size, gfp | __GFP_ZERO);
> @@ -173,5 +185,21 @@ __kmem_cache_create(const char *name, unsigned int size, unsigned int align,
>  void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t size, void **list);
>  int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
>                           void **list);
> +struct slab_sheaf *
> +kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size);
> +
> +void *
> +kmem_cache_alloc_from_sheaf(struct kmem_cache *s, gfp_t gfp,
> +               struct slab_sheaf *sheaf);
> +
> +void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
> +               struct slab_sheaf *sheaf);
> +int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
> +               struct slab_sheaf **sheafp, unsigned int size);
> +
> +static inline unsigned int kmem_cache_sheaf_size(struct slab_sheaf *sheaf)
> +{
> +       return sheaf->size;
> +}
>
>  #endif         /* _TOOLS_SLAB_H */
> diff --git a/tools/testing/shared/linux.c b/tools/testing/shared/linux.c
> index 97b8412ccbb6d222604c7b397c53c65618d8d51b..4ceff7969b78cf8e33cd1e021c68bc9f8a02a7a1 100644
> --- a/tools/testing/shared/linux.c
> +++ b/tools/testing/shared/linux.c
> @@ -137,6 +137,12 @@ void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t size, void **list)
>         if (kmalloc_verbose)
>                 pr_debug("Bulk free %p[0-%zu]\n", list, size - 1);
>
> +       if (cachep->exec_callback) {
> +               if (cachep->callback)
> +                       cachep->callback(cachep->private);
> +               cachep->exec_callback = false;
> +       }
> +
>         pthread_mutex_lock(&cachep->lock);
>         for (int i = 0; i < size; i++)
>                 kmem_cache_free_locked(cachep, list[i]);
> @@ -242,6 +248,89 @@ __kmem_cache_create_args(const char *name, unsigned int size,
>         return ret;
>  }
>
> +struct slab_sheaf *
> +kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
> +{
> +       struct slab_sheaf *sheaf;
> +       unsigned int capacity;
> +
> +       if (s->exec_callback) {
> +               if (s->callback)
> +                       s->callback(s->private);
> +               s->exec_callback = false;
> +       }
> +
> +       capacity = max(size, s->sheaf_capacity);
> +
> +       sheaf = calloc(1, sizeof(*sheaf) + sizeof(void *) * capacity);
> +       if (!sheaf)
> +               return NULL;
> +
> +       sheaf->cache = s;
> +       sheaf->capacity = capacity;
> +       sheaf->size = kmem_cache_alloc_bulk(s, gfp, size, sheaf->objects);
> +       if (!sheaf->size) {
> +               free(sheaf);
> +               return NULL;
> +       }
> +
> +       return sheaf;
> +}
> +
> +int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
> +                struct slab_sheaf **sheafp, unsigned int size)
> +{
> +       struct slab_sheaf *sheaf = *sheafp;
> +       int refill;
> +
> +       if (sheaf->size >= size)
> +               return 0;
> +
> +       if (size > sheaf->capacity) {
> +               sheaf = kmem_cache_prefill_sheaf(s, gfp, size);
> +               if (!sheaf)
> +                       return -ENOMEM;
> +
> +               kmem_cache_return_sheaf(s, gfp, *sheafp);
> +               *sheafp = sheaf;
> +               return 0;
> +       }
> +
> +       refill = kmem_cache_alloc_bulk(s, gfp, size - sheaf->size,
> +                                      &sheaf->objects[sheaf->size]);
> +       if (!refill)
> +               return -ENOMEM;
> +
> +       sheaf->size += refill;
> +       return 0;
> +}
> +
> +void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
> +                struct slab_sheaf *sheaf)
> +{
> +       if (sheaf->size)
> +               kmem_cache_free_bulk(s, sheaf->size, &sheaf->objects[0]);
> +
> +       free(sheaf);
> +}
> +
> +void *
> +kmem_cache_alloc_from_sheaf(struct kmem_cache *s, gfp_t gfp,
> +               struct slab_sheaf *sheaf)
> +{
> +       void *obj;
> +
> +       if (sheaf->size == 0) {
> +               printf("Nothing left in sheaf!\n");
> +               return NULL;
> +       }
> +
> +       obj = sheaf->objects[--sheaf->size];
> +       sheaf->objects[sheaf->size] = NULL;
> +
> +       return obj;
> +}
> +
>  /*
>   * Test the test infrastructure for kem_cache_alloc/free and bulk counterparts.
>   */
>
> --
> 2.51.0
>


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 21/23] maple_tree: Prefilled sheaf conversion and testing
  2025-09-10  8:01 ` [PATCH v8 21/23] maple_tree: Prefilled sheaf conversion and testing Vlastimil Babka
@ 2025-09-27  1:08   ` Suren Baghdasaryan
  2025-09-29  7:30     ` Vlastimil Babka
  0 siblings, 1 reply; 95+ messages in thread
From: Suren Baghdasaryan @ 2025-09-27  1:08 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree

On Wed, Sep 10, 2025 at 1:01 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> From: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>
> Use prefilled sheaves instead of bulk allocations. This should speed up
> the allocations and the return path of unused allocations.
>
> Remove the push and pop of nodes from the maple state as this is now
> handled by the slab layer with sheaves.
>
> Testing has been removed as necessary since the features of the tree
> have been reduced.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Couple nits but otherwise looks great!

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  include/linux/maple_tree.h       |   6 +-
>  lib/maple_tree.c                 | 326 ++++++---------------------
>  tools/testing/radix-tree/maple.c | 461 ++-------------------------------------
>  tools/testing/shared/linux.c     |   5 +-
>  4 files changed, 88 insertions(+), 710 deletions(-)
>
> diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
> index bafe143b1f783202e27b32567fffee4149e8e266..166fd67e00d882b1e6de1f80c1b590bba7497cd3 100644
> --- a/include/linux/maple_tree.h
> +++ b/include/linux/maple_tree.h
> @@ -442,7 +442,8 @@ struct ma_state {
>         struct maple_enode *node;       /* The node containing this entry */
>         unsigned long min;              /* The minimum index of this node - implied pivot min */
>         unsigned long max;              /* The maximum index of this node - implied pivot max */
> -       struct maple_alloc *alloc;      /* Allocated nodes for this operation */
> +       struct slab_sheaf *sheaf;       /* Allocated nodes for this operation */
> +       unsigned long node_request;

No comment for this poor fella?
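
Something short along these lines would do (exact wording up to you):

	unsigned long node_request;	/* The number of nodes to request for this operation */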

>         enum maple_status status;       /* The status of the state (active, start, none, etc) */
>         unsigned char depth;            /* depth of tree descent during write */
>         unsigned char offset;
> @@ -490,7 +491,8 @@ struct ma_wr_state {
>                 .status = ma_start,                                     \
>                 .min = 0,                                               \
>                 .max = ULONG_MAX,                                       \
> -               .alloc = NULL,                                          \
> +               .node_request= 0,                                       \

Missing space in assignment.

> +               .sheaf = NULL,                                          \
>                 .mas_flags = 0,                                         \
>                 .store_type = wr_invalid,                               \
>         }
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index 0439aaacf6cb1f39d0d23af2e2a5af1d27ab32be..a3fcb20227e506ed209554cc8c041a53f7ef4903 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -182,6 +182,22 @@ static inline void mt_free_bulk(size_t size, void __rcu **nodes)
>         kmem_cache_free_bulk(maple_node_cache, size, (void **)nodes);
>  }
>
> +static void mt_return_sheaf(struct slab_sheaf *sheaf)
> +{
> +       kmem_cache_return_sheaf(maple_node_cache, GFP_NOWAIT, sheaf);
> +}
> +
> +static struct slab_sheaf *mt_get_sheaf(gfp_t gfp, int count)
> +{
> +       return kmem_cache_prefill_sheaf(maple_node_cache, gfp, count);
> +}
> +
> +static int mt_refill_sheaf(gfp_t gfp, struct slab_sheaf **sheaf,
> +               unsigned int size)
> +{
> +       return kmem_cache_refill_sheaf(maple_node_cache, gfp, sheaf, size);
> +}
> +
>  /*
>   * ma_free_rcu() - Use rcu callback to free a maple node
>   * @node: The node to free
> @@ -574,67 +590,6 @@ static __always_inline bool mte_dead_node(const struct maple_enode *enode)
>         return ma_dead_node(node);
>  }
>
> -/*
> - * mas_allocated() - Get the number of nodes allocated in a maple state.
> - * @mas: The maple state
> - *
> - * The ma_state alloc member is overloaded to hold a pointer to the first
> - * allocated node or to the number of requested nodes to allocate.  If bit 0 is
> - * set, then the alloc contains the number of requested nodes.  If there is an
> - * allocated node, then the total allocated nodes is in that node.
> - *
> - * Return: The total number of nodes allocated
> - */
> -static inline unsigned long mas_allocated(const struct ma_state *mas)
> -{
> -       if (!mas->alloc || ((unsigned long)mas->alloc & 0x1))
> -               return 0;
> -
> -       return mas->alloc->total;
> -}
> -
> -/*
> - * mas_set_alloc_req() - Set the requested number of allocations.
> - * @mas: the maple state
> - * @count: the number of allocations.
> - *
> - * The requested number of allocations is either in the first allocated node,
> - * located in @mas->alloc->request_count, or directly in @mas->alloc if there is
> - * no allocated node.  Set the request either in the node or do the necessary
> - * encoding to store in @mas->alloc directly.
> - */
> -static inline void mas_set_alloc_req(struct ma_state *mas, unsigned long count)
> -{
> -       if (!mas->alloc || ((unsigned long)mas->alloc & 0x1)) {
> -               if (!count)
> -                       mas->alloc = NULL;
> -               else
> -                       mas->alloc = (struct maple_alloc *)(((count) << 1U) | 1U);
> -               return;
> -       }
> -
> -       mas->alloc->request_count = count;
> -}
> -
> -/*
> - * mas_alloc_req() - get the requested number of allocations.
> - * @mas: The maple state
> - *
> - * The alloc count is either stored directly in @mas, or in
> - * @mas->alloc->request_count if there is at least one node allocated.  Decode
> - * the request count if it's stored directly in @mas->alloc.
> - *
> - * Return: The allocation request count.
> - */
> -static inline unsigned int mas_alloc_req(const struct ma_state *mas)
> -{
> -       if ((unsigned long)mas->alloc & 0x1)
> -               return (unsigned long)(mas->alloc) >> 1;
> -       else if (mas->alloc)
> -               return mas->alloc->request_count;
> -       return 0;
> -}
> -
>  /*
>   * ma_pivots() - Get a pointer to the maple node pivots.
>   * @node: the maple node
> @@ -1120,77 +1075,15 @@ static int mas_ascend(struct ma_state *mas)
>   */
>  static inline struct maple_node *mas_pop_node(struct ma_state *mas)
>  {
> -       struct maple_alloc *ret, *node = mas->alloc;
> -       unsigned long total = mas_allocated(mas);
> -       unsigned int req = mas_alloc_req(mas);
> +       struct maple_node *ret;
>
> -       /* nothing or a request pending. */
> -       if (WARN_ON(!total))
> +       if (WARN_ON_ONCE(!mas->sheaf))
>                 return NULL;
>
> -       if (total == 1) {
> -               /* single allocation in this ma_state */
> -               mas->alloc = NULL;
> -               ret = node;
> -               goto single_node;
> -       }
> -
> -       if (node->node_count == 1) {
> -               /* Single allocation in this node. */
> -               mas->alloc = node->slot[0];
> -               mas->alloc->total = node->total - 1;
> -               ret = node;
> -               goto new_head;
> -       }
> -       node->total--;
> -       ret = node->slot[--node->node_count];
> -       node->slot[node->node_count] = NULL;
> -
> -single_node:
> -new_head:
> -       if (req) {
> -               req++;
> -               mas_set_alloc_req(mas, req);
> -       }
> -
> +       ret = kmem_cache_alloc_from_sheaf(maple_node_cache, GFP_NOWAIT, mas->sheaf);
>         memset(ret, 0, sizeof(*ret));
> -       return (struct maple_node *)ret;
> -}
> -
> -/*
> - * mas_push_node() - Push a node back on the maple state allocation.
> - * @mas: The maple state
> - * @used: The used maple node
> - *
> - * Stores the maple node back into @mas->alloc for reuse.  Updates allocated and
> - * requested node count as necessary.
> - */
> -static inline void mas_push_node(struct ma_state *mas, struct maple_node *used)
> -{
> -       struct maple_alloc *reuse = (struct maple_alloc *)used;
> -       struct maple_alloc *head = mas->alloc;
> -       unsigned long count;
> -       unsigned int requested = mas_alloc_req(mas);
>
> -       count = mas_allocated(mas);
> -
> -       reuse->request_count = 0;
> -       reuse->node_count = 0;
> -       if (count) {
> -               if (head->node_count < MAPLE_ALLOC_SLOTS) {
> -                       head->slot[head->node_count++] = reuse;
> -                       head->total++;
> -                       goto done;
> -               }
> -               reuse->slot[0] = head;
> -               reuse->node_count = 1;
> -       }
> -
> -       reuse->total = count + 1;
> -       mas->alloc = reuse;
> -done:
> -       if (requested > 1)
> -               mas_set_alloc_req(mas, requested - 1);
> +       return ret;
>  }
>
>  /*
> @@ -1200,75 +1093,32 @@ static inline void mas_push_node(struct ma_state *mas, struct maple_node *used)
>   */
>  static inline void mas_alloc_nodes(struct ma_state *mas, gfp_t gfp)
>  {
> -       struct maple_alloc *node;
> -       unsigned long allocated = mas_allocated(mas);
> -       unsigned int requested = mas_alloc_req(mas);
> -       unsigned int count;
> -       void **slots = NULL;
> -       unsigned int max_req = 0;
> -
> -       if (!requested)
> -               return;
> +       if (unlikely(mas->sheaf)) {
> +               unsigned long refill = mas->node_request;
>
> -       mas_set_alloc_req(mas, 0);
> -       if (mas->mas_flags & MA_STATE_PREALLOC) {
> -               if (allocated)
> +               if(kmem_cache_sheaf_size(mas->sheaf) >= refill) {
> +                       mas->node_request = 0;
>                         return;
> -               WARN_ON(!allocated);
> -       }
> -
> -       if (!allocated || mas->alloc->node_count == MAPLE_ALLOC_SLOTS) {
> -               node = (struct maple_alloc *)mt_alloc_one(gfp);
> -               if (!node)
> -                       goto nomem_one;
> -
> -               if (allocated) {
> -                       node->slot[0] = mas->alloc;
> -                       node->node_count = 1;
> -               } else {
> -                       node->node_count = 0;
>                 }
>
> -               mas->alloc = node;
> -               node->total = ++allocated;
> -               node->request_count = 0;
> -               requested--;
> -       }
> +               if (mt_refill_sheaf(gfp, &mas->sheaf, refill))
> +                       goto error;
>
> -       node = mas->alloc;
> -       while (requested) {
> -               max_req = MAPLE_ALLOC_SLOTS - node->node_count;
> -               slots = (void **)&node->slot[node->node_count];
> -               max_req = min(requested, max_req);
> -               count = mt_alloc_bulk(gfp, max_req, slots);
> -               if (!count)
> -                       goto nomem_bulk;
> -
> -               if (node->node_count == 0) {
> -                       node->slot[0]->node_count = 0;
> -                       node->slot[0]->request_count = 0;
> -               }
> +               mas->node_request = 0;
> +               return;
> +       }
>
> -               node->node_count += count;
> -               allocated += count;
> -               /* find a non-full node*/
> -               do {
> -                       node = node->slot[0];
> -               } while (unlikely(node->node_count == MAPLE_ALLOC_SLOTS));
> -               requested -= count;
> +       mas->sheaf = mt_get_sheaf(gfp, mas->node_request);
> +       if (likely(mas->sheaf)) {
> +               mas->node_request = 0;
> +               return;
>         }
> -       mas->alloc->total = allocated;
> -       return;
>
> -nomem_bulk:
> -       /* Clean up potential freed allocations on bulk failure */
> -       memset(slots, 0, max_req * sizeof(unsigned long));
> -       mas->alloc->total = allocated;
> -nomem_one:
> -       mas_set_alloc_req(mas, requested);
> +error:
>         mas_set_err(mas, -ENOMEM);
>  }
>
> +
>  /*
>   * mas_free() - Free an encoded maple node
>   * @mas: The maple state
> @@ -1279,42 +1129,7 @@ static inline void mas_alloc_nodes(struct ma_state *mas, gfp_t gfp)
>   */
>  static inline void mas_free(struct ma_state *mas, struct maple_enode *used)
>  {
> -       struct maple_node *tmp = mte_to_node(used);
> -
> -       if (mt_in_rcu(mas->tree))
> -               ma_free_rcu(tmp);
> -       else
> -               mas_push_node(mas, tmp);
> -}
> -
> -/*
> - * mas_node_count_gfp() - Check if enough nodes are allocated and request more
> - * if there is not enough nodes.
> - * @mas: The maple state
> - * @count: The number of nodes needed
> - * @gfp: the gfp flags
> - */
> -static void mas_node_count_gfp(struct ma_state *mas, int count, gfp_t gfp)
> -{
> -       unsigned long allocated = mas_allocated(mas);
> -
> -       if (allocated < count) {
> -               mas_set_alloc_req(mas, count - allocated);
> -               mas_alloc_nodes(mas, gfp);
> -       }
> -}
> -
> -/*
> - * mas_node_count() - Check if enough nodes are allocated and request more if
> - * there is not enough nodes.
> - * @mas: The maple state
> - * @count: The number of nodes needed
> - *
> - * Note: Uses GFP_NOWAIT for gfp flags.
> - */
> -static void mas_node_count(struct ma_state *mas, int count)
> -{
> -       return mas_node_count_gfp(mas, count, GFP_NOWAIT);
> +       ma_free_rcu(mte_to_node(used));
>  }
>
>  /*
> @@ -2451,10 +2266,7 @@ static inline void mas_topiary_node(struct ma_state *mas,
>         enode = tmp_mas->node;
>         tmp = mte_to_node(enode);
>         mte_set_node_dead(enode);
> -       if (in_rcu)
> -               ma_free_rcu(tmp);
> -       else
> -               mas_push_node(mas, tmp);
> +       ma_free_rcu(tmp);
>  }
>
>  /*
> @@ -3980,7 +3792,7 @@ static inline void mas_wr_prealloc_setup(struct ma_wr_state *wr_mas)
>   *
>   * Return: Number of nodes required for preallocation.
>   */
> -static inline int mas_prealloc_calc(struct ma_wr_state *wr_mas, void *entry)
> +static inline void mas_prealloc_calc(struct ma_wr_state *wr_mas, void *entry)
>  {
>         struct ma_state *mas = wr_mas->mas;
>         unsigned char height = mas_mt_height(mas);
> @@ -4026,7 +3838,7 @@ static inline int mas_prealloc_calc(struct ma_wr_state *wr_mas, void *entry)
>                 WARN_ON_ONCE(1);
>         }
>
> -       return ret;
> +       mas->node_request = ret;
>  }
>
>  /*
> @@ -4087,15 +3899,15 @@ static inline enum store_type mas_wr_store_type(struct ma_wr_state *wr_mas)
>   */
>  static inline void mas_wr_preallocate(struct ma_wr_state *wr_mas, void *entry)
>  {
> -       int request;
> +       struct ma_state *mas = wr_mas->mas;
>
>         mas_wr_prealloc_setup(wr_mas);
> -       wr_mas->mas->store_type = mas_wr_store_type(wr_mas);
> -       request = mas_prealloc_calc(wr_mas, entry);
> -       if (!request)
> +       mas->store_type = mas_wr_store_type(wr_mas);
> +       mas_prealloc_calc(wr_mas, entry);
> +       if (!mas->node_request)
>                 return;
>
> -       mas_node_count(wr_mas->mas, request);
> +       mas_alloc_nodes(mas, GFP_NOWAIT);
>  }
>
>  /**
> @@ -5208,7 +5020,6 @@ static inline void mte_destroy_walk(struct maple_enode *enode,
>   */
>  void *mas_store(struct ma_state *mas, void *entry)
>  {
> -       int request;
>         MA_WR_STATE(wr_mas, mas, entry);
>
>         trace_ma_write(__func__, mas, 0, entry);
> @@ -5238,11 +5049,11 @@ void *mas_store(struct ma_state *mas, void *entry)
>                 return wr_mas.content;
>         }
>
> -       request = mas_prealloc_calc(&wr_mas, entry);
> -       if (!request)
> +       mas_prealloc_calc(&wr_mas, entry);
> +       if (!mas->node_request)
>                 goto store;
>
> -       mas_node_count(mas, request);
> +       mas_alloc_nodes(mas, GFP_NOWAIT);
>         if (mas_is_err(mas))
>                 return NULL;
>
> @@ -5330,20 +5141,19 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc);
>  int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
>  {
>         MA_WR_STATE(wr_mas, mas, entry);
> -       int ret = 0;
> -       int request;
>
>         mas_wr_prealloc_setup(&wr_mas);
>         mas->store_type = mas_wr_store_type(&wr_mas);
> -       request = mas_prealloc_calc(&wr_mas, entry);
> -       if (!request)
> +       mas_prealloc_calc(&wr_mas, entry);
> +       if (!mas->node_request)
>                 goto set_flag;
>
>         mas->mas_flags &= ~MA_STATE_PREALLOC;
> -       mas_node_count_gfp(mas, request, gfp);
> +       mas_alloc_nodes(mas, gfp);
>         if (mas_is_err(mas)) {
> -               mas_set_alloc_req(mas, 0);
> -               ret = xa_err(mas->node);
> +               int ret = xa_err(mas->node);
> +
> +               mas->node_request = 0;
>                 mas_destroy(mas);
>                 mas_reset(mas);
>                 return ret;
> @@ -5351,7 +5161,7 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
>
>  set_flag:
>         mas->mas_flags |= MA_STATE_PREALLOC;
> -       return ret;
> +       return 0;
>  }
>  EXPORT_SYMBOL_GPL(mas_preallocate);
>
> @@ -5365,26 +5175,13 @@ EXPORT_SYMBOL_GPL(mas_preallocate);
>   */
>  void mas_destroy(struct ma_state *mas)
>  {
> -       struct maple_alloc *node;
> -       unsigned long total;
> -
>         mas->mas_flags &= ~MA_STATE_PREALLOC;
>
> -       total = mas_allocated(mas);
> -       while (total) {
> -               node = mas->alloc;
> -               mas->alloc = node->slot[0];
> -               if (node->node_count > 1) {
> -                       size_t count = node->node_count - 1;
> -
> -                       mt_free_bulk(count, (void __rcu **)&node->slot[1]);
> -                       total -= count;
> -               }
> -               kfree(ma_mnode_ptr(node));
> -               total--;
> -       }
> +       mas->node_request = 0;
> +       if (mas->sheaf)
> +               mt_return_sheaf(mas->sheaf);
>
> -       mas->alloc = NULL;
> +       mas->sheaf = NULL;
>  }
>  EXPORT_SYMBOL_GPL(mas_destroy);
>
> @@ -6019,7 +5816,7 @@ bool mas_nomem(struct ma_state *mas, gfp_t gfp)
>                 mas_alloc_nodes(mas, gfp);
>         }
>
> -       if (!mas_allocated(mas))
> +       if (!mas->sheaf)
>                 return false;
>
>         mas->status = ma_start;
> @@ -7414,8 +7211,9 @@ void mas_dump(const struct ma_state *mas)
>
>         pr_err("[%u/%u] index=%lx last=%lx\n", mas->offset, mas->end,
>                mas->index, mas->last);
> -       pr_err("     min=%lx max=%lx alloc=" PTR_FMT ", depth=%u, flags=%x\n",
> -              mas->min, mas->max, mas->alloc, mas->depth, mas->mas_flags);
> +       pr_err("     min=%lx max=%lx sheaf=" PTR_FMT ", request %lu depth=%u, flags=%x\n",
> +              mas->min, mas->max, mas->sheaf, mas->node_request, mas->depth,
> +              mas->mas_flags);
>         if (mas->index > mas->last)
>                 pr_err("Check index & last\n");
>  }
> diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
> index 4a35e1e7c64b7ce347cbd1693beeaacb0c4c330e..72a8fe8e832a4150c6567b711768eba6a3fa6768 100644
> --- a/tools/testing/radix-tree/maple.c
> +++ b/tools/testing/radix-tree/maple.c
> @@ -57,430 +57,6 @@ struct rcu_reader_struct {
>         struct rcu_test_struct2 *test;
>  };
>
> -static int get_alloc_node_count(struct ma_state *mas)
> -{
> -       int count = 1;
> -       struct maple_alloc *node = mas->alloc;
> -
> -       if (!node || ((unsigned long)node & 0x1))
> -               return 0;
> -       while (node->node_count) {
> -               count += node->node_count;
> -               node = node->slot[0];
> -       }
> -       return count;
> -}
> -
> -static void check_mas_alloc_node_count(struct ma_state *mas)
> -{
> -       mas_node_count_gfp(mas, MAPLE_ALLOC_SLOTS + 1, GFP_KERNEL);
> -       mas_node_count_gfp(mas, MAPLE_ALLOC_SLOTS + 3, GFP_KERNEL);
> -       MT_BUG_ON(mas->tree, get_alloc_node_count(mas) != mas->alloc->total);
> -       mas_destroy(mas);
> -}
> -
> -/*
> - * check_new_node() - Check the creation of new nodes and error path
> - * verification.
> - */
> -static noinline void __init check_new_node(struct maple_tree *mt)
> -{
> -
> -       struct maple_node *mn, *mn2, *mn3;
> -       struct maple_alloc *smn;
> -       struct maple_node *nodes[100];
> -       int i, j, total;
> -
> -       MA_STATE(mas, mt, 0, 0);
> -
> -       check_mas_alloc_node_count(&mas);
> -
> -       /* Try allocating 3 nodes */
> -       mtree_lock(mt);
> -       mt_set_non_kernel(0);
> -       /* request 3 nodes to be allocated. */
> -       mas_node_count(&mas, 3);
> -       /* Allocation request of 3. */
> -       MT_BUG_ON(mt, mas_alloc_req(&mas) != 3);
> -       /* Allocate failed. */
> -       MT_BUG_ON(mt, mas.node != MA_ERROR(-ENOMEM));
> -       MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
> -
> -       MT_BUG_ON(mt, mas_allocated(&mas) != 3);
> -       mn = mas_pop_node(&mas);
> -       MT_BUG_ON(mt, not_empty(mn));
> -       MT_BUG_ON(mt, mn == NULL);
> -       MT_BUG_ON(mt, mas.alloc == NULL);
> -       MT_BUG_ON(mt, mas.alloc->slot[0] == NULL);
> -       mas_push_node(&mas, mn);
> -       mas_reset(&mas);
> -       mas_destroy(&mas);
> -       mtree_unlock(mt);
> -
> -
> -       /* Try allocating 1 node, then 2 more */
> -       mtree_lock(mt);
> -       /* Set allocation request to 1. */
> -       mas_set_alloc_req(&mas, 1);
> -       /* Check Allocation request of 1. */
> -       MT_BUG_ON(mt, mas_alloc_req(&mas) != 1);
> -       mas_set_err(&mas, -ENOMEM);
> -       /* Validate allocation request. */
> -       MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
> -       /* Eat the requested node. */
> -       mn = mas_pop_node(&mas);
> -       MT_BUG_ON(mt, not_empty(mn));
> -       MT_BUG_ON(mt, mn == NULL);
> -       MT_BUG_ON(mt, mn->slot[0] != NULL);
> -       MT_BUG_ON(mt, mn->slot[1] != NULL);
> -       MT_BUG_ON(mt, mas_allocated(&mas) != 0);
> -
> -       mn->parent = ma_parent_ptr(mn);
> -       ma_free_rcu(mn);
> -       mas.status = ma_start;
> -       mas_destroy(&mas);
> -       /* Allocate 3 nodes, will fail. */
> -       mas_node_count(&mas, 3);
> -       /* Drop the lock and allocate 3 nodes. */
> -       mas_nomem(&mas, GFP_KERNEL);
> -       /* Ensure 3 are allocated. */
> -       MT_BUG_ON(mt, mas_allocated(&mas) != 3);
> -       /* Allocation request of 0. */
> -       MT_BUG_ON(mt, mas_alloc_req(&mas) != 0);
> -
> -       MT_BUG_ON(mt, mas.alloc == NULL);
> -       MT_BUG_ON(mt, mas.alloc->slot[0] == NULL);
> -       MT_BUG_ON(mt, mas.alloc->slot[1] == NULL);
> -       /* Ensure we counted 3. */
> -       MT_BUG_ON(mt, mas_allocated(&mas) != 3);
> -       /* Free. */
> -       mas_reset(&mas);
> -       mas_destroy(&mas);
> -
> -       /* Set allocation request to 1. */
> -       mas_set_alloc_req(&mas, 1);
> -       MT_BUG_ON(mt, mas_alloc_req(&mas) != 1);
> -       mas_set_err(&mas, -ENOMEM);
> -       /* Validate allocation request. */
> -       MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
> -       MT_BUG_ON(mt, mas_allocated(&mas) != 1);
> -       /* Check the node is only one node. */
> -       mn = mas_pop_node(&mas);
> -       MT_BUG_ON(mt, not_empty(mn));
> -       MT_BUG_ON(mt, mas_allocated(&mas) != 0);
> -       MT_BUG_ON(mt, mn == NULL);
> -       MT_BUG_ON(mt, mn->slot[0] != NULL);
> -       MT_BUG_ON(mt, mn->slot[1] != NULL);
> -       MT_BUG_ON(mt, mas_allocated(&mas) != 0);
> -       mas_push_node(&mas, mn);
> -       MT_BUG_ON(mt, mas_allocated(&mas) != 1);
> -       MT_BUG_ON(mt, mas.alloc->node_count);
> -
> -       mas_set_alloc_req(&mas, 2); /* request 2 more. */
> -       MT_BUG_ON(mt, mas_alloc_req(&mas) != 2);
> -       mas_set_err(&mas, -ENOMEM);
> -       MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
> -       MT_BUG_ON(mt, mas_allocated(&mas) != 3);
> -       MT_BUG_ON(mt, mas.alloc == NULL);
> -       MT_BUG_ON(mt, mas.alloc->slot[0] == NULL);
> -       MT_BUG_ON(mt, mas.alloc->slot[1] == NULL);
> -       for (i = 2; i >= 0; i--) {
> -               mn = mas_pop_node(&mas);
> -               MT_BUG_ON(mt, mas_allocated(&mas) != i);
> -               MT_BUG_ON(mt, !mn);
> -               MT_BUG_ON(mt, not_empty(mn));
> -               mn->parent = ma_parent_ptr(mn);
> -               ma_free_rcu(mn);
> -       }
> -
> -       total = 64;
> -       mas_set_alloc_req(&mas, total); /* request 2 more. */
> -       MT_BUG_ON(mt, mas_alloc_req(&mas) != total);
> -       mas_set_err(&mas, -ENOMEM);
> -       MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
> -       for (i = total; i > 0; i--) {
> -               unsigned int e = 0; /* expected node_count */
> -
> -               if (!MAPLE_32BIT) {
> -                       if (i >= 35)
> -                               e = i - 34;
> -                       else if (i >= 5)
> -                               e = i - 4;
> -                       else if (i >= 2)
> -                               e = i - 1;
> -               } else {
> -                       if (i >= 4)
> -                               e = i - 3;
> -                       else if (i >= 1)
> -                               e = i - 1;
> -                       else
> -                               e = 0;
> -               }
> -
> -               MT_BUG_ON(mt, mas.alloc->node_count != e);
> -               mn = mas_pop_node(&mas);
> -               MT_BUG_ON(mt, not_empty(mn));
> -               MT_BUG_ON(mt, mas_allocated(&mas) != i - 1);
> -               MT_BUG_ON(mt, !mn);
> -               mn->parent = ma_parent_ptr(mn);
> -               ma_free_rcu(mn);
> -       }
> -
> -       total = 100;
> -       for (i = 1; i < total; i++) {
> -               mas_set_alloc_req(&mas, i);
> -               mas_set_err(&mas, -ENOMEM);
> -               MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
> -               for (j = i; j > 0; j--) {
> -                       mn = mas_pop_node(&mas);
> -                       MT_BUG_ON(mt, mas_allocated(&mas) != j - 1);
> -                       MT_BUG_ON(mt, !mn);
> -                       MT_BUG_ON(mt, not_empty(mn));
> -                       mas_push_node(&mas, mn);
> -                       MT_BUG_ON(mt, mas_allocated(&mas) != j);
> -                       mn = mas_pop_node(&mas);
> -                       MT_BUG_ON(mt, not_empty(mn));
> -                       MT_BUG_ON(mt, mas_allocated(&mas) != j - 1);
> -                       mn->parent = ma_parent_ptr(mn);
> -                       ma_free_rcu(mn);
> -               }
> -               MT_BUG_ON(mt, mas_allocated(&mas) != 0);
> -
> -               mas_set_alloc_req(&mas, i);
> -               mas_set_err(&mas, -ENOMEM);
> -               MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
> -               for (j = 0; j <= i/2; j++) {
> -                       MT_BUG_ON(mt, mas_allocated(&mas) != i - j);
> -                       nodes[j] = mas_pop_node(&mas);
> -                       MT_BUG_ON(mt, mas_allocated(&mas) != i - j - 1);
> -               }
> -
> -               while (j) {
> -                       j--;
> -                       mas_push_node(&mas, nodes[j]);
> -                       MT_BUG_ON(mt, mas_allocated(&mas) != i - j);
> -               }
> -               MT_BUG_ON(mt, mas_allocated(&mas) != i);
> -               for (j = 0; j <= i/2; j++) {
> -                       MT_BUG_ON(mt, mas_allocated(&mas) != i - j);
> -                       mn = mas_pop_node(&mas);
> -                       MT_BUG_ON(mt, not_empty(mn));
> -                       mn->parent = ma_parent_ptr(mn);
> -                       ma_free_rcu(mn);
> -                       MT_BUG_ON(mt, mas_allocated(&mas) != i - j - 1);
> -               }
> -               mas_reset(&mas);
> -               MT_BUG_ON(mt, mas_nomem(&mas, GFP_KERNEL));
> -               mas_destroy(&mas);
> -
> -       }
> -
> -       /* Set allocation request. */
> -       total = 500;
> -       mas_node_count(&mas, total);
> -       /* Drop the lock and allocate the nodes. */
> -       mas_nomem(&mas, GFP_KERNEL);
> -       MT_BUG_ON(mt, !mas.alloc);
> -       i = 1;
> -       smn = mas.alloc;
> -       while (i < total) {
> -               for (j = 0; j < MAPLE_ALLOC_SLOTS; j++) {
> -                       i++;
> -                       MT_BUG_ON(mt, !smn->slot[j]);
> -                       if (i == total)
> -                               break;
> -               }
> -               smn = smn->slot[0]; /* next. */
> -       }
> -       MT_BUG_ON(mt, mas_allocated(&mas) != total);
> -       mas_reset(&mas);
> -       mas_destroy(&mas); /* Free. */
> -
> -       MT_BUG_ON(mt, mas_allocated(&mas) != 0);
> -       for (i = 1; i < 128; i++) {
> -               mas_node_count(&mas, i); /* Request */
> -               mas_nomem(&mas, GFP_KERNEL); /* Fill request */
> -               MT_BUG_ON(mt, mas_allocated(&mas) != i); /* check request filled */
> -               for (j = i; j > 0; j--) { /*Free the requests */
> -                       mn = mas_pop_node(&mas); /* get the next node. */
> -                       MT_BUG_ON(mt, mn == NULL);
> -                       MT_BUG_ON(mt, not_empty(mn));
> -                       mn->parent = ma_parent_ptr(mn);
> -                       ma_free_rcu(mn);
> -               }
> -               MT_BUG_ON(mt, mas_allocated(&mas) != 0);
> -       }
> -
> -       for (i = 1; i < MAPLE_NODE_MASK + 1; i++) {
> -               MA_STATE(mas2, mt, 0, 0);
> -               mas_node_count(&mas, i); /* Request */
> -               mas_nomem(&mas, GFP_KERNEL); /* Fill request */
> -               MT_BUG_ON(mt, mas_allocated(&mas) != i); /* check request filled */
> -               for (j = 1; j <= i; j++) { /* Move the allocations to mas2 */
> -                       mn = mas_pop_node(&mas); /* get the next node. */
> -                       MT_BUG_ON(mt, mn == NULL);
> -                       MT_BUG_ON(mt, not_empty(mn));
> -                       mas_push_node(&mas2, mn);
> -                       MT_BUG_ON(mt, mas_allocated(&mas2) != j);
> -               }
> -               MT_BUG_ON(mt, mas_allocated(&mas) != 0);
> -               MT_BUG_ON(mt, mas_allocated(&mas2) != i);
> -
> -               for (j = i; j > 0; j--) { /*Free the requests */
> -                       MT_BUG_ON(mt, mas_allocated(&mas2) != j);
> -                       mn = mas_pop_node(&mas2); /* get the next node. */
> -                       MT_BUG_ON(mt, mn == NULL);
> -                       MT_BUG_ON(mt, not_empty(mn));
> -                       mn->parent = ma_parent_ptr(mn);
> -                       ma_free_rcu(mn);
> -               }
> -               MT_BUG_ON(mt, mas_allocated(&mas2) != 0);
> -       }
> -
> -
> -       MT_BUG_ON(mt, mas_allocated(&mas) != 0);
> -       mas_node_count(&mas, MAPLE_ALLOC_SLOTS + 1); /* Request */
> -       MT_BUG_ON(mt, mas.node != MA_ERROR(-ENOMEM));
> -       MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
> -       MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS + 1);
> -       MT_BUG_ON(mt, mas.alloc->node_count != MAPLE_ALLOC_SLOTS);
> -
> -       mn = mas_pop_node(&mas); /* get the next node. */
> -       MT_BUG_ON(mt, mn == NULL);
> -       MT_BUG_ON(mt, not_empty(mn));
> -       MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS);
> -       MT_BUG_ON(mt, mas.alloc->node_count != MAPLE_ALLOC_SLOTS - 1);
> -
> -       mas_push_node(&mas, mn);
> -       MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS + 1);
> -       MT_BUG_ON(mt, mas.alloc->node_count != MAPLE_ALLOC_SLOTS);
> -
> -       /* Check the limit of pop/push/pop */
> -       mas_node_count(&mas, MAPLE_ALLOC_SLOTS + 2); /* Request */
> -       MT_BUG_ON(mt, mas_alloc_req(&mas) != 1);
> -       MT_BUG_ON(mt, mas.node != MA_ERROR(-ENOMEM));
> -       MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
> -       MT_BUG_ON(mt, mas_alloc_req(&mas));
> -       MT_BUG_ON(mt, mas.alloc->node_count != 1);
> -       MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS + 2);
> -       mn = mas_pop_node(&mas);
> -       MT_BUG_ON(mt, not_empty(mn));
> -       MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS + 1);
> -       MT_BUG_ON(mt, mas.alloc->node_count  != MAPLE_ALLOC_SLOTS);
> -       mas_push_node(&mas, mn);
> -       MT_BUG_ON(mt, mas.alloc->node_count != 1);
> -       MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS + 2);
> -       mn = mas_pop_node(&mas);
> -       MT_BUG_ON(mt, not_empty(mn));
> -       mn->parent = ma_parent_ptr(mn);
> -       ma_free_rcu(mn);
> -       for (i = 1; i <= MAPLE_ALLOC_SLOTS + 1; i++) {
> -               mn = mas_pop_node(&mas);
> -               MT_BUG_ON(mt, not_empty(mn));
> -               mn->parent = ma_parent_ptr(mn);
> -               ma_free_rcu(mn);
> -       }
> -       MT_BUG_ON(mt, mas_allocated(&mas) != 0);
> -
> -
> -       for (i = 3; i < MAPLE_NODE_MASK * 3; i++) {
> -               mas.node = MA_ERROR(-ENOMEM);
> -               mas_node_count(&mas, i); /* Request */
> -               mas_nomem(&mas, GFP_KERNEL); /* Fill request */
> -               mn = mas_pop_node(&mas); /* get the next node. */
> -               mas_push_node(&mas, mn); /* put it back */
> -               mas_destroy(&mas);
> -
> -               mas.node = MA_ERROR(-ENOMEM);
> -               mas_node_count(&mas, i); /* Request */
> -               mas_nomem(&mas, GFP_KERNEL); /* Fill request */
> -               mn = mas_pop_node(&mas); /* get the next node. */
> -               mn2 = mas_pop_node(&mas); /* get the next node. */
> -               mas_push_node(&mas, mn); /* put them back */
> -               mas_push_node(&mas, mn2);
> -               mas_destroy(&mas);
> -
> -               mas.node = MA_ERROR(-ENOMEM);
> -               mas_node_count(&mas, i); /* Request */
> -               mas_nomem(&mas, GFP_KERNEL); /* Fill request */
> -               mn = mas_pop_node(&mas); /* get the next node. */
> -               mn2 = mas_pop_node(&mas); /* get the next node. */
> -               mn3 = mas_pop_node(&mas); /* get the next node. */
> -               mas_push_node(&mas, mn); /* put them back */
> -               mas_push_node(&mas, mn2);
> -               mas_push_node(&mas, mn3);
> -               mas_destroy(&mas);
> -
> -               mas.node = MA_ERROR(-ENOMEM);
> -               mas_node_count(&mas, i); /* Request */
> -               mas_nomem(&mas, GFP_KERNEL); /* Fill request */
> -               mn = mas_pop_node(&mas); /* get the next node. */
> -               mn->parent = ma_parent_ptr(mn);
> -               ma_free_rcu(mn);
> -               mas_destroy(&mas);
> -
> -               mas.node = MA_ERROR(-ENOMEM);
> -               mas_node_count(&mas, i); /* Request */
> -               mas_nomem(&mas, GFP_KERNEL); /* Fill request */
> -               mn = mas_pop_node(&mas); /* get the next node. */
> -               mn->parent = ma_parent_ptr(mn);
> -               ma_free_rcu(mn);
> -               mn = mas_pop_node(&mas); /* get the next node. */
> -               mn->parent = ma_parent_ptr(mn);
> -               ma_free_rcu(mn);
> -               mn = mas_pop_node(&mas); /* get the next node. */
> -               mn->parent = ma_parent_ptr(mn);
> -               ma_free_rcu(mn);
> -               mas_destroy(&mas);
> -       }
> -
> -       mas.node = MA_ERROR(-ENOMEM);
> -       mas_node_count(&mas, 5); /* Request */
> -       mas_nomem(&mas, GFP_KERNEL); /* Fill request */
> -       MT_BUG_ON(mt, mas_allocated(&mas) != 5);
> -       mas.node = MA_ERROR(-ENOMEM);
> -       mas_node_count(&mas, 10); /* Request */
> -       mas_nomem(&mas, GFP_KERNEL); /* Fill request */
> -       mas.status = ma_start;
> -       MT_BUG_ON(mt, mas_allocated(&mas) != 10);
> -       mas_destroy(&mas);
> -
> -       mas.node = MA_ERROR(-ENOMEM);
> -       mas_node_count(&mas, MAPLE_ALLOC_SLOTS - 1); /* Request */
> -       mas_nomem(&mas, GFP_KERNEL); /* Fill request */
> -       MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS - 1);
> -       mas.node = MA_ERROR(-ENOMEM);
> -       mas_node_count(&mas, 10 + MAPLE_ALLOC_SLOTS - 1); /* Request */
> -       mas_nomem(&mas, GFP_KERNEL); /* Fill request */
> -       mas.status = ma_start;
> -       MT_BUG_ON(mt, mas_allocated(&mas) != 10 + MAPLE_ALLOC_SLOTS - 1);
> -       mas_destroy(&mas);
> -
> -       mas.node = MA_ERROR(-ENOMEM);
> -       mas_node_count(&mas, MAPLE_ALLOC_SLOTS + 1); /* Request */
> -       mas_nomem(&mas, GFP_KERNEL); /* Fill request */
> -       MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS + 1);
> -       mas.node = MA_ERROR(-ENOMEM);
> -       mas_node_count(&mas, MAPLE_ALLOC_SLOTS * 2 + 2); /* Request */
> -       mas_nomem(&mas, GFP_KERNEL); /* Fill request */
> -       mas.status = ma_start;
> -       MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS * 2 + 2);
> -       mas_destroy(&mas);
> -
> -       mas.node = MA_ERROR(-ENOMEM);
> -       mas_node_count(&mas, MAPLE_ALLOC_SLOTS * 2 + 1); /* Request */
> -       mas_nomem(&mas, GFP_KERNEL); /* Fill request */
> -       MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS * 2 + 1);
> -       mas.node = MA_ERROR(-ENOMEM);
> -       mas_node_count(&mas, MAPLE_ALLOC_SLOTS * 3 + 2); /* Request */
> -       mas_nomem(&mas, GFP_KERNEL); /* Fill request */
> -       mas.status = ma_start;
> -       MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS * 3 + 2);
> -       mas_destroy(&mas);
> -
> -       mtree_unlock(mt);
> -}
> -
>  /*
>   * Check erasing including RCU.
>   */
> @@ -35507,6 +35083,13 @@ static unsigned char get_vacant_height(struct ma_wr_state *wr_mas, void *entry)
>         return vacant_height;
>  }
>
> +static int mas_allocated(struct ma_state *mas)
> +{
> +       if (mas->sheaf)
> +               return kmem_cache_sheaf_size(mas->sheaf);
> +
> +       return 0;
> +}
>  /* Preallocation testing */
>  static noinline void __init check_prealloc(struct maple_tree *mt)
>  {
> @@ -35525,7 +35108,10 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
>
>         /* Spanning store */
>         mas_set_range(&mas, 470, 500);
> -       MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
> +
> +       mas_wr_preallocate(&wr_mas, ptr);
> +       MT_BUG_ON(mt, mas.store_type != wr_spanning_store);
> +       MT_BUG_ON(mt, mas_is_err(&mas));
>         allocated = mas_allocated(&mas);
>         height = mas_mt_height(&mas);
>         vacant_height = get_vacant_height(&wr_mas, ptr);
> @@ -35535,6 +35121,7 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
>         allocated = mas_allocated(&mas);
>         MT_BUG_ON(mt, allocated != 0);
>
> +       mas_wr_preallocate(&wr_mas, ptr);
>         MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
>         allocated = mas_allocated(&mas);
>         height = mas_mt_height(&mas);
> @@ -35575,20 +35162,6 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
>         mn->parent = ma_parent_ptr(mn);
>         ma_free_rcu(mn);
>
> -       MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
> -       allocated = mas_allocated(&mas);
> -       height = mas_mt_height(&mas);
> -       vacant_height = get_vacant_height(&wr_mas, ptr);
> -       MT_BUG_ON(mt, allocated != 1 + (height - vacant_height) * 3);
> -       mn = mas_pop_node(&mas);
> -       MT_BUG_ON(mt, mas_allocated(&mas) != allocated - 1);
> -       mas_push_node(&mas, mn);
> -       MT_BUG_ON(mt, mas_allocated(&mas) != allocated);
> -       MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
> -       mas_destroy(&mas);
> -       allocated = mas_allocated(&mas);
> -       MT_BUG_ON(mt, allocated != 0);
> -
>         MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
>         allocated = mas_allocated(&mas);
>         height = mas_mt_height(&mas);
> @@ -36389,11 +35962,17 @@ static void check_nomem_writer_race(struct maple_tree *mt)
>         check_load(mt, 6, xa_mk_value(0xC));
>         mtree_unlock(mt);
>
> +       mt_set_non_kernel(0);
>         /* test for the same race but with mas_store_gfp() */
>         mtree_store_range(mt, 0, 5, xa_mk_value(0xA), GFP_KERNEL);
>         mtree_store_range(mt, 6, 10, NULL, GFP_KERNEL);
>
>         mas_set_range(&mas, 0, 5);
> +
> +       /* setup writer 2 that will trigger the race condition */
> +       mt_set_private(mt);
> +       mt_set_callback(writer2);
> +
>         mtree_lock(mt);
>         mas_store_gfp(&mas, NULL, GFP_KERNEL);
>
> @@ -36508,10 +36087,6 @@ void farmer_tests(void)
>         check_erase_testset(&tree);
>         mtree_destroy(&tree);
>
> -       mt_init_flags(&tree, 0);
> -       check_new_node(&tree);
> -       mtree_destroy(&tree);
> -
>         if (!MAPLE_32BIT) {
>                 mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
>                 check_rcu_simulated(&tree);
> diff --git a/tools/testing/shared/linux.c b/tools/testing/shared/linux.c
> index 4ceff7969b78cf8e33cd1e021c68bc9f8a02a7a1..8c72571559583759456c2b469a2abc2611117c13 100644
> --- a/tools/testing/shared/linux.c
> +++ b/tools/testing/shared/linux.c
> @@ -64,7 +64,8 @@ void *kmem_cache_alloc_lru(struct kmem_cache *cachep, struct list_lru *lru,
>
>         if (!(gfp & __GFP_DIRECT_RECLAIM)) {
>                 if (!cachep->non_kernel) {
> -                       cachep->exec_callback = true;
> +                       if (cachep->callback)
> +                               cachep->exec_callback = true;
>                         return NULL;
>                 }
>
> @@ -210,6 +211,8 @@ int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
>                 for (i = 0; i < size; i++)
>                         __kmem_cache_free_locked(cachep, p[i]);
>                 pthread_mutex_unlock(&cachep->lock);
> +               if (cachep->callback)
> +                       cachep->exec_callback = true;
>                 return 0;
>         }
>
>
> --
> 2.51.0
>


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 22/23] maple_tree: Add single node allocation support to maple state
  2025-09-10  8:01 ` [PATCH v8 22/23] maple_tree: Add single node allocation support to maple state Vlastimil Babka
@ 2025-09-27  1:17   ` Suren Baghdasaryan
  2025-09-29  7:39     ` Vlastimil Babka
  0 siblings, 1 reply; 95+ messages in thread
From: Suren Baghdasaryan @ 2025-09-27  1:17 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree

On Wed, Sep 10, 2025 at 1:01 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> The fast path through a write will require replacing a single node in
> the tree.  Using a sheaf (32 nodes) is too heavy for the fast path, so
> special case the node store operation by just allocating one node in the
> maple state.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  include/linux/maple_tree.h       |  4 +++-
>  lib/maple_tree.c                 | 47 +++++++++++++++++++++++++++++++++++-----
>  tools/testing/radix-tree/maple.c |  9 ++++++--
>  3 files changed, 51 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
> index 166fd67e00d882b1e6de1f80c1b590bba7497cd3..562a1e9e5132b5b1fa8f8402a7cadd8abb65e323 100644
> --- a/include/linux/maple_tree.h
> +++ b/include/linux/maple_tree.h
> @@ -443,6 +443,7 @@ struct ma_state {
>         unsigned long min;              /* The minimum index of this node - implied pivot min */
>         unsigned long max;              /* The maximum index of this node - implied pivot max */
>         struct slab_sheaf *sheaf;       /* Allocated nodes for this operation */
> +       struct maple_node *alloc;       /* allocated nodes */
>         unsigned long node_request;
>         enum maple_status status;       /* The status of the state (active, start, none, etc) */
>         unsigned char depth;            /* depth of tree descent during write */
> @@ -491,8 +492,9 @@ struct ma_wr_state {
>                 .status = ma_start,                                     \
>                 .min = 0,                                               \
>                 .max = ULONG_MAX,                                       \
> -               .node_request= 0,                                       \
>                 .sheaf = NULL,                                          \
> +               .alloc = NULL,                                          \
> +               .node_request= 0,                                       \
>                 .mas_flags = 0,                                         \
>                 .store_type = wr_invalid,                               \
>         }
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index a3fcb20227e506ed209554cc8c041a53f7ef4903..a912e6a1d4378e72b967027b60f8f564476ad14e 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -1073,16 +1073,23 @@ static int mas_ascend(struct ma_state *mas)
>   *
>   * Return: A pointer to a maple node.
>   */
> -static inline struct maple_node *mas_pop_node(struct ma_state *mas)
> +static __always_inline struct maple_node *mas_pop_node(struct ma_state *mas)
>  {
>         struct maple_node *ret;
>
> +       if (mas->alloc) {
> +               ret = mas->alloc;
> +               mas->alloc = NULL;
> +               goto out;
> +       }
> +
>         if (WARN_ON_ONCE(!mas->sheaf))
>                 return NULL;
>
>         ret = kmem_cache_alloc_from_sheaf(maple_node_cache, GFP_NOWAIT, mas->sheaf);
> -       memset(ret, 0, sizeof(*ret));
>
> +out:
> +       memset(ret, 0, sizeof(*ret));
>         return ret;
>  }
>
> @@ -1093,9 +1100,34 @@ static inline struct maple_node *mas_pop_node(struct ma_state *mas)
>   */
>  static inline void mas_alloc_nodes(struct ma_state *mas, gfp_t gfp)
>  {
> -       if (unlikely(mas->sheaf)) {
> -               unsigned long refill = mas->node_request;
> +       if (!mas->node_request)
> +               return;
> +
> +       if (mas->node_request == 1) {
> +               if (mas->sheaf)
> +                       goto use_sheaf;

Hmm, I don't get the above logic. One node is requested and instead of
using possibly available mas->alloc, we jump to using mas->sheaf and
freeing mas->alloc... That does not sound efficient. What am I
missing?

> +
> +               if (mas->alloc)
> +                       return;
>
> +               mas->alloc = mt_alloc_one(gfp);
> +               if (!mas->alloc)
> +                       goto error;
> +
> +               mas->node_request = 0;
> +               return;
> +       }
> +
> +use_sheaf:
> +       if (unlikely(mas->alloc)) {
> +               kfree(mas->alloc);
> +               mas->alloc = NULL;
> +       }
> +
> +       if (mas->sheaf) {
> +               unsigned long refill;
> +
> +               refill = mas->node_request;
>                 if(kmem_cache_sheaf_size(mas->sheaf) >= refill) {
>                         mas->node_request = 0;
>                         return;
> @@ -5180,8 +5212,11 @@ void mas_destroy(struct ma_state *mas)
>         mas->node_request = 0;
>         if (mas->sheaf)
>                 mt_return_sheaf(mas->sheaf);
> -
>         mas->sheaf = NULL;
> +
> +       if (mas->alloc)
> +               kfree(mas->alloc);
> +       mas->alloc = NULL;
>  }
>  EXPORT_SYMBOL_GPL(mas_destroy);
>
> @@ -5816,7 +5851,7 @@ bool mas_nomem(struct ma_state *mas, gfp_t gfp)
>                 mas_alloc_nodes(mas, gfp);
>         }
>
> -       if (!mas->sheaf)
> +       if (!mas->sheaf && !mas->alloc)
>                 return false;
>
>         mas->status = ma_start;
> diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
> index 72a8fe8e832a4150c6567b711768eba6a3fa6768..83260f2efb1990b71093e456950069c24d75560e 100644
> --- a/tools/testing/radix-tree/maple.c
> +++ b/tools/testing/radix-tree/maple.c
> @@ -35085,10 +35085,15 @@ static unsigned char get_vacant_height(struct ma_wr_state *wr_mas, void *entry)
>
>  static int mas_allocated(struct ma_state *mas)
>  {
> +       int total = 0;
> +
> +       if (mas->alloc)
> +               total++;
> +
>         if (mas->sheaf)
> -               return kmem_cache_sheaf_size(mas->sheaf);
> +               total += kmem_cache_sheaf_size(mas->sheaf);
>
> -       return 0;
> +       return total;
>  }
>  /* Preallocation testing */
>  static noinline void __init check_prealloc(struct maple_tree *mt)
>
> --
> 2.51.0
>


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 21/23] maple_tree: Prefilled sheaf conversion and testing
  2025-09-27  1:08   ` Suren Baghdasaryan
@ 2025-09-29  7:30     ` Vlastimil Babka
  2025-09-29 16:51       ` Liam R. Howlett
  0 siblings, 1 reply; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-29  7:30 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree

On 9/27/25 03:08, Suren Baghdasaryan wrote:
> On Wed, Sep 10, 2025 at 1:01 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> From: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>>
>> Use prefilled sheaves instead of bulk allocations. This should speed up
>> the allocations and the return path of unused allocations.
>>
>> Remove the push and pop of nodes from the maple state as this is now
>> handled by the slab layer with sheaves.
>>
>> Testing has been removed as necessary since the features of the tree
>> have been reduced.
>>
>> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Couple nits but otherwise looks great!
> 
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> 
>> ---
>>  include/linux/maple_tree.h       |   6 +-
>>  lib/maple_tree.c                 | 326 ++++++---------------------
>>  tools/testing/radix-tree/maple.c | 461 ++-------------------------------------
>>  tools/testing/shared/linux.c     |   5 +-
>>  4 files changed, 88 insertions(+), 710 deletions(-)
>>
>> diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
>> index bafe143b1f783202e27b32567fffee4149e8e266..166fd67e00d882b1e6de1f80c1b590bba7497cd3 100644
>> --- a/include/linux/maple_tree.h
>> +++ b/include/linux/maple_tree.h
>> @@ -442,7 +442,8 @@ struct ma_state {
>>         struct maple_enode *node;       /* The node containing this entry */
>>         unsigned long min;              /* The minimum index of this node - implied pivot min */
>>         unsigned long max;              /* The maximum index of this node - implied pivot max */
>> -       struct maple_alloc *alloc;      /* Allocated nodes for this operation */
>> +       struct slab_sheaf *sheaf;       /* Allocated nodes for this operation */
>> +       unsigned long node_request;
> 
> No comment for this poor fella?

adding: /* The number of nodes to allocate for this operation */



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 22/23] maple_tree: Add single node allocation support to maple state
  2025-09-27  1:17   ` Suren Baghdasaryan
@ 2025-09-29  7:39     ` Vlastimil Babka
  0 siblings, 0 replies; 95+ messages in thread
From: Vlastimil Babka @ 2025-09-29  7:39 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree

On 9/27/25 03:17, Suren Baghdasaryan wrote:
> On Wed, Sep 10, 2025 at 1:01 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>>
>> The fast path through a write will require replacing a single node in
>> the tree.  Using a sheaf (32 nodes) is too heavy for the fast path, so
>> special case the node store operation by just allocating one node in the
>> maple state.
>>
>> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
>>  include/linux/maple_tree.h       |  4 +++-
>>  lib/maple_tree.c                 | 47 +++++++++++++++++++++++++++++++++++-----
>>  tools/testing/radix-tree/maple.c |  9 ++++++--
>>  3 files changed, 51 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
>> index 166fd67e00d882b1e6de1f80c1b590bba7497cd3..562a1e9e5132b5b1fa8f8402a7cadd8abb65e323 100644
>> --- a/include/linux/maple_tree.h
>> +++ b/include/linux/maple_tree.h
>> @@ -443,6 +443,7 @@ struct ma_state {
>>         unsigned long min;              /* The minimum index of this node - implied pivot min */
>>         unsigned long max;              /* The maximum index of this node - implied pivot max */
>>         struct slab_sheaf *sheaf;       /* Allocated nodes for this operation */
>> +       struct maple_node *alloc;       /* allocated nodes */

Replacing with: /* A single allocated node for fast path writes */

since I'm touching it anyway due to previous patch fixup.

>>
>> @@ -1093,9 +1100,34 @@ static inline struct maple_node *mas_pop_node(struct ma_state *mas)
>>   */
>>  static inline void mas_alloc_nodes(struct ma_state *mas, gfp_t gfp)
>>  {
>> -       if (unlikely(mas->sheaf)) {
>> -               unsigned long refill = mas->node_request;
>> +       if (!mas->node_request)
>> +               return;
>> +
>> +       if (mas->node_request == 1) {
>> +               if (mas->sheaf)
>> +                       goto use_sheaf;
> 
> Hmm, I don't get the above logic. One node is requested and instead of
> using possibly available mas->alloc, we jump to using mas->sheaf and
> freeing mas->alloc... That does not sound efficient. What am I
> missing?

I'm not changing it now due to the merge window; at this point I'm only
folding in cosmetic changes and r-b tags. This can be left to a follow-up
optimization.


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 21/23] maple_tree: Prefilled sheaf conversion and testing
  2025-09-29  7:30     ` Vlastimil Babka
@ 2025-09-29 16:51       ` Liam R. Howlett
  0 siblings, 0 replies; 95+ messages in thread
From: Liam R. Howlett @ 2025-09-29 16:51 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree

* Vlastimil Babka <vbabka@suse.cz> [250929 03:30]:
> On 9/27/25 03:08, Suren Baghdasaryan wrote:
> > On Wed, Sep 10, 2025 at 1:01 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> From: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> >>
> >> Use prefilled sheaves instead of bulk allocations. This should speed up
> >> the allocations and the return path of unused allocations.
> >>
> >> Remove the push and pop of nodes from the maple state as this is now
> >> handled by the slab layer with sheaves.
> >>
> >> Testing has been removed as necessary since the features of the tree
> >> have been reduced.
> >>
> >> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> >> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> > 
> > Couple nits but otherwise looks great!
> > 
> > Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> > 
> >> ---
> >>  include/linux/maple_tree.h       |   6 +-
> >>  lib/maple_tree.c                 | 326 ++++++---------------------
> >>  tools/testing/radix-tree/maple.c | 461 ++-------------------------------------
> >>  tools/testing/shared/linux.c     |   5 +-
> >>  4 files changed, 88 insertions(+), 710 deletions(-)
> >>
> >> diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
> >> index bafe143b1f783202e27b32567fffee4149e8e266..166fd67e00d882b1e6de1f80c1b590bba7497cd3 100644
> >> --- a/include/linux/maple_tree.h
> >> +++ b/include/linux/maple_tree.h
> >> @@ -442,7 +442,8 @@ struct ma_state {
> >>         struct maple_enode *node;       /* The node containing this entry */
> >>         unsigned long min;              /* The minimum index of this node - implied pivot min */
> >>         unsigned long max;              /* The maximum index of this node - implied pivot max */
> >> -       struct maple_alloc *alloc;      /* Allocated nodes for this operation */
> >> +       struct slab_sheaf *sheaf;       /* Allocated nodes for this operation */
> >> +       unsigned long node_request;
> > 
> > No comment for this poor fella?
> 
> adding: /* The number of nodes to allocate for this operation */
> 

Thanks.  That sounds better than my planned /* requested nodes */


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 00/23] SLUB percpu sheaves
  2025-09-10  8:01 [PATCH v8 00/23] SLUB percpu sheaves Vlastimil Babka
                   ` (22 preceding siblings ...)
  2025-09-10  8:01 ` [PATCH v8 23/23] maple_tree: Convert forking to use the sheaf interface Vlastimil Babka
@ 2025-10-07  6:34 ` Christoph Hellwig
  2025-10-07  8:03   ` Vlastimil Babka
  23 siblings, 1 reply; 95+ messages in thread
From: Christoph Hellwig @ 2025-10-07  6:34 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki,
	Sidhartha Kumar, linux-mm, linux-kernel, rcu, maple-tree,
	Alexei Starovoitov, Sebastian Andrzej Siewior,
	Venkat Rao Bagalkote, Qianfeng Rong, Wei Yang,
	Matthew Wilcox (Oracle),
	Andrew Morton, Lorenzo Stoakes, WangYuli, Jann Horn,
	Pedro Falcato

On Wed, Sep 10, 2025 at 10:01:02AM +0200, Vlastimil Babka wrote:
> Hi,
> 
> I'm sending full v8 due to more changes in the middle of the series that
> resulted in later patches being fixed up due to conflicts (details in
> the changelog below).
> The v8 will replace+extend the v7 in slab/for-next.

So I've been reading through this and wonder how the preallocation
using sheaves is to be used.  Do you have example code for that
somewhere?



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 00/23] SLUB percpu sheaves
  2025-10-07  6:34 ` [PATCH v8 00/23] SLUB percpu sheaves Christoph Hellwig
@ 2025-10-07  8:03   ` Vlastimil Babka
  2025-10-08  6:04     ` Christoph Hellwig
  0 siblings, 1 reply; 95+ messages in thread
From: Vlastimil Babka @ 2025-10-07  8:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki,
	Sidhartha Kumar, linux-mm, linux-kernel, rcu, maple-tree,
	Alexei Starovoitov, Sebastian Andrzej Siewior,
	Venkat Rao Bagalkote, Qianfeng Rong, Wei Yang,
	Matthew Wilcox (Oracle),
	Andrew Morton, Lorenzo Stoakes, WangYuli, Jann Horn,
	Pedro Falcato

On 10/7/25 08:34, Christoph Hellwig wrote:
> On Wed, Sep 10, 2025 at 10:01:02AM +0200, Vlastimil Babka wrote:
>> Hi,
>> 
>> I'm sending full v8 due to more changes in the middle of the series that
>> resulted in later patches being fixed up due to conflicts (details in
>> the changelog below).
>> The v8 will replace+extend the v7 in slab/for-next.
> 
> So I've been reading through this and wonder how the preallocation
> using shaves is to be used.  Do you have example code for that
> somewhere?

The maple tree uses this, but it goes through its internal wrapper layers so
it's not the clearest example.

Basically it's for situations where you have an upper bound on the objects
you might need to allocate in some restricted context where you can't fail
but also can't reclaim etc. The steps are:

- kmem_cache_create() with kmem_cache_args.sheaf_capacity set high enough for
any reasonable upper bound you might need (but also at least e.g. 32 to get
the general performance benefit of sheaves; this will be autotuned in the
follow-up work). If there's a possibility you might need more than this
capacity in some rare cases, that's fine, it will just be slower when it
happens.

- kmem_cache_prefill_sheaf() with size being your upper bound for the
following critical section operation; if it succeeds you may get more than
that, but not less

(enter critical section)

- kmem_cache_alloc_from_sheaf(gfp, sheaf) - this is guaranteed to succeed as
many times as the size you requested in your prefill. gfp is there only for
__GFP_ZERO or __GFP_ACCOUNT, where the latter will breach the memcg limit
instead of failing

- kmem_cache_sheaf_size() tells you how much is left in the prefilled sheaf

(exit critical section)

- kmem_cache_return_sheaf() will return the sheaf with unused objects

later, freeing the objects allocated via the prefilled sheaf is done normally

- alternatively kmem_cache_refill_sheaf() for another round

The whole point, compared to preallocation via bulk alloc/bulk free, is that
if you have a large upper bound for some corner case but typically need only
a few objects, this way will be more effective on average.

If you preallocate and know you will use everything, you can keep using
kmem_cache_alloc_bulk() and this won't give you extra benefits.
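
To make the above concrete, here is a rough sketch of the pattern. The cache,
object type and the MY_MAX_OBJS bound are made up for illustration, and the
exact parameter lists should be double-checked against the declarations this
series adds to include/linux/slab.h:

	/* example cache with sheaves, created e.g. at init time */
	struct kmem_cache_args args = {
		.sheaf_capacity = 32,	/* covers the typical upper bound */
	};

	my_obj_cache = kmem_cache_create("my_obj", sizeof(struct my_obj),
					 &args, 0);

	/* outside the restricted context: reserve up to MY_MAX_OBJS objects */
	struct slab_sheaf *sheaf;

	sheaf = kmem_cache_prefill_sheaf(my_obj_cache, GFP_KERNEL, MY_MAX_OBJS);
	if (!sheaf)
		return -ENOMEM;

	/* restricted section: guaranteed to succeed up to MY_MAX_OBJS times */
	obj = kmem_cache_alloc_from_sheaf(my_obj_cache, GFP_NOWAIT, sheaf);
	/* ... kmem_cache_sheaf_size(sheaf) says how many are still left ... */

	/* back outside: return the sheaf together with any unused objects */
	kmem_cache_return_sheaf(my_obj_cache, GFP_KERNEL, sheaf);

	/* objects that were taken from the sheaf are freed normally later */
	kmem_cache_free(my_obj_cache, obj);

(kmem_cache_refill_sheaf() would replace the return step when another round
of guaranteed allocations is needed, as mentioned above.)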








^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 00/23] SLUB percpu sheaves
  2025-10-07  8:03   ` Vlastimil Babka
@ 2025-10-08  6:04     ` Christoph Hellwig
  2025-10-15  8:32       ` Vlastimil Babka
  0 siblings, 1 reply; 95+ messages in thread
From: Christoph Hellwig @ 2025-10-08  6:04 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Christoph Hellwig, Suren Baghdasaryan, Liam R. Howlett,
	Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
	Uladzislau Rezki, Sidhartha Kumar, linux-mm, linux-kernel, rcu,
	maple-tree, Alexei Starovoitov, Sebastian Andrzej Siewior,
	Venkat Rao Bagalkote, Qianfeng Rong, Wei Yang,
	Matthew Wilcox (Oracle),
	Andrew Morton, Lorenzo Stoakes, WangYuli, Jann Horn,
	Pedro Falcato

On Tue, Oct 07, 2025 at 10:03:04AM +0200, Vlastimil Babka wrote:
> Basically it's for situations where you have an upper bound on the objects
> you might need to allocate in some restricted context where you can't fail
> but also can't reclaim etc. The steps are:

Ok, so you still need a step where you reserve, which can fail, and only
after that are you guaranteed to be able to allocate up to the reservation?
I.e.
not a replacement for mempools?  Just asking because I recently had
to implement a mempool_alloc_batch to allow grabbing multiple objects
out of a mempool safely for something I'm working on.



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 00/23] SLUB percpu sheaves
  2025-10-08  6:04     ` Christoph Hellwig
@ 2025-10-15  8:32       ` Vlastimil Babka
  2025-10-22  6:47         ` Christoph Hellwig
  0 siblings, 1 reply; 95+ messages in thread
From: Vlastimil Babka @ 2025-10-15  8:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki,
	Sidhartha Kumar, linux-mm, linux-kernel, rcu, maple-tree,
	Alexei Starovoitov, Sebastian Andrzej Siewior,
	Venkat Rao Bagalkote, Qianfeng Rong, Wei Yang,
	Matthew Wilcox (Oracle),
	Andrew Morton, Lorenzo Stoakes, WangYuli, Jann Horn,
	Pedro Falcato

On 10/8/25 08:04, Christoph Hellwig wrote:
> On Tue, Oct 07, 2025 at 10:03:04AM +0200, Vlastimil Babka wrote:
>> Basically it's for situations where you have an upper bound on the objects
>> you might need to allocate in some restricted context where you can't fail
>> but also can't reclaim etc. The steps are:
> 
> Ok, so you still need a step where you reserve, which can fail and
> only after that guarantee you can allocate up to the reservation?  I.e.
> not a replacement for mempools?  Just asking because I recently had

Yeah, not a replacement for mempools which have their special semantics.

> to implement a mempool_alloc_batch to allow grabbing multiple objects
> out of a mempool safely for something I'm working on.

I can imagine allocating multiple objects can be difficult to achieve with
the mempool's guaranteed progress semantics. Maybe the mempool could serve
prefilled sheaves?


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 15/23] maple_tree: use percpu sheaves for maple_node_cache
  2025-09-10  8:01 ` [PATCH v8 15/23] maple_tree: use percpu sheaves for maple_node_cache Vlastimil Babka
  2025-09-12  2:20   ` Liam R. Howlett
@ 2025-10-16 15:16   ` D, Suneeth
  2025-10-16 16:15     ` Vlastimil Babka
  1 sibling, 1 reply; 95+ messages in thread
From: D, Suneeth @ 2025-10-16 15:16 UTC (permalink / raw)
  To: Vlastimil Babka, Suren Baghdasaryan, Liam R. Howlett,
	Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree

Hi Vlastimil Babka,

On 9/10/2025 1:31 PM, Vlastimil Babka wrote:
> Setup the maple_node_cache with percpu sheaves of size 32 to hopefully
> improve its performance. Note this will not immediately take advantage
> of sheaf batching of kfree_rcu() operations due to the maple tree using
> call_rcu with custom callbacks. The followup changes to maple tree will
> change that and also make use of the prefilled sheaves functionality.
> 


We run the will-it-scale-process-mmap2 micro-benchmark as part of our
weekly CI for kernel performance regression testing between a stable and
an rc kernel. In this week's run we observed a severe regression on AMD
platforms (Turin and Bergamo) when running the micro-benchmark between
kernels v6.17 and v6.18-rc1, in the range of 12-13% (Turin) and 22-26%
(Bergamo). Bisecting further landed me on this commit
(59faa4da7cd4565cbce25358495556b75bb37022) as the first bad commit. The
following are the machine configurations and test parameters used:

Model name:           AMD EPYC 128-Core Processor [Bergamo]
Thread(s) per core:   2
Core(s) per socket:   128
Socket(s):            1
Total online memory:  258G

Model name:           AMD EPYC 64-Core Processor [Turin]
Thread(s) per core:   2
Core(s) per socket:   64
Socket(s):            1
Total online memory:  258G

Test params:

     nr_task: [1 8 64 128 192 256]
     mode: process
     test: mmap2
     kpi: per_process_ops
     cpufreq_governor: performance

The following are the stats after bisection:-
(the KPI used here is per_process_ops)

kernel_versions                                           per_process_ops
---------------                                           ---------------
v6.17.0                                                   - 258291
v6.18.0-rc1                                               - 225839
v6.17.0-rc3-59faa4da7                                     - 212152
v6.17.0-rc3-3accabda4da1 (one commit before bad commit)   - 265054

Recreation steps:

1) git clone https://github.com/antonblanchard/will-it-scale.git
2) git clone https://github.com/intel/lkp-tests.git
3) cd will-it-scale && git apply 
lkp-tests/programs/will-it-scale/pkg/will-it-scale.patch
4) make
5) python3 runtest.py mmap2 25 process 0 0 1 8 64 128 192 256

NOTE: step [5] is specific to the machine's architecture. The numbers
starting from 1 are the array of task counts you wish to run the test
case with; here they correspond to the number of cores per CCX, per NUMA
node / per socket, and nr_threads.

I also ran the micro-benchmark with tools/testing/perf record and the
following is the collected data:

# perf diff perf.data.old perf.data
No kallsyms or vmlinux with build-id 0fc9c7b62ade1502af5d6a060914732523f367ef was found
Warning:
43 out of order events recorded.
Warning:
54 out of order events recorded.
# Event 'cycles:P'
#
# Baseline  Delta Abs  Shared Object           Symbol
# ........  .........  ......................  ..............................................
#
               +51.51%  [kernel.kallsyms]       [k] native_queued_spin_lock_slowpath
               +14.39%  [kernel.kallsyms]       [k] perf_iterate_ctx
                +2.52%  [kernel.kallsyms]       [k] unmap_page_range
                +1.75%  [kernel.kallsyms]       [k] mas_wr_node_store
                +1.47%  [kernel.kallsyms]       [k] __pi_memset
                +1.38%  [kernel.kallsyms]       [k] mt_free_rcu
                +1.36%  [kernel.kallsyms]       [k] free_pgd_range
                +1.10%  [kernel.kallsyms]       [k] __pi_memcpy
                +0.96%  [kernel.kallsyms]       [k] __kmem_cache_alloc_bulk
                +0.92%  [kernel.kallsyms]       [k] __mmap_region
                +0.79%  [kernel.kallsyms]       [k] mas_empty_area_rev
                +0.74%  [kernel.kallsyms]       [k] __cond_resched
                +0.73%  [kernel.kallsyms]       [k] mas_walk
                +0.59%  [kernel.kallsyms]       [k] mas_pop_node
                +0.57%  [kernel.kallsyms]       [k] perf_event_mmap_output
                +0.49%  [kernel.kallsyms]       [k] mas_find
                +0.48%  [kernel.kallsyms]       [k] mas_next_slot
                +0.46%  [kernel.kallsyms]       [k] kmem_cache_free
                +0.42%  [kernel.kallsyms]       [k] mas_leaf_max_gap
                +0.42%  [kernel.kallsyms]       [k] __call_rcu_common.constprop.0
                +0.39%  [kernel.kallsyms]       [k] entry_SYSCALL_64
                +0.38%  [kernel.kallsyms]       [k] mas_prev_slot
                +0.38%  [kernel.kallsyms]       [k] kmem_cache_alloc_noprof
                +0.37%  [kernel.kallsyms]       [k] mas_store_gfp


> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>   lib/maple_tree.c | 9 +++++++--
>   1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index 4f0e30b57b0cef9e5cf791f3f64f5898752db402..d034f170ac897341b40cfd050b6aee86b6d2cf60 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -6040,9 +6040,14 @@ bool mas_nomem(struct ma_state *mas, gfp_t gfp)
>   
>   void __init maple_tree_init(void)
>   {
> +	struct kmem_cache_args args = {
> +		.align  = sizeof(struct maple_node),
> +		.sheaf_capacity = 32,
> +	};
> +
>   	maple_node_cache = kmem_cache_create("maple_node",
> -			sizeof(struct maple_node), sizeof(struct maple_node),
> -			SLAB_PANIC, NULL);
> +			sizeof(struct maple_node), &args,
> +			SLAB_PANIC);
>   }
>   
>   /**
> 

---
Thanks and Regards
Suneeth D



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 15/23] maple_tree: use percpu sheaves for maple_node_cache
  2025-10-16 15:16   ` D, Suneeth
@ 2025-10-16 16:15     ` Vlastimil Babka
  2025-10-17 18:26       ` D, Suneeth
  0 siblings, 1 reply; 95+ messages in thread
From: Vlastimil Babka @ 2025-10-16 16:15 UTC (permalink / raw)
  To: D, Suneeth, Suren Baghdasaryan, Liam R. Howlett,
	Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree

On 10/16/25 17:16, D, Suneeth wrote:
> Hi Vlastimil Babka,
> 
> On 9/10/2025 1:31 PM, Vlastimil Babka wrote:
>> Setup the maple_node_cache with percpu sheaves of size 32 to hopefully
>> improve its performance. Note this will not immediately take advantage
>> of sheaf batching of kfree_rcu() operations due to the maple tree using
>> call_rcu with custom callbacks. The followup changes to maple tree will
>> change that and also make use of the prefilled sheaves functionality.
>> 
> 
> 
> We run the will-it-scale-process-mmap2 micro-benchmark as part of our
> weekly CI for kernel performance regression testing between a stable and
> an rc kernel. In this week's run we observed a severe regression on AMD
> platforms (Turin and Bergamo) when running the micro-benchmark between
> kernels v6.17 and v6.18-rc1, in the range of 12-13% (Turin) and 22-26%
> (Bergamo). Bisecting further landed me on this commit
> (59faa4da7cd4565cbce25358495556b75bb37022) as the first bad commit. The
> following are the machine configurations and test parameters used:
> 
> Model name:           AMD EPYC 128-Core Processor [Bergamo]
> Thread(s) per core:   2
> Core(s) per socket:   128
> Socket(s):            1
> Total online memory:  258G
> 
> Model name:           AMD EPYC 64-Core Processor [Turin]
> Thread(s) per core:   2
> Core(s) per socket:   64
> Socket(s):            1
> Total online memory:  258G
> 
> Test params:
> 
>      nr_task: [1 8 64 128 192 256]
>      mode: process
>      test: mmap2
>      kpi: per_process_ops
>      cpufreq_governor: performance
> 
> The following are the stats after bisection:-
> (the KPI used here is per_process_ops)
> 
> kernel_versions      					 per_process_ops
> ---------------      					 ---------------
> v6.17.0 	                                       - 258291
> v6.18.0-rc1 	                                       - 225839
> v6.17.0-rc3-59faa4da7                                  - 212152
> v6.17.0-rc3-3accabda4da1(one commit before bad commit) - 265054

Thanks for the info. Is there any difference if you increase the
sheaf_capacity in the commit from 32 to a higher value? For example 120 to
match what the automatically calculated cpu partial slabs target would be.
I think there's a lock contention on the barn lock causing the regression.
By matching the cpu partial slabs value we should have same batching factor
for the barn lock as there would be on the node list_lock before sheaves.
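
To be concrete, the experiment I have in mind is only changing the capacity
value in the hunk quoted below, e.g.:

	struct kmem_cache_args args = {
		.align  = sizeof(struct maple_node),
		.sheaf_capacity = 120,	/* was 32; roughly the cpu partial slabs target */
	};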

Thanks.

> Recreation steps:
> 
> 1) git clone https://github.com/antonblanchard/will-it-scale.git
> 2) git clone https://github.com/intel/lkp-tests.git
> 3) cd will-it-scale && git apply 
> lkp-tests/programs/will-it-scale/pkg/will-it-scale.patch
> 4) make
> 5) python3 runtest.py mmap2 25 process 0 0 1 8 64 128 192 256
> 
> NOTE: [5] is specific to machine's architecture. starting from 1 is the 
> array of no.of tasks that you'd wish to run the testcase which here is 
> no.cores per CCX, per NUMA node/ per Socket, nr_threads.
> 
> I also ran the micro-benchmark with tools/testing/perf record and 
> following is the collected data:-
> 
> # perf diff perf.data.old perf.data
> No kallsyms or vmlinux with build-id 
> 0fc9c7b62ade1502af5d6a060914732523f367ef was found
> Warning:
> 43 out of order events recorded.
> Warning:
> 54 out of order events recorded.
> # Event 'cycles:P'
> #
> # Baseline  Delta Abs  Shared Object           Symbol
> # ........  .........  ...................... 
> ..............................................
> #
>                +51.51%  [kernel.kallsyms]       [k] 
> native_queued_spin_lock_slowpath
>                +14.39%  [kernel.kallsyms]       [k] perf_iterate_ctx
>                 +2.52%  [kernel.kallsyms]       [k] unmap_page_range
>                 +1.75%  [kernel.kallsyms]       [k] mas_wr_node_store
>                 +1.47%  [kernel.kallsyms]       [k] __pi_memset
>                 +1.38%  [kernel.kallsyms]       [k] mt_free_rcu
>                 +1.36%  [kernel.kallsyms]       [k] free_pgd_range
>                 +1.10%  [kernel.kallsyms]       [k] __pi_memcpy
>                 +0.96%  [kernel.kallsyms]       [k] __kmem_cache_alloc_bulk
>                 +0.92%  [kernel.kallsyms]       [k] __mmap_region
>                 +0.79%  [kernel.kallsyms]       [k] mas_empty_area_rev
>                 +0.74%  [kernel.kallsyms]       [k] __cond_resched
>                 +0.73%  [kernel.kallsyms]       [k] mas_walk
>                 +0.59%  [kernel.kallsyms]       [k] mas_pop_node
>                 +0.57%  [kernel.kallsyms]       [k] perf_event_mmap_output
>                 +0.49%  [kernel.kallsyms]       [k] mas_find
>                 +0.48%  [kernel.kallsyms]       [k] mas_next_slot
>                 +0.46%  [kernel.kallsyms]       [k] kmem_cache_free
>                 +0.42%  [kernel.kallsyms]       [k] mas_leaf_max_gap
>                 +0.42%  [kernel.kallsyms]       [k] 
> __call_rcu_common.constprop.0
>                 +0.39%  [kernel.kallsyms]       [k] entry_SYSCALL_64
>                 +0.38%  [kernel.kallsyms]       [k] mas_prev_slot
>                 +0.38%  [kernel.kallsyms]       [k] kmem_cache_alloc_noprof
>                 +0.37%  [kernel.kallsyms]       [k] mas_store_gfp
> 
> 
>> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
>> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
>>   lib/maple_tree.c | 9 +++++++--
>>   1 file changed, 7 insertions(+), 2 deletions(-)
>> 
>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>> index 4f0e30b57b0cef9e5cf791f3f64f5898752db402..d034f170ac897341b40cfd050b6aee86b6d2cf60 100644
>> --- a/lib/maple_tree.c
>> +++ b/lib/maple_tree.c
>> @@ -6040,9 +6040,14 @@ bool mas_nomem(struct ma_state *mas, gfp_t gfp)
>>   
>>   void __init maple_tree_init(void)
>>   {
>> +	struct kmem_cache_args args = {
>> +		.align  = sizeof(struct maple_node),
>> +		.sheaf_capacity = 32,
>> +	};
>> +
>>   	maple_node_cache = kmem_cache_create("maple_node",
>> -			sizeof(struct maple_node), sizeof(struct maple_node),
>> -			SLAB_PANIC, NULL);
>> +			sizeof(struct maple_node), &args,
>> +			SLAB_PANIC);
>>   }
>>   
>>   /**
>> 
> 
> ---
> Thanks and Regards
> Suneeth D
> 



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 15/23] maple_tree: use percpu sheaves for maple_node_cache
  2025-10-16 16:15     ` Vlastimil Babka
@ 2025-10-17 18:26       ` D, Suneeth
  0 siblings, 0 replies; 95+ messages in thread
From: D, Suneeth @ 2025-10-17 18:26 UTC (permalink / raw)
  To: Vlastimil Babka, Suren Baghdasaryan, Liam R. Howlett,
	Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree

Hi Vlastimil Babka,

On 10/16/2025 9:45 PM, Vlastimil Babka wrote:
> On 10/16/25 17:16, D, Suneeth wrote:
>> Hi Vlastimil Babka,
>>
>> On 9/10/2025 1:31 PM, Vlastimil Babka wrote:
>>> Setup the maple_node_cache with percpu sheaves of size 32 to hopefully
>>> improve its performance. Note this will not immediately take advantage
>>> of sheaf batching of kfree_rcu() operations due to the maple tree using
>>> call_rcu with custom callbacks. The followup changes to maple tree will
>>> change that and also make use of the prefilled sheaves functionality.
>>>
>>
>>
>> We run will-it-scale-process-mmap2 micro-benchmark as part of our weekly
>> CI for Kernel Performance Regression testing between a stable vs rc
>> kernel. In this week's run we were able to observe severe regression on
>> AMD platforms (Turin and Bergamo) with running the micro-benchmark
>> between the kernels v6.17 and v6.18-rc1 in the range of 12-13% (Turin)
>> and 22-26% (Bergamo). Bisecting further landed me onto this commit
>> (59faa4da7cd4565cbce25358495556b75bb37022) as first bad commit. The
>> following were the machines' configuration and test parameters used:-
>>
>> Model name:           AMD EPYC 128-Core Processor [Bergamo]
>> Thread(s) per core:   2
>> Core(s) per socket:   128
>> Socket(s):            1
>> Total online memory:  258G
>>
>> Model name:           AMD EPYC 64-Core Processor [Turin]
>> Thread(s) per core:   2
>> Core(s) per socket:   64
>> Socket(s):            1
>> Total online memory:  258G
>>
>> Test params:
>>
>>       nr_task: [1 8 64 128 192 256]
>>       mode: process
>>       test: mmap2
>>       kpi: per_process_ops
>>       cpufreq_governor: performance
>>
>> The following are the stats after bisection:-
>> (the KPI used here is per_process_ops)
>>
>> kernel_versions      					 per_process_ops
>> ---------------      					 ---------------
>> v6.17.0 	                                       - 258291
>> v6.18.0-rc1 	                                       - 225839
>> v6.17.0-rc3-59faa4da7                                  - 212152
>> v6.17.0-rc3-3accabda4da1(one commit before bad commit) - 265054
> 
> Thanks for the info. Is there any difference if you increase the
> sheaf_capacity in the commit from 32 to a higher value? For example 120 to
> match what the automatically calculated cpu partial slabs target would be.
> I think there's a lock contention on the barn lock causing the regression.
> By matching the cpu partial slabs value we should have same batching factor
> for the barn lock as there would be on the node list_lock before sheaves.
> 
> Thanks.
> 

I tried changing the sheaf_capacity from 32 to 120 and tested it. The 
numbers improve by around 28% w.r.t. the v6.17 baseline with the 
will-it-scale-mmap2-process testcase.

v6.17.0(w/o sheaf) %diff v6.18-rc1(sheaf=32)  %diff v6.18-rc1(sheaf=120)
------------------ ----- -------------------  ----- --------------------
260222              -13   225839               +28   334079

Thanks.

>> Recreation steps:
>>
>> 1) git clone https://github.com/antonblanchard/will-it-scale.git
>> 2) git clone https://github.com/intel/lkp-tests.git
>> 3) cd will-it-scale && git apply
>> lkp-tests/programs/will-it-scale/pkg/will-it-scale.patch
>> 4) make
>> 5) python3 runtest.py mmap2 25 process 0 0 1 8 64 128 192 256
>>
>> NOTE: [5] is specific to machine's architecture. starting from 1 is the
>> array of no.of tasks that you'd wish to run the testcase which here is
>> no.cores per CCX, per NUMA node/ per Socket, nr_threads.
>>
>> I also ran the micro-benchmark with tools/testing/perf record and
>> following is the collected data:-
>>
>> # perf diff perf.data.old perf.data
>> No kallsyms or vmlinux with build-id
>> 0fc9c7b62ade1502af5d6a060914732523f367ef was found
>> Warning:
>> 43 out of order events recorded.
>> Warning:
>> 54 out of order events recorded.
>> # Event 'cycles:P'
>> #
>> # Baseline  Delta Abs  Shared Object           Symbol
>> # ........  .........  ......................
>> ..............................................
>> #
>>                 +51.51%  [kernel.kallsyms]       [k]
>> native_queued_spin_lock_slowpath
>>                 +14.39%  [kernel.kallsyms]       [k] perf_iterate_ctx
>>                  +2.52%  [kernel.kallsyms]       [k] unmap_page_range
>>                  +1.75%  [kernel.kallsyms]       [k] mas_wr_node_store
>>                  +1.47%  [kernel.kallsyms]       [k] __pi_memset
>>                  +1.38%  [kernel.kallsyms]       [k] mt_free_rcu
>>                  +1.36%  [kernel.kallsyms]       [k] free_pgd_range
>>                  +1.10%  [kernel.kallsyms]       [k] __pi_memcpy
>>                  +0.96%  [kernel.kallsyms]       [k] __kmem_cache_alloc_bulk
>>                  +0.92%  [kernel.kallsyms]       [k] __mmap_region
>>                  +0.79%  [kernel.kallsyms]       [k] mas_empty_area_rev
>>                  +0.74%  [kernel.kallsyms]       [k] __cond_resched
>>                  +0.73%  [kernel.kallsyms]       [k] mas_walk
>>                  +0.59%  [kernel.kallsyms]       [k] mas_pop_node
>>                  +0.57%  [kernel.kallsyms]       [k] perf_event_mmap_output
>>                  +0.49%  [kernel.kallsyms]       [k] mas_find
>>                  +0.48%  [kernel.kallsyms]       [k] mas_next_slot
>>                  +0.46%  [kernel.kallsyms]       [k] kmem_cache_free
>>                  +0.42%  [kernel.kallsyms]       [k] mas_leaf_max_gap
>>                  +0.42%  [kernel.kallsyms]       [k]
>> __call_rcu_common.constprop.0
>>                  +0.39%  [kernel.kallsyms]       [k] entry_SYSCALL_64
>>                  +0.38%  [kernel.kallsyms]       [k] mas_prev_slot
>>                  +0.38%  [kernel.kallsyms]       [k] kmem_cache_alloc_noprof
>>                  +0.37%  [kernel.kallsyms]       [k] mas_store_gfp
>>
>>
>>> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
>>> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>>> ---
>>>    lib/maple_tree.c | 9 +++++++--
>>>    1 file changed, 7 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>>> index 4f0e30b57b0cef9e5cf791f3f64f5898752db402..d034f170ac897341b40cfd050b6aee86b6d2cf60 100644
>>> --- a/lib/maple_tree.c
>>> +++ b/lib/maple_tree.c
>>> @@ -6040,9 +6040,14 @@ bool mas_nomem(struct ma_state *mas, gfp_t gfp)
>>>    
>>>    void __init maple_tree_init(void)
>>>    {
>>> +	struct kmem_cache_args args = {
>>> +		.align  = sizeof(struct maple_node),
>>> +		.sheaf_capacity = 32,
>>> +	};
>>> +
>>>    	maple_node_cache = kmem_cache_create("maple_node",
>>> -			sizeof(struct maple_node), sizeof(struct maple_node),
>>> -			SLAB_PANIC, NULL);
>>> +			sizeof(struct maple_node), &args,
>>> +			SLAB_PANIC);
>>>    }
>>>    
>>>    /**
>>>
>>
>> ---
>> Thanks and Regards
>> Suneeth D
>>
> 


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 00/23] SLUB percpu sheaves
  2025-10-15  8:32       ` Vlastimil Babka
@ 2025-10-22  6:47         ` Christoph Hellwig
  0 siblings, 0 replies; 95+ messages in thread
From: Christoph Hellwig @ 2025-10-22  6:47 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Christoph Hellwig, Suren Baghdasaryan, Liam R. Howlett,
	Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
	Uladzislau Rezki, Sidhartha Kumar, linux-mm, linux-kernel, rcu,
	maple-tree, Alexei Starovoitov, Sebastian Andrzej Siewior,
	Venkat Rao Bagalkote, Qianfeng Rong, Wei Yang,
	Matthew Wilcox (Oracle),
	Andrew Morton, Lorenzo Stoakes, WangYuli, Jann Horn,
	Pedro Falcato

On Wed, Oct 15, 2025 at 10:32:44AM +0200, Vlastimil Babka wrote:
> Yeah, not a replacement for mempools which have their special semantics.
> 
> > to implement a mempool_alloc_batch to allow grabbing multiple objects
> > out of a mempool safely for something I'm working on.
> 
> I can imagine allocating multiple objects can be difficult to achieve with
> the mempool's guaranteed progress semantics. Maybe the mempool could serve
> prefilled sheaves?

It doesn't look too bad, but I'd be happy for even better versions.
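
Caller-side usage of the API below would be roughly the following sketch
(not from the patch). Objects that mempool_free_bulk() does not absorb back
into the pool are left to the caller - via the pool's free callback as done
here, or e.g. via release_pages() in one batch for a page-backed pool:

	void *elems[4] = { };	/* NULL entries mean "allocate this slot" */
	unsigned int freed, i;

	if (mempool_alloc_bulk(pool, elems, ARRAY_SIZE(elems), GFP_NOIO) < 0)
		return -ENOMEM;	/* only reachable without __GFP_DIRECT_RECLAIM */

	/* ... use the objects ... */

	freed = mempool_free_bulk(pool, elems, ARRAY_SIZE(elems));
	for (i = freed; i < ARRAY_SIZE(elems); i++)
		pool->free(elems[i], pool->pool_data);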

This is what I have:

---
From 9d25a3ce6cff11b7853381921c53a51e51f27689 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Mon, 8 Sep 2025 18:22:12 +0200
Subject: mempool: add mempool_{alloc,free}_bulk

Add a version of the mempool allocator that works for batch allocations
of multiple objects.  Calling mempool_alloc in a loop is not safe
because it could deadlock if multiple threads are attempting such an
allocation at the same time.

As an extra benefit, the interface is built so that the same array
can be used for alloc_pages_bulk / release_pages so that at least
for page backed mempools the fast path can use a nice batch optimization.

Still WIP, this needs proper documentation, and mempool also seems to
lack error injection to make it easy to test the pool code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 include/linux/mempool.h |   6 ++
 mm/mempool.c            | 131 ++++++++++++++++++++++++----------------
 2 files changed, 86 insertions(+), 51 deletions(-)

diff --git a/include/linux/mempool.h b/include/linux/mempool.h
index 34941a4b9026..59f14e94596f 100644
--- a/include/linux/mempool.h
+++ b/include/linux/mempool.h
@@ -66,9 +66,15 @@ extern void mempool_destroy(mempool_t *pool);
 extern void *mempool_alloc_noprof(mempool_t *pool, gfp_t gfp_mask) __malloc;
 #define mempool_alloc(...)						\
 	alloc_hooks(mempool_alloc_noprof(__VA_ARGS__))
+int mempool_alloc_bulk_noprof(mempool_t *pool, void **elem,
+		unsigned int count, gfp_t gfp_mask);
+#define mempool_alloc_bulk(...)						\
+	alloc_hooks(mempool_alloc_bulk_noprof(__VA_ARGS__))
 
 extern void *mempool_alloc_preallocated(mempool_t *pool) __malloc;
 extern void mempool_free(void *element, mempool_t *pool);
+unsigned int mempool_free_bulk(mempool_t *pool, void **elem,
+		unsigned int count);
 
 /*
  * A mempool_alloc_t and mempool_free_t that get the memory from
diff --git a/mm/mempool.c b/mm/mempool.c
index 1c38e873e546..d8884aef2666 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -371,26 +371,13 @@ int mempool_resize(mempool_t *pool, int new_min_nr)
 }
 EXPORT_SYMBOL(mempool_resize);
 
-/**
- * mempool_alloc - allocate an element from a specific memory pool
- * @pool:      pointer to the memory pool which was allocated via
- *             mempool_create().
- * @gfp_mask:  the usual allocation bitmask.
- *
- * this function only sleeps if the alloc_fn() function sleeps or
- * returns NULL. Note that due to preallocation, this function
- * *never* fails when called from process contexts. (it might
- * fail if called from an IRQ context.)
- * Note: using __GFP_ZERO is not supported.
- *
- * Return: pointer to the allocated element or %NULL on error.
- */
-void *mempool_alloc_noprof(mempool_t *pool, gfp_t gfp_mask)
+int mempool_alloc_bulk_noprof(mempool_t *pool, void **elem,
+		unsigned int count, gfp_t gfp_mask)
 {
-	void *element;
 	unsigned long flags;
 	wait_queue_entry_t wait;
 	gfp_t gfp_temp;
+	unsigned int i;
 
 	VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
 	might_alloc(gfp_mask);
@@ -401,15 +388,24 @@ void *mempool_alloc_noprof(mempool_t *pool, gfp_t gfp_mask)
 
 	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
 
+	i = 0;
 repeat_alloc:
+	for (; i < count; i++) {
+		if (!elem[i])
+			elem[i] = pool->alloc(gfp_temp, pool->pool_data);
+		if (unlikely(!elem[i]))
+			goto use_pool;
+	}
 
-	element = pool->alloc(gfp_temp, pool->pool_data);
-	if (likely(element != NULL))
-		return element;
+	return 0;
 
+use_pool:
 	spin_lock_irqsave(&pool->lock, flags);
-	if (likely(pool->curr_nr)) {
-		element = remove_element(pool);
+	if (likely(pool->curr_nr >= count - i)) {
+		for (; i < count; i++) {
+			if (!elem[i])
+				elem[i] = remove_element(pool);
+		}
 		spin_unlock_irqrestore(&pool->lock, flags);
 		/* paired with rmb in mempool_free(), read comment there */
 		smp_wmb();
@@ -417,8 +413,9 @@ void *mempool_alloc_noprof(mempool_t *pool, gfp_t gfp_mask)
 		 * Update the allocation stack trace as this is more useful
 		 * for debugging.
 		 */
-		kmemleak_update_trace(element);
-		return element;
+		for (i = 0; i < count; i++)
+			kmemleak_update_trace(elem[i]);
+		return 0;
 	}
 
 	/*
@@ -434,10 +431,12 @@ void *mempool_alloc_noprof(mempool_t *pool, gfp_t gfp_mask)
 	/* We must not sleep if !__GFP_DIRECT_RECLAIM */
 	if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
 		spin_unlock_irqrestore(&pool->lock, flags);
-		return NULL;
+		if (i > 0)
+			mempool_free_bulk(pool, elem + i, count - i);
+		return -ENOMEM;
 	}
 
-	/* Let's wait for someone else to return an element to @pool */
+	/* Let's wait for someone else to return elements to @pool */
 	init_wait(&wait);
 	prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
 
@@ -452,6 +451,30 @@ void *mempool_alloc_noprof(mempool_t *pool, gfp_t gfp_mask)
 	finish_wait(&pool->wait, &wait);
 	goto repeat_alloc;
 }
+EXPORT_SYMBOL_GPL(mempool_alloc_bulk_noprof);
+
+/**
+ * mempool_alloc - allocate an element from a specific memory pool
+ * @pool:      pointer to the memory pool which was allocated via
+ *             mempool_create().
+ * @gfp_mask:  the usual allocation bitmask.
+ *
+ * this function only sleeps if the alloc_fn() function sleeps or
+ * returns NULL. Note that due to preallocation, this function
+ * *never* fails when called from process contexts. (it might
+ * fail if called from an IRQ context.)
+ * Note: using __GFP_ZERO is not supported.
+ *
+ * Return: pointer to the allocated element or %NULL on error.
+ */
+void *mempool_alloc_noprof(mempool_t *pool, gfp_t gfp_mask)
+{
+	void *elem[1] = { };
+
+	if (mempool_alloc_bulk_noprof(pool, elem, 1, gfp_mask) < 0)
+		return NULL;
+	return elem[0];
+}
 EXPORT_SYMBOL(mempool_alloc_noprof);
 
 /**
@@ -491,20 +514,11 @@ void *mempool_alloc_preallocated(mempool_t *pool)
 }
 EXPORT_SYMBOL(mempool_alloc_preallocated);
 
-/**
- * mempool_free - return an element to the pool.
- * @element:   pool element pointer.
- * @pool:      pointer to the memory pool which was allocated via
- *             mempool_create().
- *
- * this function only sleeps if the free_fn() function sleeps.
- */
-void mempool_free(void *element, mempool_t *pool)
+unsigned int mempool_free_bulk(mempool_t *pool, void **elem, unsigned int count)
 {
 	unsigned long flags;
-
-	if (unlikely(element == NULL))
-		return;
+	bool added = false;
+	unsigned int freed = 0;
 
 	/*
 	 * Paired with the wmb in mempool_alloc().  The preceding read is
@@ -541,15 +555,11 @@ void mempool_free(void *element, mempool_t *pool)
 	 */
 	if (unlikely(READ_ONCE(pool->curr_nr) < pool->min_nr)) {
 		spin_lock_irqsave(&pool->lock, flags);
-		if (likely(pool->curr_nr < pool->min_nr)) {
-			add_element(pool, element);
-			spin_unlock_irqrestore(&pool->lock, flags);
-			if (wq_has_sleeper(&pool->wait))
-				wake_up(&pool->wait);
-			return;
+		while (pool->curr_nr < pool->min_nr && freed < count) {
+			add_element(pool, elem[freed++]);
+			added = true;
 		}
 		spin_unlock_irqrestore(&pool->lock, flags);
-	}
 
 	/*
 	 * Handle the min_nr = 0 edge case:
@@ -560,20 +570,39 @@ void mempool_free(void *element, mempool_t *pool)
 	 * allocation of element when both min_nr and curr_nr are 0, and
 	 * any active waiters are properly awakened.
 	 */
-	if (unlikely(pool->min_nr == 0 &&
+	} else if (unlikely(pool->min_nr == 0 &&
 		     READ_ONCE(pool->curr_nr) == 0)) {
 		spin_lock_irqsave(&pool->lock, flags);
 		if (likely(pool->curr_nr == 0)) {
-			add_element(pool, element);
-			spin_unlock_irqrestore(&pool->lock, flags);
-			if (wq_has_sleeper(&pool->wait))
-				wake_up(&pool->wait);
-			return;
+			add_element(pool, elem[freed++]);
+			added = true;
 		}
 		spin_unlock_irqrestore(&pool->lock, flags);
 	}
 
-	pool->free(element, pool->pool_data);
+	if (unlikely(added) && wq_has_sleeper(&pool->wait))
+		wake_up(&pool->wait);
+
+	return freed;
+}
+EXPORT_SYMBOL_GPL(mempool_free_bulk);
+
+/**
+ * mempool_free - return an element to the pool.
+ * @element:   pool element pointer.
+ * @pool:      pointer to the memory pool which was allocated via
+ *             mempool_create().
+ *
+ * this function only sleeps if the free_fn() function sleeps.
+ */
+void mempool_free(void *element, mempool_t *pool)
+{
+	if (likely(element)) {
+		void *elem[1] = { element };
+
+		if (!mempool_free_bulk(pool, elem, 1))
+			pool->free(element, pool->pool_data);
+	}
 }
 EXPORT_SYMBOL(mempool_free);
 
-- 
2.47.3



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-09-10  8:01 ` [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations Vlastimil Babka
  2025-09-12  0:38   ` Sergey Senozhatsky
  2025-09-17  8:30   ` Harry Yoo
@ 2025-10-31 21:32   ` Daniel Gomez
  2025-11-03  3:17     ` Harry Yoo
  2025-11-27 11:38     ` [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations Jon Hunter
  2 siblings, 2 replies; 95+ messages in thread
From: Daniel Gomez @ 2025-10-31 21:32 UTC (permalink / raw)
  To: Vlastimil Babka, Suren Baghdasaryan, Liam R. Howlett,
	Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, linux-modules,
	Luis Chamberlain, Petr Pavlu, Sami Tolvanen, Aaron Tomlin,
	Lucas De Marchi



On 10/09/2025 10.01, Vlastimil Babka wrote:
> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
> For caches with sheaves, on each cpu maintain a rcu_free sheaf in
> addition to main and spare sheaves.
> 
> kfree_rcu() operations will try to put objects on this sheaf. Once full,
> the sheaf is detached and submitted to call_rcu() with a handler that
> will try to put it in the barn, or flush to slab pages using bulk free,
> when the barn is full. Then a new empty sheaf must be obtained to put
> more objects there.
> 
> It's possible that no free sheaves are available to use for a new
> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
> kfree_rcu() implementation.
> 
> Expected advantages:
> - batching the kfree_rcu() operations, that could eventually replace the
>   existing batching
> - sheaves can be reused for allocations via barn instead of being
>   flushed to slabs, which is more efficient
>   - this includes cases where only some cpus are allowed to process rcu
>     callbacks (Android)
> 
> Possible disadvantage:
> - objects might be waiting for more than their grace period (it is
>   determined by the last object freed into the sheaf), increasing memory
>   usage - but the existing batching does that too.
> 
> Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
> implementation favors smaller memory footprint over performance.
> 
> Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
> contexts where kfree_rcu() is called might not be compatible with taking
> a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
> spinlock - the current kfree_rcu() implementation avoids doing that.
> 
> Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
> that have them. This is not a cheap operation, but the barrier usage is
> rare - currently kmem_cache_destroy() or on module unload.
> 
> Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
> count how many kfree_rcu() used the rcu_free sheaf successfully and how
> many had to fall back to the existing implementation.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Hi Vlastimil,

This patch increases kmod selftest (stress module loader) runtime by about
~50-60%, from ~200s to ~300s total execution time. My tested kernel has
CONFIG_KVFREE_RCU_BATCHED enabled. Any idea or suggestions on what might be
causing this, or how to address it?


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-10-31 21:32   ` Daniel Gomez
@ 2025-11-03  3:17     ` Harry Yoo
  2025-11-05 11:25       ` Vlastimil Babka
  2025-11-27 11:38     ` [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations Jon Hunter
  1 sibling, 1 reply; 95+ messages in thread
From: Harry Yoo @ 2025-11-03  3:17 UTC (permalink / raw)
  To: Daniel Gomez
  Cc: Vlastimil Babka, Suren Baghdasaryan, Liam R. Howlett,
	Christoph Lameter, David Rientjes, Roman Gushchin,
	Uladzislau Rezki, Sidhartha Kumar, linux-mm, linux-kernel, rcu,
	maple-tree, linux-modules, Luis Chamberlain, Petr Pavlu,
	Sami Tolvanen, Aaron Tomlin, Lucas De Marchi

On Fri, Oct 31, 2025 at 10:32:54PM +0100, Daniel Gomez wrote:
> 
> 
> On 10/09/2025 10.01, Vlastimil Babka wrote:
> > Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
> > For caches with sheaves, on each cpu maintain a rcu_free sheaf in
> > addition to main and spare sheaves.
> > 
> > kfree_rcu() operations will try to put objects on this sheaf. Once full,
> > the sheaf is detached and submitted to call_rcu() with a handler that
> > will try to put it in the barn, or flush to slab pages using bulk free,
> > when the barn is full. Then a new empty sheaf must be obtained to put
> > more objects there.
> > 
> > It's possible that no free sheaves are available to use for a new
> > rcu_free sheaf, and the allocation in kfree_rcu() context can only use
> > GFP_NOWAIT and thus may fail. In that case, fall back to the existing
> > kfree_rcu() implementation.
> > 
> > Expected advantages:
> > - batching the kfree_rcu() operations, that could eventually replace the
> >   existing batching
> > - sheaves can be reused for allocations via barn instead of being
> >   flushed to slabs, which is more efficient
> >   - this includes cases where only some cpus are allowed to process rcu
> >     callbacks (Android)
> > 
> > Possible disadvantage:
> > - objects might be waiting for more than their grace period (it is
> >   determined by the last object freed into the sheaf), increasing memory
> >   usage - but the existing batching does that too.
> > 
> > Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
> > implementation favors smaller memory footprint over performance.
> > 
> > Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
> > contexts where kfree_rcu() is called might not be compatible with taking
> > a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
> > spinlock - the current kfree_rcu() implementation avoids doing that.
> > 
> > Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
> > that have them. This is not a cheap operation, but the barrier usage is
> > rare - currently kmem_cache_destroy() or on module unload.
> > 
> > Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
> > count how many kfree_rcu() used the rcu_free sheaf successfully and how
> > many had to fall back to the existing implementation.
> > 
> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Hi Vlastimil,
> 
> This patch increases kmod selftest (stress module loader) runtime by about
> ~50-60%, from ~200s to ~300s total execution time. My tested kernel has
> CONFIG_KVFREE_RCU_BATCHED enabled. Any idea or suggestions on what might be
> causing this, or how to address it?

This is likely due to increased kvfree_rcu_barrier() during module unload.

It currently iterates over every CPU x slab cache pair (only for caches
that enabled sheaves, so there should be only a few now) to make sure the
rcu_free sheaf is flushed by the time kvfree_rcu_barrier() returns.
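
Schematically it is something like this (simplified pseudo-C; the helper
names here are approximate, not the exact mm/slub.c code):

	struct kmem_cache *s;
	int cpu;

	mutex_lock(&slab_mutex);
	list_for_each_entry(s, &slab_caches, list) {
		if (!cache_has_sheaves(s))	/* placeholder for the real check */
			continue;
		for_each_online_cpu(cpu)
			flush_rcu_sheaf(s, cpu); /* hand the cpu's rcu_free sheaf to call_rcu() */
	}
	mutex_unlock(&slab_mutex);
	/* ...and then wait for all the queued callbacks to complete */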

Just being curious, do you have any serious workload that depends on
the performance of module unload?

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-11-03  3:17     ` Harry Yoo
@ 2025-11-05 11:25       ` Vlastimil Babka
  2025-11-27 14:00         ` Daniel Gomez
  0 siblings, 1 reply; 95+ messages in thread
From: Vlastimil Babka @ 2025-11-05 11:25 UTC (permalink / raw)
  To: Harry Yoo, Daniel Gomez, Suren Baghdasaryan
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Uladzislau Rezki, Sidhartha Kumar, linux-mm,
	linux-kernel, rcu, maple-tree, linux-modules, Luis Chamberlain,
	Petr Pavlu, Sami Tolvanen, Aaron Tomlin, Lucas De Marchi

On 11/3/25 04:17, Harry Yoo wrote:
> On Fri, Oct 31, 2025 at 10:32:54PM +0100, Daniel Gomez wrote:
>> 
>> 
>> On 10/09/2025 10.01, Vlastimil Babka wrote:
>> > Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
>> > For caches with sheaves, on each cpu maintain a rcu_free sheaf in
>> > addition to main and spare sheaves.
>> > 
>> > kfree_rcu() operations will try to put objects on this sheaf. Once full,
>> > the sheaf is detached and submitted to call_rcu() with a handler that
>> > will try to put it in the barn, or flush to slab pages using bulk free,
>> > when the barn is full. Then a new empty sheaf must be obtained to put
>> > more objects there.
>> > 
>> > It's possible that no free sheaves are available to use for a new
>> > rcu_free sheaf, and the allocation in kfree_rcu() context can only use
>> > GFP_NOWAIT and thus may fail. In that case, fall back to the existing
>> > kfree_rcu() implementation.
>> > 
>> > Expected advantages:
>> > - batching the kfree_rcu() operations, that could eventually replace the
>> >   existing batching
>> > - sheaves can be reused for allocations via barn instead of being
>> >   flushed to slabs, which is more efficient
>> >   - this includes cases where only some cpus are allowed to process rcu
>> >     callbacks (Android)
>> > 
>> > Possible disadvantage:
>> > - objects might be waiting for more than their grace period (it is
>> >   determined by the last object freed into the sheaf), increasing memory
>> >   usage - but the existing batching does that too.
>> > 
>> > Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
>> > implementation favors smaller memory footprint over performance.
>> > 
>> > Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
>> > contexts where kfree_rcu() is called might not be compatible with taking
>> > a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
>> > spinlock - the current kfree_rcu() implementation avoids doing that.
>> > 
>> > Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
>> > that have them. This is not a cheap operation, but the barrier usage is
>> > rare - currently kmem_cache_destroy() or on module unload.
>> > 
>> > Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
>> > count how many kfree_rcu() used the rcu_free sheaf successfully and how
>> > many had to fall back to the existing implementation.
>> > 
>> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> 
>> Hi Vlastimil,
>> 
>> This patch increases kmod selftest (stress module loader) runtime by about
>> ~50-60%, from ~200s to ~300s total execution time. My tested kernel has
>> CONFIG_KVFREE_RCU_BATCHED enabled. Any idea or suggestions on what might be
>> causing this, or how to address it?
> 
> This is likely due to increased kvfree_rcu_barrier() during module unload.

Hm so there are actually two possible sources of this. One is that the
module creates some kmem_cache and calls kmem_cache_destroy() on it before
unloading. That does kvfree_rcu_barrier() which iterates all caches via
flush_all_rcu_sheaves(), but in this case it shouldn't need to - we could
have a weaker form of kvfree_rcu_barrier() that only guarantees flushing of
that single cache.
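
I.e. something with an interface roughly like this (sketch only, the name is
not final), which kmem_cache_destroy() would call instead of the global
barrier:

	/*
	 * Flush the rcu_free sheaves of this cache only and wait for their
	 * call_rcu() handlers, without iterating all other caches.
	 */
	void kvfree_rcu_barrier_on_cache(struct kmem_cache *s);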

The other source is codetag_unload_module(), and I'm afraid it's this one as
it's hooked to every module unload. Do you have CONFIG_CODE_TAGGING enabled?
Disabling it should help in this case, if you don't need memory allocation
profiling for that stress test. I think there's some space for improvement -
when compiled in but memalloc profiling never enabled during the uptime,
this could probably be skipped? Suren?
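
Something along these lines, i.e. remember whether the profiling static key
was ever enabled and otherwise skip the expensive part (hypothetical sketch,
the flag is invented):

	/* set once, when memory allocation profiling is first enabled */
	static bool mem_profiling_was_enabled;

	void codetag_unload_module(struct module *mod)
	{
		/* ... existing unload work ... */

		/* no tagged objects can be in flight if it was never enabled */
		if (mem_profiling_was_enabled)
			kvfree_rcu_barrier();
	}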

> It currently iterates over all CPUs x slab caches (that enabled sheaves,
> there should be only a few now) pair to make sure rcu sheaf is flushed
> by the time kvfree_rcu_barrier() returns.

Yeah, also it's done under slab_mutex. Is the stress test trying to unload
multiple modules in parallel? That would make things worse, although I'd
expect there's a lot of serialization in this area already.

Unfortunately it will get worse with sheaves extended to all caches. We
could probably mark caches once they allocate their first rcu_free sheaf
(should not add visible overhead) and keep skipping those that never did.
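
E.g. (a sketch; the flag is invented) a write-once flag set when a cache
allocates its first rcu_free sheaf, and checked in flush_all_rcu_sheaves():

	/* hypothetical flag in struct kmem_cache, set at most once */
	bool rcu_sheaf_used;

	/* in flush_all_rcu_sheaves(): */
	list_for_each_entry(s, &slab_caches, list) {
		if (!READ_ONCE(s->rcu_sheaf_used))
			continue;	/* never freed anything via rcu sheaves */
		/* ... flush its percpu rcu_free sheaves as before ... */
	}
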
> Just being curious, do you have any serious workload that depends on
> the performance of module unload?



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-10-31 21:32   ` Daniel Gomez
  2025-11-03  3:17     ` Harry Yoo
@ 2025-11-27 11:38     ` Jon Hunter
  2025-11-27 11:50       ` Jon Hunter
                         ` (2 more replies)
  1 sibling, 3 replies; 95+ messages in thread
From: Jon Hunter @ 2025-11-27 11:38 UTC (permalink / raw)
  To: Daniel Gomez, Vlastimil Babka, Suren Baghdasaryan,
	Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, linux-modules,
	Luis Chamberlain, Petr Pavlu, Sami Tolvanen, Aaron Tomlin,
	Lucas De Marchi, linux-tegra



On 31/10/2025 21:32, Daniel Gomez wrote:
> 
> 
> On 10/09/2025 10.01, Vlastimil Babka wrote:
>> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
>> For caches with sheaves, on each cpu maintain a rcu_free sheaf in
>> addition to main and spare sheaves.
>>
>> kfree_rcu() operations will try to put objects on this sheaf. Once full,
>> the sheaf is detached and submitted to call_rcu() with a handler that
>> will try to put it in the barn, or flush to slab pages using bulk free,
>> when the barn is full. Then a new empty sheaf must be obtained to put
>> more objects there.
>>
>> It's possible that no free sheaves are available to use for a new
>> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
>> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
>> kfree_rcu() implementation.
>>
>> Expected advantages:
>> - batching the kfree_rcu() operations, that could eventually replace the
>>    existing batching
>> - sheaves can be reused for allocations via barn instead of being
>>    flushed to slabs, which is more efficient
>>    - this includes cases where only some cpus are allowed to process rcu
>>      callbacks (Android)
>>
>> Possible disadvantage:
>> - objects might be waiting for more than their grace period (it is
>>    determined by the last object freed into the sheaf), increasing memory
>>    usage - but the existing batching does that too.
>>
>> Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
>> implementation favors smaller memory footprint over performance.
>>
>> Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
>> contexts where kfree_rcu() is called might not be compatible with taking
>> a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
>> spinlock - the current kfree_rcu() implementation avoids doing that.
>>
>> Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
>> that have them. This is not a cheap operation, but the barrier usage is
>> rare - currently kmem_cache_destroy() or on module unload.
>>
>> Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
>> count how many kfree_rcu() used the rcu_free sheaf successfully and how
>> many had to fall back to the existing implementation.
>>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Hi Vlastimil,
> 
> This patch increases kmod selftest (stress module loader) runtime by about
> ~50-60%, from ~200s to ~300s total execution time. My tested kernel has
> CONFIG_KVFREE_RCU_BATCHED enabled. Any idea or suggestions on what might be
> causing this, or how to address it?
> 

I have been looking into a regression for Linux v6.18-rc where time 
taken to run some internal graphics tests on our Tegra234 device has 
increased from around 35% causing the tests to timeout. Bisect is 
pointing to this commit and I also see we have CONFIG_KVFREE_RCU_BATCHED=y.

I have not tried disabling CONFIG_KVFREE_RCU_BATCHED=y but I can. I am 
not sure if there are any downsides to disabling this?

Thanks
Jon

-- 
nvpublic



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-11-27 11:38     ` [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations Jon Hunter
@ 2025-11-27 11:50       ` Jon Hunter
  2025-11-27 12:33       ` Harry Yoo
  2025-11-27 13:18       ` Vlastimil Babka
  2 siblings, 0 replies; 95+ messages in thread
From: Jon Hunter @ 2025-11-27 11:50 UTC (permalink / raw)
  To: Daniel Gomez, Vlastimil Babka, Suren Baghdasaryan,
	Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, linux-modules,
	Luis Chamberlain, Petr Pavlu, Sami Tolvanen, Aaron Tomlin,
	Lucas De Marchi, linux-tegra


On 27/11/2025 11:38, Jon Hunter wrote:
> 
> 
> On 31/10/2025 21:32, Daniel Gomez wrote:
>>
>>
>> On 10/09/2025 10.01, Vlastimil Babka wrote:
>>> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
>>> For caches with sheaves, on each cpu maintain a rcu_free sheaf in
>>> addition to main and spare sheaves.
>>>
>>> kfree_rcu() operations will try to put objects on this sheaf. Once full,
>>> the sheaf is detached and submitted to call_rcu() with a handler that
>>> will try to put it in the barn, or flush to slab pages using bulk free,
>>> when the barn is full. Then a new empty sheaf must be obtained to put
>>> more objects there.
>>>
>>> It's possible that no free sheaves are available to use for a new
>>> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
>>> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
>>> kfree_rcu() implementation.
>>>
>>> Expected advantages:
>>> - batching the kfree_rcu() operations, that could eventually replace the
>>>    existing batching
>>> - sheaves can be reused for allocations via barn instead of being
>>>    flushed to slabs, which is more efficient
>>>    - this includes cases where only some cpus are allowed to process rcu
>>>      callbacks (Android)
>>>
>>> Possible disadvantage:
>>> - objects might be waiting for more than their grace period (it is
>>>    determined by the last object freed into the sheaf), increasing 
>>> memory
>>>    usage - but the existing batching does that too.
>>>
>>> Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
>>> implementation favors smaller memory footprint over performance.
>>>
>>> Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
>>> contexts where kfree_rcu() is called might not be compatible with taking
>>> a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
>>> spinlock - the current kfree_rcu() implementation avoids doing that.
>>>
>>> Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
>>> that have them. This is not a cheap operation, but the barrier usage is
>>> rare - currently kmem_cache_destroy() or on module unload.
>>>
>>> Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
>>> count how many kfree_rcu() used the rcu_free sheaf successfully and how
>>> many had to fall back to the existing implementation.
>>>
>>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>>
>> Hi Vlastimil,
>>
>> This patch increases kmod selftest (stress module loader) runtime by 
>> about
>> ~50-60%, from ~200s to ~300s total execution time. My tested kernel has
>> CONFIG_KVFREE_RCU_BATCHED enabled. Any idea or suggestions on what 
>> might be
>> causing this, or how to address it?
>>
> 
> I have been looking into a regression for Linux v6.18-rc where time 
> taken to run some internal graphics tests on our Tegra234 device has 
> increased from around 35% causing the tests to timeout. Bisect is 

I meant 'increased by around 35%'.

> pointing to this commit and I also see we have CONFIG_KVFREE_RCU_BATCHED=y.
> 
> I have not tried disabling CONFIG_KVFREE_RCU_BATCHED=y but I can. I am 
> not sure if there are any downsides to disabling this?
> 
> Thanks
> Jon
> 

-- 
nvpublic



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-11-27 11:38     ` [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations Jon Hunter
  2025-11-27 11:50       ` Jon Hunter
@ 2025-11-27 12:33       ` Harry Yoo
  2025-11-27 12:48         ` Harry Yoo
  2025-11-27 13:18       ` Vlastimil Babka
  2 siblings, 1 reply; 95+ messages in thread
From: Harry Yoo @ 2025-11-27 12:33 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Daniel Gomez, Vlastimil Babka, Suren Baghdasaryan,
	Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Uladzislau Rezki, Sidhartha Kumar, linux-mm,
	linux-kernel, rcu, maple-tree, linux-modules, Luis Chamberlain,
	Petr Pavlu, Sami Tolvanen, Aaron Tomlin, Lucas De Marchi,
	linux-tegra

On Thu, Nov 27, 2025 at 11:38:49AM +0000, Jon Hunter wrote:
> 
> 
> On 31/10/2025 21:32, Daniel Gomez wrote:
> > 
> > 
> > On 10/09/2025 10.01, Vlastimil Babka wrote:
> > > Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
> > > For caches with sheaves, on each cpu maintain a rcu_free sheaf in
> > > addition to main and spare sheaves.
> > > 
> > > kfree_rcu() operations will try to put objects on this sheaf. Once full,
> > > the sheaf is detached and submitted to call_rcu() with a handler that
> > > will try to put it in the barn, or flush to slab pages using bulk free,
> > > when the barn is full. Then a new empty sheaf must be obtained to put
> > > more objects there.
> > > 
> > > It's possible that no free sheaves are available to use for a new
> > > rcu_free sheaf, and the allocation in kfree_rcu() context can only use
> > > GFP_NOWAIT and thus may fail. In that case, fall back to the existing
> > > kfree_rcu() implementation.
> > > 
> > > Expected advantages:
> > > - batching the kfree_rcu() operations, that could eventually replace the
> > >    existing batching
> > > - sheaves can be reused for allocations via barn instead of being
> > >    flushed to slabs, which is more efficient
> > >    - this includes cases where only some cpus are allowed to process rcu
> > >      callbacks (Android)
> > > 
> > > Possible disadvantage:
> > > - objects might be waiting for more than their grace period (it is
> > >    determined by the last object freed into the sheaf), increasing memory
> > >    usage - but the existing batching does that too.
> > > 
> > > Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
> > > implementation favors smaller memory footprint over performance.
> > > 
> > > Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
> > > contexts where kfree_rcu() is called might not be compatible with taking
> > > a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
> > > spinlock - the current kfree_rcu() implementation avoids doing that.
> > > 
> > > Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
> > > that have them. This is not a cheap operation, but the barrier usage is
> > > rare - currently kmem_cache_destroy() or on module unload.
> > > 
> > > Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
> > > count how many kfree_rcu() used the rcu_free sheaf successfully and how
> > > many had to fall back to the existing implementation.
> > > 
> > > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> > 
> > Hi Vlastimil,
> > 
> > This patch increases kmod selftest (stress module loader) runtime by about
> > ~50-60%, from ~200s to ~300s total execution time. My tested kernel has
> > CONFIG_KVFREE_RCU_BATCHED enabled. Any idea or suggestions on what might be
> > causing this, or how to address it?
> > 
> 
> I have been looking into a regression for Linux v6.18-rc where time taken to
> run some internal graphics tests on our Tegra234 device has increased from
> around 35% causing the tests to timeout. Bisect is pointing to this commit
> and I also see we have CONFIG_KVFREE_RCU_BATCHED=y.

Thanks for reporting! Uh, this has been put aside while I was busy working
on other stuff... but now that we have two people complaining about this,
I'll allocate some time to investigate and improve it.

It'll take some time though :)

> I have not tried disabling CONFIG_KVFREE_RCU_BATCHED=y but I can. I am not
> sure if there are any downsides to disabling this?

I would not recommend doing that, unless you want to sacrifice overall
performance just for the test. Disabling it could create too many RCU
grace periods in the system.

> 
> Thanks
> Jon
> 
> -- 
> nvpublic

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-11-27 12:33       ` Harry Yoo
@ 2025-11-27 12:48         ` Harry Yoo
  2025-11-28  8:57           ` Jon Hunter
  0 siblings, 1 reply; 95+ messages in thread
From: Harry Yoo @ 2025-11-27 12:48 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Daniel Gomez, Vlastimil Babka, Suren Baghdasaryan,
	Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Uladzislau Rezki, Sidhartha Kumar, linux-mm,
	linux-kernel, rcu, maple-tree, linux-modules, Luis Chamberlain,
	Petr Pavlu, Sami Tolvanen, Aaron Tomlin, Lucas De Marchi,
	linux-tegra

On Thu, Nov 27, 2025 at 09:33:46PM +0900, Harry Yoo wrote:
> On Thu, Nov 27, 2025 at 11:38:49AM +0000, Jon Hunter wrote:
> > 
> > 
> > On 31/10/2025 21:32, Daniel Gomez wrote:
> > > 
> > > 
> > > On 10/09/2025 10.01, Vlastimil Babka wrote:
> > > > Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
> > > > For caches with sheaves, on each cpu maintain a rcu_free sheaf in
> > > > addition to main and spare sheaves.
> > > > 
> > > > kfree_rcu() operations will try to put objects on this sheaf. Once full,
> > > > the sheaf is detached and submitted to call_rcu() with a handler that
> > > > will try to put it in the barn, or flush to slab pages using bulk free,
> > > > when the barn is full. Then a new empty sheaf must be obtained to put
> > > > more objects there.
> > > > 
> > > > It's possible that no free sheaves are available to use for a new
> > > > rcu_free sheaf, and the allocation in kfree_rcu() context can only use
> > > > GFP_NOWAIT and thus may fail. In that case, fall back to the existing
> > > > kfree_rcu() implementation.
> > > > 
> > > > Expected advantages:
> > > > - batching the kfree_rcu() operations, that could eventually replace the
> > > >    existing batching
> > > > - sheaves can be reused for allocations via barn instead of being
> > > >    flushed to slabs, which is more efficient
> > > >    - this includes cases where only some cpus are allowed to process rcu
> > > >      callbacks (Android)
> > > > 
> > > > Possible disadvantage:
> > > > - objects might be waiting for more than their grace period (it is
> > > >    determined by the last object freed into the sheaf), increasing memory
> > > >    usage - but the existing batching does that too.
> > > > 
> > > > Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
> > > > implementation favors smaller memory footprint over performance.
> > > > 
> > > > Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
> > > > contexts where kfree_rcu() is called might not be compatible with taking
> > > > a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
> > > > spinlock - the current kfree_rcu() implementation avoids doing that.
> > > > 
> > > > Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
> > > > that have them. This is not a cheap operation, but the barrier usage is
> > > > rare - currently kmem_cache_destroy() or on module unload.
> > > > 
> > > > Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
> > > > count how many kfree_rcu() used the rcu_free sheaf successfully and how
> > > > many had to fall back to the existing implementation.
> > > > 
> > > > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> > > 
> > > Hi Vlastimil,
> > > 
> > > This patch increases kmod selftest (stress module loader) runtime by about
> > > ~50-60%, from ~200s to ~300s total execution time. My tested kernel has
> > > CONFIG_KVFREE_RCU_BATCHED enabled. Any idea or suggestions on what might be
> > > causing this, or how to address it?
> > > 
> > 
> > I have been looking into a regression for Linux v6.18-rc where time taken to
> > run some internal graphics tests on our Tegra234 device has increased from
> > around 35% causing the tests to timeout. Bisect is pointing to this commit
> > and I also see we have CONFIG_KVFREE_RCU_BATCHED=y.
> 
> Thanks for reporting! Uh, this has been put aside while I was busy working
> on other stuff... but now that we have two people complaining about this,
> I'll allocate some time to investigate and improve it.
> 
> It'll take some time though :)

By the way, how many CPUs do you have on your system, and does your
kernel have CONFIG_CODE_TAGGING enabled?

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-11-27 11:38     ` [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations Jon Hunter
  2025-11-27 11:50       ` Jon Hunter
  2025-11-27 12:33       ` Harry Yoo
@ 2025-11-27 13:18       ` Vlastimil Babka
  2025-11-28  8:59         ` Jon Hunter
  2 siblings, 1 reply; 95+ messages in thread
From: Vlastimil Babka @ 2025-11-27 13:18 UTC (permalink / raw)
  To: Jon Hunter, Daniel Gomez, Suren Baghdasaryan, Liam R. Howlett,
	Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, linux-modules,
	Luis Chamberlain, Petr Pavlu, Sami Tolvanen, Aaron Tomlin,
	Lucas De Marchi, linux-tegra

On 11/27/25 12:38, Jon Hunter wrote:
> 
> 
> On 31/10/2025 21:32, Daniel Gomez wrote:
>> 
>> 
>> On 10/09/2025 10.01, Vlastimil Babka wrote:
>> 
>> Hi Vlastimil,
>> 
>> This patch increases kmod selftest (stress module loader) runtime by about
>> ~50-60%, from ~200s to ~300s total execution time. My tested kernel has
>> CONFIG_KVFREE_RCU_BATCHED enabled. Any idea or suggestions on what might be
>> causing this, or how to address it?
>> 
> 
> I have been looking into a regression for Linux v6.18-rc where time 
> taken to run some internal graphics tests on our Tegra234 device has 
> increased from around 35% causing the tests to timeout. Bisect is 
> pointing to this commit and I also see we have CONFIG_KVFREE_RCU_BATCHED=y.

Do the tegra tests involve (frequent) module unloads too, then? Or calling
kmem_cache_destroy() somewhere?

Thanks,
Vlastimil

> I have not tried disabling CONFIG_KVFREE_RCU_BATCHED=y but I can. I am 
> not sure if there are any downsides to disabling this?
> 
> Thanks
> Jon
> 



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-11-05 11:25       ` Vlastimil Babka
@ 2025-11-27 14:00         ` Daniel Gomez
  2025-11-27 19:29           ` Suren Baghdasaryan
  0 siblings, 1 reply; 95+ messages in thread
From: Daniel Gomez @ 2025-11-27 14:00 UTC (permalink / raw)
  To: Vlastimil Babka, Harry Yoo, Suren Baghdasaryan
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Uladzislau Rezki, Sidhartha Kumar, linux-mm,
	linux-kernel, rcu, maple-tree, linux-modules, bpf,
	Luis Chamberlain, Petr Pavlu, Sami Tolvanen, Aaron Tomlin,
	Lucas De Marchi



On 05/11/2025 12.25, Vlastimil Babka wrote:
> On 11/3/25 04:17, Harry Yoo wrote:
>> On Fri, Oct 31, 2025 at 10:32:54PM +0100, Daniel Gomez wrote:
>>>
>>>
>>> On 10/09/2025 10.01, Vlastimil Babka wrote:
>>>> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
>>>> For caches with sheaves, on each cpu maintain a rcu_free sheaf in
>>>> addition to main and spare sheaves.
>>>>
>>>> kfree_rcu() operations will try to put objects on this sheaf. Once full,
>>>> the sheaf is detached and submitted to call_rcu() with a handler that
>>>> will try to put it in the barn, or flush to slab pages using bulk free,
>>>> when the barn is full. Then a new empty sheaf must be obtained to put
>>>> more objects there.
>>>>
>>>> It's possible that no free sheaves are available to use for a new
>>>> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
>>>> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
>>>> kfree_rcu() implementation.
>>>>
>>>> Expected advantages:
>>>> - batching the kfree_rcu() operations, that could eventually replace the
>>>>   existing batching
>>>> - sheaves can be reused for allocations via barn instead of being
>>>>   flushed to slabs, which is more efficient
>>>>   - this includes cases where only some cpus are allowed to process rcu
>>>>     callbacks (Android)
>>>>
>>>> Possible disadvantage:
>>>> - objects might be waiting for more than their grace period (it is
>>>>   determined by the last object freed into the sheaf), increasing memory
>>>>   usage - but the existing batching does that too.
>>>>
>>>> Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
>>>> implementation favors smaller memory footprint over performance.
>>>>
>>>> Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
>>>> contexts where kfree_rcu() is called might not be compatible with taking
>>>> a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
>>>> spinlock - the current kfree_rcu() implementation avoids doing that.
>>>>
>>>> Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
>>>> that have them. This is not a cheap operation, but the barrier usage is
>>>> rare - currently kmem_cache_destroy() or on module unload.
>>>>
>>>> Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
>>>> count how many kfree_rcu() used the rcu_free sheaf successfully and how
>>>> many had to fall back to the existing implementation.
>>>>
>>>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>>>
>>> Hi Vlastimil,
>>>
>>> This patch increases kmod selftest (stress module loader) runtime by about
>>> ~50-60%, from ~200s to ~300s total execution time. My tested kernel has
>>> CONFIG_KVFREE_RCU_BATCHED enabled. Any idea or suggestions on what might be
>>> causing this, or how to address it?
>>
>> This is likely due to increased kvfree_rcu_barrier() during module unload.
> 
> Hm so there are actually two possible sources of this. One is that the
> module creates some kmem_cache and calls kmem_cache_destroy() on it before
> unloading. That does kvfree_rcu_barrier() which iterates all caches via
> flush_all_rcu_sheaves(), but in this case it shouldn't need to - we could
> have a weaker form of kvfree_rcu_barrier() that only guarantees flushing of
> that single cache.

Thanks for the feedback. And thanks to Jon who has revived this again.

> 
> The other source is codetag_unload_module(), and I'm afraid it's this one as
> it's hooked to every module unload. Do you have CONFIG_CODE_TAGGING enabled?

Yes, we do have that enabled.

> Disabling it should help in this case, if you don't need memory allocation
> profiling for that stress test. I think there's some space for improvement -
> when compiled in but memalloc profiling never enabled during the uptime,
> this could probably be skipped? Suren?
> 
>> It currently iterates over all CPUs x slab caches (that enabled sheaves,
>> there should be only a few now) pair to make sure rcu sheaf is flushed
>> by the time kvfree_rcu_barrier() returns.
> 
> Yeah, also it's done under slab_mutex. Is the stress test trying to unload
> multiple modules in parallel? That would make things worse, although I'd
> expect there's a lot of serialization in this area already.

AFAIK, the kmod stress test does not unload modules in parallel. Module unload
happens one at a time before each test iteration. However, test 0008 and 0009
run 300 total sequential module unloads.

ALL_TESTS="$ALL_TESTS 0008:150:1"
ALL_TESTS="$ALL_TESTS 0009:150:1"

> 
> Unfortunately it will get worse with sheaves extended to all caches. We
> could probably mark caches once they allocate their first rcu_free sheaf
> (should not add visible overhead) and keep skipping those that never did.
>> Just being curious, do you have any serious workload that depends on
>> the performance of module unload?

Can we have a combination of a weaker form of kvfree_rcu_barrier() + tracking?
Happy to test this again if you have a patch or something in mind.
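
To illustrate what I mean by tracking, roughly (just a sketch; the
"ever_used_rcu_sheaf" flag below is made up, not an existing field):

	/* in __kfree_rcu_sheaf(), when a cpu first installs an rcu_free sheaf: */
	if (!READ_ONCE(s->ever_used_rcu_sheaf))
		WRITE_ONCE(s->ever_used_rcu_sheaf, true);

	/* in flush_all_rcu_sheaves(), skip caches that never used one: */
	list_for_each_entry(s, &slab_caches, list) {
		if (!s->cpu_sheaves)
			continue;
		if (!READ_ONCE(s->ever_used_rcu_sheaf))
			continue;
		/* ... existing per-cpu flush_rcu_sheaf() queueing ... */
	}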

In addition and AFAIK, module unloading is similar to ebpf programs. Ccing bpf
folks in case they have a workload.

But I don't have a particular workload in mind.


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-11-27 14:00         ` Daniel Gomez
@ 2025-11-27 19:29           ` Suren Baghdasaryan
  2025-11-28 11:37             ` [PATCH V1] mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction Harry Yoo
  0 siblings, 1 reply; 95+ messages in thread
From: Suren Baghdasaryan @ 2025-11-27 19:29 UTC (permalink / raw)
  To: Daniel Gomez
  Cc: Vlastimil Babka, Harry Yoo, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki,
	Sidhartha Kumar, linux-mm, linux-kernel, rcu, maple-tree,
	linux-modules, bpf, Luis Chamberlain, Petr Pavlu, Sami Tolvanen,
	Aaron Tomlin, Lucas De Marchi

On Thu, Nov 27, 2025 at 6:01 AM Daniel Gomez <da.gomez@kernel.org> wrote:
>
>
>
> On 05/11/2025 12.25, Vlastimil Babka wrote:
> > On 11/3/25 04:17, Harry Yoo wrote:
> >> On Fri, Oct 31, 2025 at 10:32:54PM +0100, Daniel Gomez wrote:
> >>>
> >>>
> >>> On 10/09/2025 10.01, Vlastimil Babka wrote:
> >>>> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
> >>>> For caches with sheaves, on each cpu maintain a rcu_free sheaf in
> >>>> addition to main and spare sheaves.
> >>>>
> >>>> kfree_rcu() operations will try to put objects on this sheaf. Once full,
> >>>> the sheaf is detached and submitted to call_rcu() with a handler that
> >>>> will try to put it in the barn, or flush to slab pages using bulk free,
> >>>> when the barn is full. Then a new empty sheaf must be obtained to put
> >>>> more objects there.
> >>>>
> >>>> It's possible that no free sheaves are available to use for a new
> >>>> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
> >>>> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
> >>>> kfree_rcu() implementation.
> >>>>
> >>>> Expected advantages:
> >>>> - batching the kfree_rcu() operations, that could eventually replace the
> >>>>   existing batching
> >>>> - sheaves can be reused for allocations via barn instead of being
> >>>>   flushed to slabs, which is more efficient
> >>>>   - this includes cases where only some cpus are allowed to process rcu
> >>>>     callbacks (Android)
> >>>>
> >>>> Possible disadvantage:
> >>>> - objects might be waiting for more than their grace period (it is
> >>>>   determined by the last object freed into the sheaf), increasing memory
> >>>>   usage - but the existing batching does that too.
> >>>>
> >>>> Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
> >>>> implementation favors smaller memory footprint over performance.
> >>>>
> >>>> Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
> >>>> contexts where kfree_rcu() is called might not be compatible with taking
> >>>> a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
> >>>> spinlock - the current kfree_rcu() implementation avoids doing that.
> >>>>
> >>>> Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
> >>>> that have them. This is not a cheap operation, but the barrier usage is
> >>>> rare - currently kmem_cache_destroy() or on module unload.
> >>>>
> >>>> Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
> >>>> count how many kfree_rcu() used the rcu_free sheaf successfully and how
> >>>> many had to fall back to the existing implementation.
> >>>>
> >>>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> >>>
> >>> Hi Vlastimil,
> >>>
> >>> This patch increases kmod selftest (stress module loader) runtime by about
> >>> ~50-60%, from ~200s to ~300s total execution time. My tested kernel has
> >>> CONFIG_KVFREE_RCU_BATCHED enabled. Any idea or suggestions on what might be
> >>> causing this, or how to address it?
> >>
> >> This is likely due to increased kvfree_rcu_barrier() during module unload.
> >
> > Hm so there are actually two possible sources of this. One is that the
> > module creates some kmem_cache and calls kmem_cache_destroy() on it before
> > unloading. That does kvfree_rcu_barrier() which iterates all caches via
> > flush_all_rcu_sheaves(), but in this case it shouldn't need to - we could
> > have a weaker form of kvfree_rcu_barrier() that only guarantees flushing of
> > that single cache.
>
> Thanks for the feedback. And thanks to Jon who has revived this again.
>
> >
> > The other source is codetag_unload_module(), and I'm afraid it's this one as
> > it's hooked to every module unload. Do you have CONFIG_CODE_TAGGING enabled?
>
> Yes, we do have that enabled.

Sorry I missed this discussion before.
IIUC, the performance is impacted because kvfree_rcu_barrier() has to
flush_all_rcu_sheaves(), and is therefore more costly than before.

>
> > Disabling it should help in this case, if you don't need memory allocation
> > profiling for that stress test. I think there's some space for improvement -
> > when compiled in but memalloc profiling never enabled during the uptime,
> > this could probably be skipped? Suren?

I think yes, we should be able to skip kvfree_rcu_barrier() inside
codetag_unload_module() if profiling was never enabled.
kvfree_rcu_barrier() is there to ensure all potential kfree_rcu()'s
for module allocations are finished before destroying the tags. I'll
need to add an additional "sticky" flag to record that profiling was
used, so that we can detect the case where it was enabled and then
disabled before module unload. I can work on it next week.
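
Roughly something like this (a sketch only; the names below are made
up, not an actual patch):

	/* set once when profiling is first enabled, never cleared */
	static bool mem_profiling_was_used;

	/* wherever profiling gets switched on (boot param / sysctl handler): */
	WRITE_ONCE(mem_profiling_was_used, true);

	/* in the module unload path, before destroying the tags: */
	if (READ_ONCE(mem_profiling_was_used))
		kvfree_rcu_barrier();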

> >
> >> It currently iterates over all CPUs x slab caches (that enabled sheaves,
> >> there should be only a few now) pair to make sure rcu sheaf is flushed
> >> by the time kvfree_rcu_barrier() returns.
> >
> > Yeah, also it's done under slab_mutex. Is the stress test trying to unload
> > multiple modules in parallel? That would make things worse, although I'd
> > expect there's a lot of serialization in this area already.
>
> AFAIK, the kmod stress test does not unload modules in parallel. Module unload
> happens one at a time before each test iteration. However, test 0008 and 0009
> run 300 total sequential module unloads.
>
> ALL_TESTS="$ALL_TESTS 0008:150:1"
> ALL_TESTS="$ALL_TESTS 0009:150:1"
>
> >
> > Unfortunately it will get worse with sheaves extended to all caches. We
> > could probably mark caches once they allocate their first rcu_free sheaf
> > (should not add visible overhead) and keep skipping those that never did.
> >> Just being curious, do you have any serious workload that depends on
> >> the performance of module unload?
>
> Can we have a combination of a weaker form of kvfree_rcu_barrier() + tracking?
> Happy to test this again if you have a patch or something in mind.
>
> In addition and AFAIK, module unloading is similar to ebpf programs. Ccing bpf
> folks in case they have a workload.
>
> But I don't have a particular workload in mind.


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-11-27 12:48         ` Harry Yoo
@ 2025-11-28  8:57           ` Jon Hunter
  2025-12-01  6:55             ` Harry Yoo
  0 siblings, 1 reply; 95+ messages in thread
From: Jon Hunter @ 2025-11-28  8:57 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Daniel Gomez, Vlastimil Babka, Suren Baghdasaryan,
	Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Uladzislau Rezki, Sidhartha Kumar, linux-mm,
	linux-kernel, rcu, maple-tree, linux-modules, Luis Chamberlain,
	Petr Pavlu, Sami Tolvanen, Aaron Tomlin, Lucas De Marchi,
	linux-tegra


On 27/11/2025 12:48, Harry Yoo wrote:

...

>>> I have been looking into a regression for Linux v6.18-rc where time taken to
>>> run some internal graphics tests on our Tegra234 device has increased from
>>> around 35% causing the tests to timeout. Bisect is pointing to this commit
>>> and I also see we have CONFIG_KVFREE_RCU_BATCHED=y.
>>
>> Thanks for reporting! Uh, this has been put aside while I was busy working
>> on other stuff... but now that we have two people complaining about this,
>> I'll allocate some time to investigate and improve it.
>>
>> It'll take some time though :)
> 
> By the way, how many CPUs do you have on your system, and does your
> kernel have CONFIG_CODE_TAGGING enabled?

For this device there are 12 CPUs. I don't see CONFIG_CODE_TAGGING enabled.

Thanks
Jon

-- 
nvpublic



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-11-27 13:18       ` Vlastimil Babka
@ 2025-11-28  8:59         ` Jon Hunter
  0 siblings, 0 replies; 95+ messages in thread
From: Jon Hunter @ 2025-11-28  8:59 UTC (permalink / raw)
  To: Vlastimil Babka, Daniel Gomez, Suren Baghdasaryan,
	Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, linux-modules,
	Luis Chamberlain, Petr Pavlu, Sami Tolvanen, Aaron Tomlin,
	Lucas De Marchi, linux-tegra


On 27/11/2025 13:18, Vlastimil Babka wrote:
> On 11/27/25 12:38, Jon Hunter wrote:
>>
>>
>> On 31/10/2025 21:32, Daniel Gomez wrote:
>>>
>>>
>>> On 10/09/2025 10.01, Vlastimil Babka wrote:
>>>
>>> Hi Vlastimil,
>>>
>>> This patch increases kmod selftest (stress module loader) runtime by about
>>> ~50-60%, from ~200s to ~300s total execution time. My tested kernel has
>>> CONFIG_KVFREE_RCU_BATCHED enabled. Any idea or suggestions on what might be
>>> causing this, or how to address it?
>>>
>>
>> I have been looking into a regression for Linux v6.18-rc where time
>> taken to run some internal graphics tests on our Tegra234 device has
>> increased from around 35% causing the tests to timeout. Bisect is
>> pointing to this commit and I also see we have CONFIG_KVFREE_RCU_BATCHED=y.
> 
> Do the tegra tests involve (frequent) module unloads too, then? Or calling
> kmem_cache_destroy() somewhere?

In this specific case I am not running the tegra-tests but we have an
internal testsuite of GPU-related tests. I don't believe this is
unloading any modules. I can take a look next week to see if
kmem_cache_destroy() is getting called somewhere when these tests run.

Thanks
Jon

-- 
nvpublic



^ permalink raw reply	[flat|nested] 95+ messages in thread

* [PATCH V1] mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction
  2025-11-27 19:29           ` Suren Baghdasaryan
@ 2025-11-28 11:37             ` Harry Yoo
  2025-11-28 12:22               ` Harry Yoo
                                 ` (2 more replies)
  0 siblings, 3 replies; 95+ messages in thread
From: Harry Yoo @ 2025-11-28 11:37 UTC (permalink / raw)
  To: surenb
  Cc: Liam.Howlett, atomlin, bpf, cl, da.gomez, harry.yoo,
	linux-kernel, linux-mm, linux-modules, lucas.demarchi,
	maple-tree, mcgrof, petr.pavlu, rcu, rientjes, roman.gushchin,
	samitolvanen, sidhartha.kumar, urezki, vbabka, jonathanh

Currently, kvfree_rcu_barrier() flushes RCU sheaves across all slab
caches when a cache is destroyed. This is unnecessary when destroying
a slab cache; only the RCU sheaves belonging to the cache being destroyed
need to be flushed.

As suggested by Vlastimil Babka, introduce a weaker form of
kvfree_rcu_barrier() that operates on a specific slab cache and call it
on cache destruction.

The performance benefit is evaluated on a 12 core 24 threads AMD Ryzen
5900X machine (1 socket), by loading slub_kunit module.

Before:
  Total calls: 19
  Average latency (us): 8529
  Total time (us): 162069

After:
  Total calls: 19
  Average latency (us): 3804
  Total time (us): 72287

Link: https://lore.kernel.org/linux-mm/0406562e-2066-4cf8-9902-b2b0616dd742@kernel.org
Link: https://lore.kernel.org/linux-mm/e988eff6-1287-425e-a06c-805af5bbf262@nvidia.com
Link: https://lore.kernel.org/linux-mm/1bda09da-93be-4737-aef0-d47f8c5c9301@suse.cz
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
---

Not sure if the regression is worse on the reporters' machines due to
higher core count (or because some cores were busy doing other things,
dunno).

Hopefully this will reduce the time to complete tests,
and Suren could add his patch on top of this ;)

 include/linux/slab.h |  5 ++++
 mm/slab.h            |  1 +
 mm/slab_common.c     | 52 +++++++++++++++++++++++++++++------------
 mm/slub.c            | 55 ++++++++++++++++++++++++--------------------
 4 files changed, 73 insertions(+), 40 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index cf443f064a66..937c93d44e8c 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -1149,6 +1149,10 @@ static inline void kvfree_rcu_barrier(void)
 {
 	rcu_barrier();
 }
+static inline void kvfree_rcu_barrier_on_cache(struct kmem_cache *s)
+{
+	rcu_barrier();
+}
 
 static inline void kfree_rcu_scheduler_running(void) { }
 #else
@@ -1156,6 +1160,7 @@ void kvfree_rcu_barrier(void);
 
 void kfree_rcu_scheduler_running(void);
 #endif
+void kvfree_rcu_barrier_on_cache(struct kmem_cache *s);
 
 /**
  * kmalloc_size_roundup - Report allocation bucket size for the given size
diff --git a/mm/slab.h b/mm/slab.h
index f730e012553c..e767aa7e91b0 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -422,6 +422,7 @@ static inline bool is_kmalloc_normal(struct kmem_cache *s)
 
 bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj);
 void flush_all_rcu_sheaves(void);
+void flush_rcu_sheaves_on_cache(struct kmem_cache *s);
 
 #define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | \
 			 SLAB_CACHE_DMA32 | SLAB_PANIC | \
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 84dfff4f7b1f..dd8a49d6f9cc 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -492,7 +492,7 @@ void kmem_cache_destroy(struct kmem_cache *s)
 		return;
 
 	/* in-flight kfree_rcu()'s may include objects from our cache */
-	kvfree_rcu_barrier();
+	kvfree_rcu_barrier_on_cache(s);
 
 	if (IS_ENABLED(CONFIG_SLUB_RCU_DEBUG) &&
 	    (s->flags & SLAB_TYPESAFE_BY_RCU)) {
@@ -2038,25 +2038,13 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
 }
 EXPORT_SYMBOL_GPL(kvfree_call_rcu);
 
-/**
- * kvfree_rcu_barrier - Wait until all in-flight kvfree_rcu() complete.
- *
- * Note that a single argument of kvfree_rcu() call has a slow path that
- * triggers synchronize_rcu() following by freeing a pointer. It is done
- * before the return from the function. Therefore for any single-argument
- * call that will result in a kfree() to a cache that is to be destroyed
- * during module exit, it is developer's responsibility to ensure that all
- * such calls have returned before the call to kmem_cache_destroy().
- */
-void kvfree_rcu_barrier(void)
+static inline void __kvfree_rcu_barrier(void)
 {
 	struct kfree_rcu_cpu_work *krwp;
 	struct kfree_rcu_cpu *krcp;
 	bool queued;
 	int i, cpu;
 
-	flush_all_rcu_sheaves();
-
 	/*
 	 * Firstly we detach objects and queue them over an RCU-batch
 	 * for all CPUs. Finally queued works are flushed for each CPU.
@@ -2118,8 +2106,43 @@ void kvfree_rcu_barrier(void)
 		}
 	}
 }
+
+/**
+ * kvfree_rcu_barrier - Wait until all in-flight kvfree_rcu() complete.
+ *
+ * Note that a single argument of kvfree_rcu() call has a slow path that
+ * triggers synchronize_rcu() following by freeing a pointer. It is done
+ * before the return from the function. Therefore for any single-argument
+ * call that will result in a kfree() to a cache that is to be destroyed
+ * during module exit, it is developer's responsibility to ensure that all
+ * such calls have returned before the call to kmem_cache_destroy().
+ */
+void kvfree_rcu_barrier(void)
+{
+	flush_all_rcu_sheaves();
+	__kvfree_rcu_barrier();
+}
 EXPORT_SYMBOL_GPL(kvfree_rcu_barrier);
 
+/**
+ * kvfree_rcu_barrier_on_cache - Wait for in-flight kvfree_rcu() calls on a
+ *                               specific slab cache.
+ * @s: slab cache to wait for
+ *
+ * See the description of kvfree_rcu_barrier() for details.
+ */
+void kvfree_rcu_barrier_on_cache(struct kmem_cache *s)
+{
+	if (s->cpu_sheaves)
+		flush_rcu_sheaves_on_cache(s);
+	/*
+	 * TODO: Introduce a version of __kvfree_rcu_barrier() that works
+	 * on a specific slab cache.
+	 */
+	__kvfree_rcu_barrier();
+}
+EXPORT_SYMBOL_GPL(kvfree_rcu_barrier_on_cache);
+
 static unsigned long
 kfree_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
 {
@@ -2215,4 +2238,3 @@ void __init kvfree_rcu_init(void)
 }
 
 #endif /* CONFIG_KVFREE_RCU_BATCHED */
-
diff --git a/mm/slub.c b/mm/slub.c
index 785e25a14999..7cec2220712b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4118,42 +4118,47 @@ static void flush_rcu_sheaf(struct work_struct *w)
 
 
 /* needed for kvfree_rcu_barrier() */
-void flush_all_rcu_sheaves(void)
+void flush_rcu_sheaves_on_cache(struct kmem_cache *s)
 {
 	struct slub_flush_work *sfw;
-	struct kmem_cache *s;
 	unsigned int cpu;
 
-	cpus_read_lock();
-	mutex_lock(&slab_mutex);
+	mutex_lock(&flush_lock);
 
-	list_for_each_entry(s, &slab_caches, list) {
-		if (!s->cpu_sheaves)
-			continue;
+	for_each_online_cpu(cpu) {
+		sfw = &per_cpu(slub_flush, cpu);
 
-		mutex_lock(&flush_lock);
+		/*
+		 * we don't check if rcu_free sheaf exists - racing
+		 * __kfree_rcu_sheaf() might have just removed it.
+		 * by executing flush_rcu_sheaf() on the cpu we make
+		 * sure the __kfree_rcu_sheaf() finished its call_rcu()
+		 */
 
-		for_each_online_cpu(cpu) {
-			sfw = &per_cpu(slub_flush, cpu);
+		INIT_WORK(&sfw->work, flush_rcu_sheaf);
+		sfw->s = s;
+		queue_work_on(cpu, flushwq, &sfw->work);
+	}
 
-			/*
-			 * we don't check if rcu_free sheaf exists - racing
-			 * __kfree_rcu_sheaf() might have just removed it.
-			 * by executing flush_rcu_sheaf() on the cpu we make
-			 * sure the __kfree_rcu_sheaf() finished its call_rcu()
-			 */
+	for_each_online_cpu(cpu) {
+		sfw = &per_cpu(slub_flush, cpu);
+		flush_work(&sfw->work);
+	}
 
-			INIT_WORK(&sfw->work, flush_rcu_sheaf);
-			sfw->s = s;
-			queue_work_on(cpu, flushwq, &sfw->work);
-		}
+	mutex_unlock(&flush_lock);
+}
 
-		for_each_online_cpu(cpu) {
-			sfw = &per_cpu(slub_flush, cpu);
-			flush_work(&sfw->work);
-		}
+void flush_all_rcu_sheaves(void)
+{
+	struct kmem_cache *s;
+
+	cpus_read_lock();
+	mutex_lock(&slab_mutex);
 
-		mutex_unlock(&flush_lock);
+	list_for_each_entry(s, &slab_caches, list) {
+		if (!s->cpu_sheaves)
+			continue;
+		flush_rcu_sheaves_on_cache(s);
 	}
 
 	mutex_unlock(&slab_mutex);
-- 
2.43.0



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH V1] mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction
  2025-11-28 11:37             ` [PATCH V1] mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction Harry Yoo
@ 2025-11-28 12:22               ` Harry Yoo
  2025-11-28 12:38               ` Daniel Gomez
  2025-12-02  9:29               ` Jon Hunter
  2 siblings, 0 replies; 95+ messages in thread
From: Harry Yoo @ 2025-11-28 12:22 UTC (permalink / raw)
  To: surenb
  Cc: Liam.Howlett, atomlin, bpf, cl, da.gomez, linux-kernel, linux-mm,
	linux-modules, lucas.demarchi, maple-tree, mcgrof, petr.pavlu,
	rcu, rientjes, roman.gushchin, samitolvanen, sidhartha.kumar,
	urezki, vbabka, jonathanh

On Fri, Nov 28, 2025 at 08:37:40PM +0900, Harry Yoo wrote:
> Currently, kvfree_rcu_barrier() flushes RCU sheaves across all slab
> caches when a cache is destroyed. This is unnecessary when destroying
> a slab cache; only the RCU sheaves belonging to the cache being destroyed
> need to be flushed.
> 
> As suggested by Vlastimil Babka, introduce a weaker form of
> kvfree_rcu_barrier() that operates on a specific slab cache and call it
> on cache destruction.
> 
> The performance benefit is evaluated on a 12 core 24 threads AMD Ryzen
> 5900X machine (1 socket), by loading slub_kunit module.
> 
> Before:
>   Total calls: 19
>   Average latency (us): 8529
>   Total time (us): 162069
> 
> After:
>   Total calls: 19
>   Average latency (us): 3804
>   Total time (us): 72287

Ooh, I just realized that I messed up the config and
have only two cores enabled. Will update the numbers after enabling 22 more :)

> Link: https://lore.kernel.org/linux-mm/0406562e-2066-4cf8-9902-b2b0616dd742@kernel.org
> Link: https://lore.kernel.org/linux-mm/e988eff6-1287-425e-a06c-805af5bbf262@nvidia.com
> Link: https://lore.kernel.org/linux-mm/1bda09da-93be-4737-aef0-d47f8c5c9301@suse.cz
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
> ---
> 
> Not sure if the regression is worse on the reporters' machines due to
> higher core count (or because some cores were busy doing other things,
> dunno).
> 
> Hopefully this will reduce the time to complete tests,
> and Suren could add his patch on top of this ;)
> 
>  include/linux/slab.h |  5 ++++
>  mm/slab.h            |  1 +
>  mm/slab_common.c     | 52 +++++++++++++++++++++++++++++------------
>  mm/slub.c            | 55 ++++++++++++++++++++++++--------------------
>  4 files changed, 73 insertions(+), 40 deletions(-)
> 
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index cf443f064a66..937c93d44e8c 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -1149,6 +1149,10 @@ static inline void kvfree_rcu_barrier(void)
>  {
>  	rcu_barrier();
>  }
> +static inline void kvfree_rcu_barrier_on_cache(struct kmem_cache *s)
> +{
> +	rcu_barrier();
> +}
>  
>  static inline void kfree_rcu_scheduler_running(void) { }
>  #else
> @@ -1156,6 +1160,7 @@ void kvfree_rcu_barrier(void);
>  
>  void kfree_rcu_scheduler_running(void);
>  #endif
> +void kvfree_rcu_barrier_on_cache(struct kmem_cache *s);
>  
>  /**
>   * kmalloc_size_roundup - Report allocation bucket size for the given size
> diff --git a/mm/slab.h b/mm/slab.h
> index f730e012553c..e767aa7e91b0 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -422,6 +422,7 @@ static inline bool is_kmalloc_normal(struct kmem_cache *s)
>  
>  bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj);
>  void flush_all_rcu_sheaves(void);
> +void flush_rcu_sheaves_on_cache(struct kmem_cache *s);
>  
>  #define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | \
>  			 SLAB_CACHE_DMA32 | SLAB_PANIC | \
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 84dfff4f7b1f..dd8a49d6f9cc 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -492,7 +492,7 @@ void kmem_cache_destroy(struct kmem_cache *s)
>  		return;
>  
>  	/* in-flight kfree_rcu()'s may include objects from our cache */
> -	kvfree_rcu_barrier();
> +	kvfree_rcu_barrier_on_cache(s);
>  
>  	if (IS_ENABLED(CONFIG_SLUB_RCU_DEBUG) &&
>  	    (s->flags & SLAB_TYPESAFE_BY_RCU)) {
> @@ -2038,25 +2038,13 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
>  }
>  EXPORT_SYMBOL_GPL(kvfree_call_rcu);
>  
> -/**
> - * kvfree_rcu_barrier - Wait until all in-flight kvfree_rcu() complete.
> - *
> - * Note that a single argument of kvfree_rcu() call has a slow path that
> - * triggers synchronize_rcu() following by freeing a pointer. It is done
> - * before the return from the function. Therefore for any single-argument
> - * call that will result in a kfree() to a cache that is to be destroyed
> - * during module exit, it is developer's responsibility to ensure that all
> - * such calls have returned before the call to kmem_cache_destroy().
> - */
> -void kvfree_rcu_barrier(void)
> +static inline void __kvfree_rcu_barrier(void)
>  {
>  	struct kfree_rcu_cpu_work *krwp;
>  	struct kfree_rcu_cpu *krcp;
>  	bool queued;
>  	int i, cpu;
>  
> -	flush_all_rcu_sheaves();
> -
>  	/*
>  	 * Firstly we detach objects and queue them over an RCU-batch
>  	 * for all CPUs. Finally queued works are flushed for each CPU.
> @@ -2118,8 +2106,43 @@ void kvfree_rcu_barrier(void)
>  		}
>  	}
>  }
> +
> +/**
> + * kvfree_rcu_barrier - Wait until all in-flight kvfree_rcu() complete.
> + *
> + * Note that a single argument of kvfree_rcu() call has a slow path that
> + * triggers synchronize_rcu() following by freeing a pointer. It is done
> + * before the return from the function. Therefore for any single-argument
> + * call that will result in a kfree() to a cache that is to be destroyed
> + * during module exit, it is developer's responsibility to ensure that all
> + * such calls have returned before the call to kmem_cache_destroy().
> + */
> +void kvfree_rcu_barrier(void)
> +{
> +	flush_all_rcu_sheaves();
> +	__kvfree_rcu_barrier();
> +}
>  EXPORT_SYMBOL_GPL(kvfree_rcu_barrier);
>  
> +/**
> + * kvfree_rcu_barrier_on_cache - Wait for in-flight kvfree_rcu() calls on a
> + *                               specific slab cache.
> + * @s: slab cache to wait for
> + *
> + * See the description of kvfree_rcu_barrier() for details.
> + */
> +void kvfree_rcu_barrier_on_cache(struct kmem_cache *s)
> +{
> +	if (s->cpu_sheaves)
> +		flush_rcu_sheaves_on_cache(s);
> +	/*
> +	 * TODO: Introduce a version of __kvfree_rcu_barrier() that works
> +	 * on a specific slab cache.
> +	 */
> +	__kvfree_rcu_barrier();
> +}
> +EXPORT_SYMBOL_GPL(kvfree_rcu_barrier_on_cache);
> +
>  static unsigned long
>  kfree_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
>  {
> @@ -2215,4 +2238,3 @@ void __init kvfree_rcu_init(void)
>  }
>  
>  #endif /* CONFIG_KVFREE_RCU_BATCHED */
> -
> diff --git a/mm/slub.c b/mm/slub.c
> index 785e25a14999..7cec2220712b 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4118,42 +4118,47 @@ static void flush_rcu_sheaf(struct work_struct *w)
>  
>  
>  /* needed for kvfree_rcu_barrier() */
> -void flush_all_rcu_sheaves(void)
> +void flush_rcu_sheaves_on_cache(struct kmem_cache *s)
>  {
>  	struct slub_flush_work *sfw;
> -	struct kmem_cache *s;
>  	unsigned int cpu;
>  
> -	cpus_read_lock();
> -	mutex_lock(&slab_mutex);
> +	mutex_lock(&flush_lock);
>  
> -	list_for_each_entry(s, &slab_caches, list) {
> -		if (!s->cpu_sheaves)
> -			continue;
> +	for_each_online_cpu(cpu) {
> +		sfw = &per_cpu(slub_flush, cpu);
>  
> -		mutex_lock(&flush_lock);
> +		/*
> +		 * we don't check if rcu_free sheaf exists - racing
> +		 * __kfree_rcu_sheaf() might have just removed it.
> +		 * by executing flush_rcu_sheaf() on the cpu we make
> +		 * sure the __kfree_rcu_sheaf() finished its call_rcu()
> +		 */
>  
> -		for_each_online_cpu(cpu) {
> -			sfw = &per_cpu(slub_flush, cpu);
> +		INIT_WORK(&sfw->work, flush_rcu_sheaf);
> +		sfw->s = s;
> +		queue_work_on(cpu, flushwq, &sfw->work);
> +	}
>  
> -			/*
> -			 * we don't check if rcu_free sheaf exists - racing
> -			 * __kfree_rcu_sheaf() might have just removed it.
> -			 * by executing flush_rcu_sheaf() on the cpu we make
> -			 * sure the __kfree_rcu_sheaf() finished its call_rcu()
> -			 */
> +	for_each_online_cpu(cpu) {
> +		sfw = &per_cpu(slub_flush, cpu);
> +		flush_work(&sfw->work);
> +	}
>  
> -			INIT_WORK(&sfw->work, flush_rcu_sheaf);
> -			sfw->s = s;
> -			queue_work_on(cpu, flushwq, &sfw->work);
> -		}
> +	mutex_unlock(&flush_lock);
> +}
>  
> -		for_each_online_cpu(cpu) {
> -			sfw = &per_cpu(slub_flush, cpu);
> -			flush_work(&sfw->work);
> -		}
> +void flush_all_rcu_sheaves(void)
> +{
> +	struct kmem_cache *s;
> +
> +	cpus_read_lock();
> +	mutex_lock(&slab_mutex);
>  
> -		mutex_unlock(&flush_lock);
> +	list_for_each_entry(s, &slab_caches, list) {
> +		if (!s->cpu_sheaves)
> +			continue;
> +		flush_rcu_sheaves_on_cache(s);
>  	}
>  
>  	mutex_unlock(&slab_mutex);
> -- 
> 2.43.0
> 

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH V1] mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction
  2025-11-28 11:37             ` [PATCH V1] mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction Harry Yoo
  2025-11-28 12:22               ` Harry Yoo
@ 2025-11-28 12:38               ` Daniel Gomez
  2025-12-02  9:29               ` Jon Hunter
  2 siblings, 0 replies; 95+ messages in thread
From: Daniel Gomez @ 2025-11-28 12:38 UTC (permalink / raw)
  To: Harry Yoo, surenb
  Cc: Liam.Howlett, atomlin, bpf, cl, linux-kernel, linux-mm,
	linux-modules, lucas.demarchi, maple-tree, mcgrof, petr.pavlu,
	rcu, rientjes, roman.gushchin, samitolvanen, sidhartha.kumar,
	urezki, vbabka, jonathanh



On 28/11/2025 12.37, Harry Yoo wrote:
> Currently, kvfree_rcu_barrier() flushes RCU sheaves across all slab
> caches when a cache is destroyed. This is unnecessary when destroying
> a slab cache; only the RCU sheaves belonging to the cache being destroyed
> need to be flushed.
> 
> As suggested by Vlastimil Babka, introduce a weaker form of
> kvfree_rcu_barrier() that operates on a specific slab cache and call it
> on cache destruction.
> 
> The performance benefit is evaluated on a 12 core 24 threads AMD Ryzen
> 5900X machine (1 socket), by loading slub_kunit module.
> 
> Before:
>   Total calls: 19
>   Average latency (us): 8529
>   Total time (us): 162069
> 
> After:
>   Total calls: 19
>   Average latency (us): 3804
>   Total time (us): 72287
> 
> Link: https://lore.kernel.org/linux-mm/0406562e-2066-4cf8-9902-b2b0616dd742@kernel.org
> Link: https://lore.kernel.org/linux-mm/e988eff6-1287-425e-a06c-805af5bbf262@nvidia.com
> Link: https://lore.kernel.org/linux-mm/1bda09da-93be-4737-aef0-d47f8c5c9301@suse.cz
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
> ---

Thanks Harry for the patch,

A quick test on a different machine from the one I originally used to report
this shows a decrease from 214s to 100s.

LGTM,

Tested-by: Daniel Gomez <da.gomez@samsung.com>

> 
> Not sure if the regression is worse on the reporters' machines due to
> higher core count (or because some cores were busy doing other things,
> dunno).

FWIW, in CI the module tests run on an 8 core VM. Depending on the host CPU,
the absolute numbers differ, but the relative performance degradation was
equivalent.

> 
> Hopefully this will reduce the time to complete tests,
> and Suren could add his patch on top of this ;)
> 
>  include/linux/slab.h |  5 ++++
>  mm/slab.h            |  1 +
>  mm/slab_common.c     | 52 +++++++++++++++++++++++++++++------------
>  mm/slub.c            | 55 ++++++++++++++++++++++++--------------------
>  4 files changed, 73 insertions(+), 40 deletions(-)


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-11-28  8:57           ` Jon Hunter
@ 2025-12-01  6:55             ` Harry Yoo
  0 siblings, 0 replies; 95+ messages in thread
From: Harry Yoo @ 2025-12-01  6:55 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Daniel Gomez, Vlastimil Babka, Suren Baghdasaryan,
	Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Uladzislau Rezki, Sidhartha Kumar, linux-mm,
	linux-kernel, rcu, maple-tree, linux-modules, Luis Chamberlain,
	Petr Pavlu, Sami Tolvanen, Aaron Tomlin, Lucas De Marchi,
	linux-tegra

On Fri, Nov 28, 2025 at 08:57:28AM +0000, Jon Hunter wrote:
> 
> On 27/11/2025 12:48, Harry Yoo wrote:
> 
> ...
> 
> > > > I have been looking into a regression for Linux v6.18-rc where time taken to
> > > > run some internal graphics tests on our Tegra234 device has increased from
> > > > around 35% causing the tests to timeout. Bisect is pointing to this commit
> > > > and I also see we have CONFIG_KVFREE_RCU_BATCHED=y.
> > > 
> > > Thanks for reporting! Uh, this has been put aside while I was busy working
> > > on other stuff... but now that we have two people complaining about this,
> > > I'll allocate some time to investigate and improve it.
> > > 
> > > It'll take some time though :)
> > 
> > By the way, how many CPUs do you have on your system, and does your
> > kernel have CONFIG_CODE_TAGGING enabled?
> 
> For this device there are 12 CPUs. I don't see CONFIG_CODE_TAGGING enabled.

Thanks! Then it's probably due to kmem_cache_destroy().
Please let me know if this patch improves your test execution time.

https://lore.kernel.org/linux-mm/20251128113740.90129-1-harry.yoo@oracle.com/

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 95+ messages in thread

* [PATCH] slub: add barn_get_full_sheaf() and refine empty-main sheaf
  2025-09-10  8:01 ` [PATCH v8 03/23] slab: add opt-in caching layer of percpu sheaves Vlastimil Babka
@ 2025-12-02  8:48   ` Hao Li
  2025-12-02  8:55     ` Hao Li
  2025-12-02  9:00   ` slub: add barn_get_full_sheaf() and refine empty-main sheaf replacement Hao Li
  1 sibling, 1 reply; 95+ messages in thread
From: Hao Li @ 2025-12-02  8:48 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki,
	Sidhartha Kumar, linux-mm, linux-kernel, rcu, maple-tree,
	Venkat Rao Bagalkote

Introduce barn_get_full_sheaf(), a helper that detaches a full sheaf from
the per-node barn without requiring an empty sheaf in exchange.

Use this helper in __pcs_replace_empty_main() to change how an empty main
per-CPU sheaf is handled:

  - If pcs->spare is NULL and pcs->main is empty, first try to obtain a
    full sheaf from the barn via barn_get_full_sheaf(). On success, park
    the empty main sheaf in pcs->spare and install the full sheaf as the
    new pcs->main.

  - If pcs->spare already exists and has objects, keep the existing
    behavior of simply swapping pcs->main and pcs->spare.

  - Only when both pcs->main and pcs->spare are empty do we fall back to
    barn_replace_empty_sheaf() and trade the empty main sheaf into the
    barn in exchange for a full one.

This makes the empty-main path more symmetric with __pcs_replace_full_main(),
which for a full main sheaf parks the full sheaf in pcs->spare and pulls an
empty sheaf from the barn. It also matches the documented design more closely:

  "When both percpu sheaves are found empty during an allocation, an empty
   sheaf may be replaced with a full one from the per-node barn."

Signed-off-by: Hao Li <haoli.tcs@gmail.com>
---

* This patch is based on b4/sheaves-for-all branch

 mm/slub.c | 50 +++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 43 insertions(+), 7 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index a94c64f56504..1fd28aa204e1 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2746,6 +2746,32 @@ static void pcs_destroy(struct kmem_cache *s)
 	s->cpu_sheaves = NULL;
 }
 
+static struct slab_sheaf *barn_get_full_sheaf(struct node_barn *barn,
+					       bool allow_spin)
+{
+	struct slab_sheaf *full = NULL;
+	unsigned long flags;
+
+	if (!data_race(barn->nr_full))
+		return NULL;
+
+	if (likely(allow_spin))
+		spin_lock_irqsave(&barn->lock, flags);
+	else if (!spin_trylock_irqsave(&barn->lock, flags))
+		return NULL;
+
+	if (likely(barn->nr_full)) {
+		full = list_first_entry(&barn->sheaves_full,
+					 struct slab_sheaf, barn_list);
+		list_del(&full->barn_list);
+		barn->nr_full--;
+	}
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+
+	return full;
+}
+
 static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn,
 					       bool allow_spin)
 {
@@ -4120,7 +4146,7 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
 	struct slab_sheaf *empty = NULL;
 	struct slab_sheaf *full;
 	struct node_barn *barn;
-	bool can_alloc;
+	bool can_alloc, allow_spin;
 
 	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
 
@@ -4130,10 +4156,7 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
 		return NULL;
 	}
 
-	if (pcs->spare && pcs->spare->size > 0) {
-		swap(pcs->main, pcs->spare);
-		return pcs;
-	}
+	allow_spin = gfpflags_allow_spinning(gfp);
 
 	barn = get_barn(s);
 	if (!barn) {
@@ -4141,8 +4164,21 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
 		return NULL;
 	}
 
-	full = barn_replace_empty_sheaf(barn, pcs->main,
-					gfpflags_allow_spinning(gfp));
+	if (!pcs->spare) {
+		full = barn_get_full_sheaf(barn, allow_spin);
+		if (full) {
+			pcs->spare = pcs->main;
+			pcs->main = full;
+			return pcs;
+		}
+	} else if (pcs->spare->size > 0) {
+		swap(pcs->main, pcs->spare);
+		return pcs;
+	}
+
+	/* both main and spare are empty */
+
+	full = barn_replace_empty_sheaf(barn, pcs->main, allow_spin);
 
 	if (full) {
 		stat(s, BARN_GET);
-- 
2.50.1


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH] slub: add barn_get_full_sheaf() and refine empty-main sheaf
  2025-12-02  8:48   ` [PATCH] slub: add barn_get_full_sheaf() and refine empty-main sheaf Hao Li
@ 2025-12-02  8:55     ` Hao Li
  0 siblings, 0 replies; 95+ messages in thread
From: Hao Li @ 2025-12-02  8:55 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki,
	Sidhartha Kumar, linux-mm, linux-kernel, rcu, maple-tree,
	Venkat Rao Bagalkote

Sorry, my postponed messages got messed up. I will resend.


^ permalink raw reply	[flat|nested] 95+ messages in thread

* slub: add barn_get_full_sheaf() and refine empty-main sheaf replacement
  2025-09-10  8:01 ` [PATCH v8 03/23] slab: add opt-in caching layer of percpu sheaves Vlastimil Babka
  2025-12-02  8:48   ` [PATCH] slub: add barn_get_full_sheaf() and refine empty-main sheaf Hao Li
@ 2025-12-02  9:00   ` Hao Li
  2025-12-03  5:46     ` Harry Yoo
  1 sibling, 1 reply; 95+ messages in thread
From: Hao Li @ 2025-12-02  9:00 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki,
	Sidhartha Kumar, linux-mm, linux-kernel, rcu, maple-tree,
	Venkat Rao Bagalkote

Introduce barn_get_full_sheaf(), a helper that detaches a full sheaf from
the per-node barn without requiring an empty sheaf in exchange.

Use this helper in __pcs_replace_empty_main() to change how an empty main
per-CPU sheaf is handled:

  - If pcs->spare is NULL and pcs->main is empty, first try to obtain a
    full sheaf from the barn via barn_get_full_sheaf(). On success, park
    the empty main sheaf in pcs->spare and install the full sheaf as the
    new pcs->main.

  - If pcs->spare already exists and has objects, keep the existing
    behavior of simply swapping pcs->main and pcs->spare.

  - Only when both pcs->main and pcs->spare are empty do we fall back to
    barn_replace_empty_sheaf() and trade the empty main sheaf into the
    barn in exchange for a full one.

This makes the empty-main path more symmetric with __pcs_replace_full_main(),
which parks a full main sheaf in pcs->spare and pulls an empty sheaf from the
barn. It also matches the documented design more closely:

  "When both percpu sheaves are found empty during an allocation, an empty
   sheaf may be replaced with a full one from the per-node barn."
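
In simplified form, the resulting flow in __pcs_replace_empty_main() looks
roughly as follows (a sketch of the hunk in the diff below; the stat()
accounting and the remaining fallback handling are omitted):

  if (!pcs->spare) {
          full = barn_get_full_sheaf(barn, allow_spin);
          if (full) {
                  /* park the empty main sheaf, install the full one */
                  pcs->spare = pcs->main;
                  pcs->main = full;
                  return pcs;
          }
  } else if (pcs->spare->size > 0) {
          /* spare has objects: just swap it with the empty main sheaf */
          swap(pcs->main, pcs->spare);
          return pcs;
  }

  /* both main and spare are empty: trade main into the barn for a full one */
  full = barn_replace_empty_sheaf(barn, pcs->main, allow_spin);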

Signed-off-by: Hao Li <haoli.tcs@gmail.com>
---

* This patch is based on b4/sheaves-for-all branch

 mm/slub.c | 50 +++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 43 insertions(+), 7 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index a94c64f56504..1fd28aa204e1 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2746,6 +2746,32 @@ static void pcs_destroy(struct kmem_cache *s)
 	s->cpu_sheaves = NULL;
 }
 
+static struct slab_sheaf *barn_get_full_sheaf(struct node_barn *barn,
+					       bool allow_spin)
+{
+	struct slab_sheaf *full = NULL;
+	unsigned long flags;
+
+	if (!data_race(barn->nr_full))
+		return NULL;
+
+	if (likely(allow_spin))
+		spin_lock_irqsave(&barn->lock, flags);
+	else if (!spin_trylock_irqsave(&barn->lock, flags))
+		return NULL;
+
+	if (likely(barn->nr_full)) {
+		full = list_first_entry(&barn->sheaves_full,
+					 struct slab_sheaf, barn_list);
+		list_del(&full->barn_list);
+		barn->nr_full--;
+	}
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+
+	return full;
+}
+
 static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn,
 					       bool allow_spin)
 {
@@ -4120,7 +4146,7 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
 	struct slab_sheaf *empty = NULL;
 	struct slab_sheaf *full;
 	struct node_barn *barn;
-	bool can_alloc;
+	bool can_alloc, allow_spin;
 
 	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
 
@@ -4130,10 +4156,7 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
 		return NULL;
 	}
 
-	if (pcs->spare && pcs->spare->size > 0) {
-		swap(pcs->main, pcs->spare);
-		return pcs;
-	}
+	allow_spin = gfpflags_allow_spinning(gfp);
 
 	barn = get_barn(s);
 	if (!barn) {
@@ -4141,8 +4164,21 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
 		return NULL;
 	}
 
-	full = barn_replace_empty_sheaf(barn, pcs->main,
-					gfpflags_allow_spinning(gfp));
+	if (!pcs->spare) {
+		full = barn_get_full_sheaf(barn, allow_spin);
+		if (full) {
+			pcs->spare = pcs->main;
+			pcs->main = full;
+			return pcs;
+		}
+	} else if (pcs->spare->size > 0) {
+		swap(pcs->main, pcs->spare);
+		return pcs;
+	}
+
+	/* both main and spare are empty */
+
+	full = barn_replace_empty_sheaf(barn, pcs->main, allow_spin);
 
 	if (full) {
 		stat(s, BARN_GET);
-- 
2.50.1



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH V1] mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction
  2025-11-28 11:37             ` [PATCH V1] mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction Harry Yoo
  2025-11-28 12:22               ` Harry Yoo
  2025-11-28 12:38               ` Daniel Gomez
@ 2025-12-02  9:29               ` Jon Hunter
  2025-12-02 10:18                 ` Harry Yoo
  2 siblings, 1 reply; 95+ messages in thread
From: Jon Hunter @ 2025-12-02  9:29 UTC (permalink / raw)
  To: Harry Yoo, surenb
  Cc: Liam.Howlett, atomlin, bpf, cl, da.gomez, linux-kernel, linux-mm,
	linux-modules, lucas.demarchi, maple-tree, mcgrof, petr.pavlu,
	rcu, rientjes, roman.gushchin, samitolvanen, sidhartha.kumar,
	urezki, vbabka, linux-tegra


On 28/11/2025 11:37, Harry Yoo wrote:
> Currently, kvfree_rcu_barrier() flushes RCU sheaves across all slab
> caches when a cache is destroyed. This is unnecessary when destroying
> a slab cache; only the RCU sheaves belonging to the cache being destroyed
> need to be flushed.
> 
> As suggested by Vlastimil Babka, introduce a weaker form of
> kvfree_rcu_barrier() that operates on a specific slab cache and call it
> on cache destruction.
> 
> The performance benefit is evaluated on a 12 core 24 threads AMD Ryzen
> 5900X machine (1 socket), by loading slub_kunit module.
> 
> Before:
>    Total calls: 19
>    Average latency (us): 8529
>    Total time (us): 162069
> 
> After:
>    Total calls: 19
>    Average latency (us): 3804
>    Total time (us): 72287
> 
> Link: https://lore.kernel.org/linux-mm/0406562e-2066-4cf8-9902-b2b0616dd742@kernel.org
> Link: https://lore.kernel.org/linux-mm/e988eff6-1287-425e-a06c-805af5bbf262@nvidia.com
> Link: https://lore.kernel.org/linux-mm/1bda09da-93be-4737-aef0-d47f8c5c9301@suse.cz
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
> ---
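
To make concrete why a per-cache barrier helps here, below is a minimal
userspace toy model (not the kernel code): each cache is reduced to a count
of objects still queued via kvfree_rcu(), the barrier cost is assumed to be
proportional to that count, and the cache names and numbers are invented.

#include <stdio.h>

/* toy stand-in for a slab cache with objects still queued via kvfree_rcu() */
struct toy_cache {
	const char *name;
	int pending_rcu_objs;
};

static struct toy_cache caches[] = {
	{ "maple_node",      5000 },
	{ "vm_area_struct",  3000 },
	{ "slub_kunit_test",   10 },	/* the cache being destroyed */
};

/* global barrier: must wait for every cache's pending objects */
static int barrier_cost_all(void)
{
	int cost = 0;

	for (unsigned int i = 0; i < sizeof(caches) / sizeof(caches[0]); i++)
		cost += caches[i].pending_rcu_objs;
	return cost;
}

/* per-cache barrier: only the cache being destroyed matters */
static int barrier_cost_on_cache(struct toy_cache *c)
{
	return c->pending_rcu_objs;
}

int main(void)
{
	printf("global barrier cost:    %d\n", barrier_cost_all());
	printf("per-cache barrier cost: %d\n",
	       barrier_cost_on_cache(&caches[2]));
	return 0;
}

Running it shows the global cost dominated by the unrelated caches' pending
objects, while the per-cache cost reflects only the destroyed cache - roughly
the effect the slub_kunit latency numbers above show at full scale.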

Thanks for the rapid fix. I have been testing this and can confirm that 
this does fix the performance regression I was seeing.

BTW shouldn't we add a 'Fixes:' tag above? I would like to ensure that 
this gets picked up for v6.18 stable.

Otherwise ...

Tested-by: Jon Hunter <jonathanh@nvidia.com>

Thanks!
Jon

-- 
nvpublic



^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH V1] mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction
  2025-12-02  9:29               ` Jon Hunter
@ 2025-12-02 10:18                 ` Harry Yoo
  0 siblings, 0 replies; 95+ messages in thread
From: Harry Yoo @ 2025-12-02 10:18 UTC (permalink / raw)
  To: Jon Hunter
  Cc: surenb, Liam.Howlett, atomlin, bpf, cl, da.gomez, linux-kernel,
	linux-mm, linux-modules, lucas.demarchi, maple-tree, mcgrof,
	petr.pavlu, rcu, rientjes, roman.gushchin, samitolvanen,
	sidhartha.kumar, urezki, vbabka, linux-tegra

On Tue, Dec 02, 2025 at 09:29:17AM +0000, Jon Hunter wrote:
> 
> On 28/11/2025 11:37, Harry Yoo wrote:
> > Currently, kvfree_rcu_barrier() flushes RCU sheaves across all slab
> > caches when a cache is destroyed. This is unnecessary when destroying
> > a slab cache; only the RCU sheaves belonging to the cache being destroyed
> > need to be flushed.
> > 
> > As suggested by Vlastimil Babka, introduce a weaker form of
> > kvfree_rcu_barrier() that operates on a specific slab cache and call it
> > on cache destruction.
> > 
> > The performance benefit is evaluated on a 12 core 24 threads AMD Ryzen
> > 5900X machine (1 socket), by loading slub_kunit module.
> > 
> > Before:
> >    Total calls: 19
> >    Average latency (us): 8529
> >    Total time (us): 162069
> > 
> > After:
> >    Total calls: 19
> >    Average latency (us): 3804
> >    Total time (us): 72287
> > 
> > Link: https://lore.kernel.org/linux-mm/0406562e-2066-4cf8-9902-b2b0616dd742@kernel.org
> > Link: https://lore.kernel.org/linux-mm/e988eff6-1287-425e-a06c-805af5bbf262@nvidia.com
> > Link: https://lore.kernel.org/linux-mm/1bda09da-93be-4737-aef0-d47f8c5c9301@suse.cz
> > Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> > Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
> > ---
> 
> Thanks for the rapid fix. I have been testing this and can confirm that this
> does fix the performance regression I was seeing.

Great!

> BTW shouldn't we add a 'Fixes:' tag above? I would like to ensure that this
> gets picked up for v6.18 stable.

Good point, I added Cc: stable and Fixes: tags.
(and your and Daniel's Reported-and-tested-by: tags)

> Otherwise ...
> 
> Tested-by: Jon Hunter <jonathanh@nvidia.com>

Thank you Jon and Daniel a lot for reporting the regression and testing the fix!

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: slub: add barn_get_full_sheaf() and refine empty-main sheaf replacement
  2025-12-02  9:00   ` slub: add barn_get_full_sheaf() and refine empty-main sheaf replacement Hao Li
@ 2025-12-03  5:46     ` Harry Yoo
  2025-12-03 11:15       ` Hao Li
  0 siblings, 1 reply; 95+ messages in thread
From: Harry Yoo @ 2025-12-03  5:46 UTC (permalink / raw)
  To: Hao Li
  Cc: Vlastimil Babka, Suren Baghdasaryan, Liam R. Howlett,
	Christoph Lameter, David Rientjes, Roman Gushchin,
	Uladzislau Rezki, Sidhartha Kumar, linux-mm, linux-kernel, rcu,
	maple-tree, Venkat Rao Bagalkote

On Tue, Dec 02, 2025 at 05:00:08PM +0800, Hao Li wrote:
> Introduce barn_get_full_sheaf(), a helper that detaches a full sheaf from
> the per-node barn without requiring an empty sheaf in exchange.
> 
> Use this helper in __pcs_replace_empty_main() to change how an empty main
> per-CPU sheaf is handled:
> 
>   - If pcs->spare is NULL and pcs->main is empty, first try to obtain a
>     full sheaf from the barn via barn_get_full_sheaf(). On success, park
>     the empty main sheaf in pcs->spare and install the full sheaf as the
>     new pcs->main.
> 
>   - If pcs->spare already exists and has objects, keep the existing
>     behavior of simply swapping pcs->main and pcs->spare.
> 
>   - Only when both pcs->main and pcs->spare are empty do we fall back to
>     barn_replace_empty_sheaf() and trade the empty main sheaf into the
>     barn in exchange for a full one.

Hi Hao,

Yeah, this is a very subtle difference between __pcs_replace_full_main()
and __pcs_replace_empty_main(), in that the former installs the full main
sheaf in pcs->spare, while the latter replaces the empty main sheaf with
a full sheaf from the barn without populating pcs->spare.

Is it intentional, Vlastimil?

> This makes the empty-main path more symmetric with __pcs_replace_full_main(),
> which for a full main sheaf parks the full sheaf in pcs->spare and pulls an
> empty sheaf from the barn. It also matches the documented design more closely:
> 
>   "When both percpu sheaves are found empty during an allocation, an empty
>    sheaf may be replaced with a full one from the per-node barn."

I'm not convinced that this change is worth the extra code;
you probably need to make a stronger argument for why it should be done.

> Signed-off-by: Hao Li <haoli.tcs@gmail.com>
> ---

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: slub: add barn_get_full_sheaf() and refine empty-main sheaf replacement
  2025-12-03  5:46     ` Harry Yoo
@ 2025-12-03 11:15       ` Hao Li
  0 siblings, 0 replies; 95+ messages in thread
From: Hao Li @ 2025-12-03 11:15 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Vlastimil Babka, Suren Baghdasaryan, Liam R. Howlett,
	Christoph Lameter, David Rientjes, Roman Gushchin,
	Uladzislau Rezki, Sidhartha Kumar, linux-mm, linux-kernel, rcu,
	maple-tree, Venkat Rao Bagalkote

On Wed, Dec 03, 2025 at 02:46:22PM +0900, Harry Yoo wrote:
> On Tue, Dec 02, 2025 at 05:00:08PM +0800, Hao Li wrote:
> > Introduce barn_get_full_sheaf(), a helper that detaches a full sheaf from
> > the per-node barn without requiring an empty sheaf in exchange.
> > 
> > Use this helper in __pcs_replace_empty_main() to change how an empty main
> > per-CPU sheaf is handled:
> > 
> >   - If pcs->spare is NULL and pcs->main is empty, first try to obtain a
> >     full sheaf from the barn via barn_get_full_sheaf(). On success, park
> >     the empty main sheaf in pcs->spare and install the full sheaf as the
> >     new pcs->main.
> > 
> >   - If pcs->spare already exists and has objects, keep the existing
> >     behavior of simply swapping pcs->main and pcs->spare.
> > 
> >   - Only when both pcs->main and pcs->spare are empty do we fall back to
> >     barn_replace_empty_sheaf() and trade the empty main sheaf into the
> >     barn in exchange for a full one.
> 
> Hi Hao,
> 
> Yeah, this is a very subtle difference between __pcs_replace_full_main()
> and __pcs_replace_empty_main(), in that the former installs the full main
> sheaf in pcs->spare, while the latter replaces the empty main sheaf with
> a full sheaf from the barn without populating pcs->spare.

Exactly.

> 
> Is it intentional, Vlastimil?
> 
> > This makes the empty-main path more symmetric with __pcs_replace_full_main(),
> > which for a full main sheaf parks the full sheaf in pcs->spare and pulls an
> > empty sheaf from the barn. It also matches the documented design more closely:
> > 
> >   "When both percpu sheaves are found empty during an allocation, an empty
> >    sheaf may be replaced with a full one from the per-node barn."
> 
> I'm not convinced that this change is worth the extra code;
> you probably need to make a stronger argument for why it should be done.

Hi Harry,

Let me explain my intuition in more detail.

Previously, when pcs->main was empty and pcs->spare was NULL, we used
barn_replace_empty_sheaf() to trade the empty main sheaf into the barn
in exchange for a full one. As a result, pcs->main became full, but
pcs->spare remained NULL. Later, when frees filled pcs->main again,
__pcs_replace_full_main() had to call into the barn to obtain an empty
sheaf, because there was still no local spare to use.

With this patch, when pcs->main is empty and pcs->spare is NULL,
__pcs_replace_empty_main() instead uses barn_get_full_sheaf() to pull a
full sheaf from the barn while keeping the now‑empty main sheaf locally
as pcs->spare. The next time pcs->main becomes full,
__pcs_replace_full_main() can simply swap main and spare, with no barn
operations and no need to allocate a new empty sheaf.

In other words, although we still need one barn operation when main
first becomes empty in __pcs_replace_empty_main(), we avoid a future
barn operation on the subsequent “main full” path in
__pcs_replace_full_main().
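
For illustration, here is a minimal userspace toy model of the counting
argument above. It is not the mm/slub.c code: a sheaf is reduced to a
full/empty flag, each barn helper is counted as one barn (per-node lock)
operation, and pcs->main is assumed to simply alternate between going empty
under allocations and going full under frees.

#include <stdbool.h>
#include <stdio.h>

struct pcs_model {
	bool have_spare;	/* pcs->spare != NULL */
	bool spare_full;
	int  barn_ops;
};

/* pcs->main went empty while allocating; refill it */
static void model_replace_empty_main(struct pcs_model *p, bool patched)
{
	if (p->have_spare && p->spare_full) {
		p->spare_full = false;		/* local swap, no barn access */
	} else if (patched && !p->have_spare) {
		p->barn_ops++;			/* barn_get_full_sheaf() */
		p->have_spare = true;		/* empty main parked as spare */
		p->spare_full = false;
	} else {
		p->barn_ops++;			/* barn_replace_empty_sheaf() */
	}
	/* in every branch pcs->main ends up full again */
}

/* pcs->main went full while freeing; make room */
static void model_replace_full_main(struct pcs_model *p)
{
	if (p->have_spare && !p->spare_full) {
		p->spare_full = true;		/* local swap, no barn access */
	} else {
		p->barn_ops++;			/* empty sheaf must come from the barn */
		p->have_spare = true;		/* full main parked as spare */
		p->spare_full = true;
	}
}

int main(void)
{
	for (int patched = 0; patched <= 1; patched++) {
		struct pcs_model p = { 0 };

		for (int i = 0; i < 1000; i++) {
			model_replace_empty_main(&p, patched);
			model_replace_full_main(&p);
		}
		printf("%s: %d barn operations\n",
		       patched ? "patched" : "current", p.barn_ops);
	}
	return 0;
}

With this alternating pattern the model settles into local swaps either way;
the difference is the startup cost: two barn operations for the current code
versus one with the patch, i.e. exactly the single saved barn trip described
above.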

Thanks.

> 
> > Signed-off-by: Hao Li <haoli.tcs@gmail.com>
> > ---
> 
> -- 
> Cheers,
> Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 95+ messages in thread

end of thread, other threads:[~2025-12-03 11:15 UTC | newest]

Thread overview: 95+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-10  8:01 [PATCH v8 00/23] SLUB percpu sheaves Vlastimil Babka
2025-09-10  8:01 ` [PATCH v8 01/23] locking/local_lock: Expose dep_map in local_trylock_t Vlastimil Babka
2025-09-24 16:49   ` Suren Baghdasaryan
2025-09-10  8:01 ` [PATCH v8 02/23] slab: simplify init_kmem_cache_nodes() error handling Vlastimil Babka
2025-09-24 16:52   ` Suren Baghdasaryan
2025-09-10  8:01 ` [PATCH v8 03/23] slab: add opt-in caching layer of percpu sheaves Vlastimil Babka
2025-12-02  8:48   ` [PATCH] slub: add barn_get_full_sheaf() and refine empty-main sheaf Hao Li
2025-12-02  8:55     ` Hao Li
2025-12-02  9:00   ` slub: add barn_get_full_sheaf() and refine empty-main sheaf replacement Hao Li
2025-12-03  5:46     ` Harry Yoo
2025-12-03 11:15       ` Hao Li
2025-09-10  8:01 ` [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations Vlastimil Babka
2025-09-12  0:38   ` Sergey Senozhatsky
2025-09-12  7:03     ` Vlastimil Babka
2025-09-17  8:30   ` Harry Yoo
2025-09-17  9:55     ` Vlastimil Babka
2025-09-17 11:32       ` Harry Yoo
2025-09-17 12:05         ` Vlastimil Babka
2025-09-17 13:07           ` Harry Yoo
2025-09-17 13:21             ` Vlastimil Babka
2025-09-17 13:34               ` Harry Yoo
2025-09-17 14:14                 ` Vlastimil Babka
2025-09-18  8:09                   ` Vlastimil Babka
2025-09-19  6:47                     ` Harry Yoo
2025-09-19  7:02                       ` Vlastimil Babka
2025-09-19  8:59                         ` Harry Yoo
2025-09-25  4:35                     ` Suren Baghdasaryan
2025-09-25  8:52                       ` Harry Yoo
2025-09-25 13:38                         ` Suren Baghdasaryan
2025-09-26 10:08                       ` Vlastimil Babka
2025-09-26 15:41                         ` Suren Baghdasaryan
2025-09-17 11:36       ` Paul E. McKenney
2025-09-17 12:13         ` Vlastimil Babka
2025-10-31 21:32   ` Daniel Gomez
2025-11-03  3:17     ` Harry Yoo
2025-11-05 11:25       ` Vlastimil Babka
2025-11-27 14:00         ` Daniel Gomez
2025-11-27 19:29           ` Suren Baghdasaryan
2025-11-28 11:37             ` [PATCH V1] mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction Harry Yoo
2025-11-28 12:22               ` Harry Yoo
2025-11-28 12:38               ` Daniel Gomez
2025-12-02  9:29               ` Jon Hunter
2025-12-02 10:18                 ` Harry Yoo
2025-11-27 11:38     ` [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations Jon Hunter
2025-11-27 11:50       ` Jon Hunter
2025-11-27 12:33       ` Harry Yoo
2025-11-27 12:48         ` Harry Yoo
2025-11-28  8:57           ` Jon Hunter
2025-12-01  6:55             ` Harry Yoo
2025-11-27 13:18       ` Vlastimil Babka
2025-11-28  8:59         ` Jon Hunter
2025-09-10  8:01 ` [PATCH v8 05/23] slab: sheaf prefilling for guaranteed allocations Vlastimil Babka
2025-09-10  8:01 ` [PATCH v8 06/23] slab: determine barn status racily outside of lock Vlastimil Babka
2025-09-10  8:01 ` [PATCH v8 07/23] slab: skip percpu sheaves for remote object freeing Vlastimil Babka
2025-09-25 16:14   ` Suren Baghdasaryan
2025-09-10  8:01 ` [PATCH v8 08/23] slab: allow NUMA restricted allocations to use percpu sheaves Vlastimil Babka
2025-09-25 16:27   ` Suren Baghdasaryan
2025-09-10  8:01 ` [PATCH v8 09/23] maple_tree: remove redundant __GFP_NOWARN Vlastimil Babka
2025-09-10  8:01 ` [PATCH v8 10/23] tools/testing/vma: clean up stubs in vma_internal.h Vlastimil Babka
2025-09-10  8:01 ` [PATCH v8 11/23] maple_tree: Drop bulk insert support Vlastimil Babka
2025-09-25 16:38   ` Suren Baghdasaryan
2025-09-10  8:01 ` [PATCH v8 12/23] tools/testing/vma: Implement vm_refcnt reset Vlastimil Babka
2025-09-25 16:38   ` Suren Baghdasaryan
2025-09-10  8:01 ` [PATCH v8 13/23] tools/testing: Add support for changes to slab for sheaves Vlastimil Babka
2025-09-26 23:28   ` Suren Baghdasaryan
2025-09-10  8:01 ` [PATCH v8 14/23] mm, vma: use percpu sheaves for vm_area_struct cache Vlastimil Babka
2025-09-10  8:01 ` [PATCH v8 15/23] maple_tree: use percpu sheaves for maple_node_cache Vlastimil Babka
2025-09-12  2:20   ` Liam R. Howlett
2025-10-16 15:16   ` D, Suneeth
2025-10-16 16:15     ` Vlastimil Babka
2025-10-17 18:26       ` D, Suneeth
2025-09-10  8:01 ` [PATCH v8 16/23] tools/testing: include maple-shim.c in maple.c Vlastimil Babka
2025-09-26 23:45   ` Suren Baghdasaryan
2025-09-10  8:01 ` [PATCH v8 17/23] testing/radix-tree/maple: Hack around kfree_rcu not existing Vlastimil Babka
2025-09-26 23:53   ` Suren Baghdasaryan
2025-09-10  8:01 ` [PATCH v8 18/23] maple_tree: Use kfree_rcu in ma_free_rcu Vlastimil Babka
2025-09-17 11:46   ` Harry Yoo
2025-09-27  0:05     ` Suren Baghdasaryan
2025-09-10  8:01 ` [PATCH v8 19/23] maple_tree: Replace mt_free_one() with kfree() Vlastimil Babka
2025-09-27  0:06   ` Suren Baghdasaryan
2025-09-10  8:01 ` [PATCH v8 20/23] tools/testing: Add support for prefilled slab sheafs Vlastimil Babka
2025-09-27  0:28   ` Suren Baghdasaryan
2025-09-10  8:01 ` [PATCH v8 21/23] maple_tree: Prefilled sheaf conversion and testing Vlastimil Babka
2025-09-27  1:08   ` Suren Baghdasaryan
2025-09-29  7:30     ` Vlastimil Babka
2025-09-29 16:51       ` Liam R. Howlett
2025-09-10  8:01 ` [PATCH v8 22/23] maple_tree: Add single node allocation support to maple state Vlastimil Babka
2025-09-27  1:17   ` Suren Baghdasaryan
2025-09-29  7:39     ` Vlastimil Babka
2025-09-10  8:01 ` [PATCH v8 23/23] maple_tree: Convert forking to use the sheaf interface Vlastimil Babka
2025-10-07  6:34 ` [PATCH v8 00/23] SLUB percpu sheaves Christoph Hellwig
2025-10-07  8:03   ` Vlastimil Babka
2025-10-08  6:04     ` Christoph Hellwig
2025-10-15  8:32       ` Vlastimil Babka
2025-10-22  6:47         ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox